library(tidyverse)
## Warning: package 'ggplot2' was built under R version 4.3.3
## Warning: package 'stringr' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(e1071)
## Warning: package 'e1071' was built under R version 4.3.2
library(knitr)
## Warning: package 'knitr' was built under R version 4.3.2
library(stats)
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(mice)
## Warning: package 'mice' was built under R version 4.3.2
## 
## Attaching package: 'mice'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
library(GGally)
## Warning: package 'GGally' was built under R version 4.3.3
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(caret)
## Warning: package 'caret' was built under R version 4.3.2
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

Pick one of the quantitative independent variables from the training data set (train.csv), and define that variable as X. Make sure this variable is skewed to the right! Pick the dependent variable and define it as Y.

1- Probability.

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 3d quartile of the X variable, and the small letter “y” is estimated as the 2d quartile of the Y variable. Interpret the meaning of all probabilities. In addition, make a table of counts as shown below.

a. P(X>x | Y>y)
b. P(X>x, Y>y)
c. P(X<x | Y>y)

x/y              <=2d quartile   >2d quartile   Total
<=3d quartile
>3d quartile
Total

Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 3d quartile for X, and let B be the new variable counting those observations above the 2d quartile for Y. Does P(A|B)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.

Answer Question #1

First, load the data.

train <- read.csv("C:\\Users\\shaya\\OneDrive\\Documents\\repos\\Data605\\Final_Project\\house-prices-advanced-regression-techniques\\train.csv")
eval <- read.csv("C:\\Users\\shaya\\OneDrive\\Documents\\repos\\Data605\\Final_Project\\house-prices-advanced-regression-techniques\\test.csv")

Since the variable is required to be skewed to the right, I will plot the density of the numeric columns to determine an appropriate one to use.

# Plot the density of numeric columns
train |>
  keep(is.numeric) |>
  gather(key = "variable", value = "value") |>  
  ggplot(aes(x = value)) + 
  geom_histogram(aes(y = after_stat(density)), bins = 20, fill = '#4E79A7', color = 'black') + 
  stat_density(geom = "line", color = "red") +
  facet_wrap(~ variable, scales = 'free') +
  theme(strip.text = element_text(size = 5))
## Warning: Removed 348 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 348 rows containing non-finite outside the scale range
## (`stat_density()`).

Based on these plots, I will select the TotalBsmtSF variable as X and the SalePrice variable as Y. To ensure that the TotalBsmtSF variable is skewed to the right, I will calculate the skewness using the e1071 package.

# Store the result under a name that does not shadow e1071::skewness()
x_skewness <- skewness(train$TotalBsmtSF)
x_skewness
## [1] 1.521124

Our X variable has a skewness of 1.52. As a common rule of thumb, a skewness above 1 indicates a highly right-skewed distribution.
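For intuition, skewness can be sketched as the standardized third central moment. The helper below (`moment_skewness`, a hypothetical name) is a minimal base-R version under that definition, run on made-up vectors rather than train.csv. e1071::skewness() offers several estimator types with small-sample adjustments, so its value may differ slightly from this sketch.

```r
# Hypothetical helper: moment-based skewness, m3 / m2^(3/2)
moment_skewness <- function(v) {
  v <- v[!is.na(v)]
  centered <- v - mean(v)
  m2 <- mean(centered^2)  # second central moment
  m3 <- mean(centered^3)  # third central moment
  m3 / m2^(3 / 2)
}

# Illustrative data (not from train.csv)
right_skewed <- c(1, 1, 2, 2, 3, 3, 4, 20)
symmetric <- c(-2, -1, 0, 1, 2)

moment_skewness(right_skewed)  # positive: long right tail
moment_skewness(symmetric)     # 0: no skew
```

A positive value means the right tail is longer than the left, which is the property the assignment asks for.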

Next, I will calculate the probabilities a, b, and c as requested.

# Define X and Y
X <- train$TotalBsmtSF
Y <- train$SalePrice

# Find the quartiles
x <- quantile(X, 0.75)  # 3rd quartile of X
y <- quantile(Y, 0.50)  # 2nd quartile of Y

# Indicator variables: 1 if the observation is above the cutoff, 0 otherwise
A <- as.numeric(X > x)
B <- as.numeric(Y > y)

# Calculate probabilities
# a. P(X > x | Y > y)
prob_a <- sum(X > x & Y > y) / sum(Y > y)
# b. P(X > x, Y > y)
prob_b <- sum(X > x & Y > y) / nrow(train)
# c. P(X < x | Y > y)
prob_c <- sum(X < x & Y > y) / sum(Y > y)

# Create a table of counts
counts_table <- table(X > x, Y > y)

# Rename columns and rows
colnames(counts_table) <- c("<=2d quartile", ">2d quartile")
rownames(counts_table) <- c("<=3d quartile", ">3d quartile")

# Add total row and column
counts_table <- addmargins(counts_table, margin = 1)
counts_table <- addmargins(counts_table, margin = 2)

# Print the probabilities
cat("Probabilities:\n")
## Probabilities:
cat("a. P(X>x | Y>y) =", prob_a, "\n")
## a. P(X>x | Y>y) = 0.4519231
cat("b. P(X>x, Y>y) =", prob_b, "\n")
## b. P(X>x, Y>y) = 0.2253425
cat("c. P(X<x | Y>y) =", prob_c, "\n")
## c. P(X<x | Y>y) = 0.5480769
# Print the table of counts with totals
cat("\nTable of counts:\n")
## 
## Table of counts:
print(counts_table)
##                
##                 <=2d quartile >2d quartile  Sum
##   <=3d quartile           696          399 1095
##   >3d quartile             36          329  365
##   Sum                     732          728 1460

The probabilities are as follows:

a. P(X>x | Y>y) = 0.4519231. Among houses whose SalePrice is above the 2nd quartile (163000), about 45.19% (329 of 728) have a TotalBsmtSF above the 3rd quartile (1298.25).
b. P(X>x, Y>y) = 0.2253425. About 22.53% of all houses (329 of 1460) have both a TotalBsmtSF above the 3rd quartile (1298.25) and a SalePrice above the 2nd quartile (163000).
c. P(X<x | Y>y) = 0.5480769. Among houses whose SalePrice is above the 2nd quartile (163000), about 54.81% (399 of 728) have a TotalBsmtSF below the 3rd quartile (1298.25).
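As a cross-check, the same three probabilities can be recovered from the table of counts alone. This is a minimal sketch that hard-codes the four cell counts printed above; the object names are illustrative. The "<=Q3" row stands in for X < x here because TotalBsmtSF is an integer and the cutoff is 1298.25, so no observation equals x.

```r
# Counts copied from the table above (rows: X groups, columns: Y groups)
counts <- matrix(c(696, 399,
                   36, 329),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(X = c("<=Q3", ">Q3"), Y = c("<=Q2", ">Q2")))

n_total <- sum(counts)            # all houses
n_y_high <- sum(counts[, ">Q2"])  # houses with Y > y

p_a <- counts[">Q3", ">Q2"] / n_y_high   # P(X>x | Y>y)
p_b <- counts[">Q3", ">Q2"] / n_total    # P(X>x, Y>y)
p_c <- counts["<=Q3", ">Q2"] / n_y_high  # P(X<x | Y>y)

round(c(a = p_a, b = p_b, c = p_c), 7)
```

The conditional probabilities divide by the 728 houses with Y > y, while the joint probability divides by all 1460 houses.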

Probabilities a and c are both conditioned on Y > y and sum to 1, so P(X=x | Y>y) must be 0. This is expected: TotalBsmtSF takes integer values, while the 3rd quartile is 1298.25, so no observation can equal x exactly. To confirm, I calculate this probability (d) and check that the probabilities a, c, and d sum to 1.

prob_d <- sum(X == x & Y > y) / sum(Y > y)
prob_d
## [1] 0
# Check that prob_a + prob_c + prob_d = 1 (all.equal() allows for floating-point rounding)
isTRUE(all.equal(prob_a + prob_c + prob_d, 1))
## [1] TRUE

Since they sum to 1, the three conditional probabilities are consistent with one another.

Next, I will check whether splitting the training data in this fashion makes the two variables independent. I will create new variables A and B based on the conditions provided. A and B are independent only if P(A|B) = P(A), which is equivalent to P(A and B) = P(A)P(B), so I will check that product rule mathematically and then run a Chi-Square test for association.

# Create new variables A and B
A <- as.numeric(X > x)
B <- as.numeric(Y > y)


# Calculate P(A and B), P(A|B), P(A), and P(B)
P_A_and_B <- sum(A == 1 & B == 1) / nrow(train)
P_A_given_B <- sum(A == 1 & B == 1) / sum(B == 1)
P_A <- sum(A == 1) / nrow(train)
P_B <- sum(B == 1) / nrow(train)

# Check if P(A and B) = P(A)P(B), allowing for floating-point rounding
isTRUE(all.equal(P_A_and_B, P_A * P_B))
## [1] FALSE

The result is FALSE: P(A and B) is about 0.225, while P(A)P(B) is about 0.125. Equivalently, P(A|B) is about 0.452, well above P(A) = 0.25. A and B are therefore not independent. To confirm this, I will perform a Chi-Square test for association.

# Create a contingency table   
contingency_table <- table(A, B)

# Perform a Chi-Square test for association
chisq.test(contingency_table)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  contingency_table
## X-squared = 313.61, df = 1, p-value < 2.2e-16

The p-value is extremely low (< 2.2e-16), so we reject the null hypothesis that A and B are independent. Having a TotalBsmtSF above the 3rd quartile and having a SalePrice above the median are associated.
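The independence check and the test statistic can both be reproduced from the table of counts. The sketch below hard-codes the four cell counts from Question 1 (object names are illustrative), compares P(A and B) with P(A)P(B), and then rebuilds the Yates-corrected Pearson statistic from the expected counts under independence.

```r
# Rows: A = 0, A = 1; columns: B = 0, B = 1 (counts from the table in Question 1)
counts <- matrix(c(696, 399,
                   36, 329), nrow = 2, byrow = TRUE)
n <- sum(counts)

p_a <- sum(counts[2, ]) / n   # P(A)
p_b <- sum(counts[, 2]) / n   # P(B)
p_ab <- counts[2, 2] / n      # P(A and B)

# Independence would require these two numbers to match
c(observed = p_ab, if_independent = p_a * p_b)

# Expected counts under independence: row total * column total / n
expected <- outer(rowSums(counts), colSums(counts)) / n

# Pearson statistic with Yates' continuity correction (2x2 table)
chi_sq <- sum((abs(counts - expected) - 0.5)^2 / expected)
chi_sq

# Built-in test on the same table, for comparison
unname(chisq.test(counts)$statistic)
```

The manual statistic should agree with the X-squared value of 313.61 reported above. Every cell differs from its expected count by 147 observations, which is why the statistic is so large.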

2- Descriptive and Inferential Statistics.

Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot of X and Y. Provide a 95% CI for the difference in the mean of the variables. Derive a correlation matrix for two of the quantitative variables you selected. Test the hypothesis that the correlation between these variables is 0 and provide a 99% confidence interval. Discuss the meaning of your analysis.

Answer Question #2

Descriptive Statistics

First, I will provide univariate descriptive statistics and appropriate plots for the training data set. This is usually the first step in all my data analysis projects. I typically use the str() and summary() functions to get an overview of the data.

# Display the structure of the data
str(train)
## 'data.frame':    1460 obs. of  81 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : chr  "RL" "RL" "RL" "RL" ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : chr  "Pave" "Pave" "Pave" "Pave" ...
##  $ Alley        : chr  NA NA NA NA ...
##  $ LotShape     : chr  "Reg" "Reg" "IR1" "IR1" ...
##  $ LandContour  : chr  "Lvl" "Lvl" "Lvl" "Lvl" ...
##  $ Utilities    : chr  "AllPub" "AllPub" "AllPub" "AllPub" ...
##  $ LotConfig    : chr  "Inside" "FR2" "Inside" "Corner" ...
##  $ LandSlope    : chr  "Gtl" "Gtl" "Gtl" "Gtl" ...
##  $ Neighborhood : chr  "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
##  $ Condition1   : chr  "Norm" "Feedr" "Norm" "Norm" ...
##  $ Condition2   : chr  "Norm" "Norm" "Norm" "Norm" ...
##  $ BldgType     : chr  "1Fam" "1Fam" "1Fam" "1Fam" ...
##  $ HouseStyle   : chr  "2Story" "1Story" "2Story" "2Story" ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : chr  "Gable" "Gable" "Gable" "Gable" ...
##  $ RoofMatl     : chr  "CompShg" "CompShg" "CompShg" "CompShg" ...
##  $ Exterior1st  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
##  $ Exterior2nd  : chr  "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
##  $ MasVnrType   : chr  "BrkFace" "None" "BrkFace" "None" ...
##  $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : chr  "Gd" "TA" "Gd" "TA" ...
##  $ ExterCond    : chr  "TA" "TA" "TA" "TA" ...
##  $ Foundation   : chr  "PConc" "CBlock" "PConc" "BrkTil" ...
##  $ BsmtQual     : chr  "Gd" "Gd" "Gd" "TA" ...
##  $ BsmtCond     : chr  "TA" "TA" "TA" "Gd" ...
##  $ BsmtExposure : chr  "No" "Gd" "Mn" "No" ...
##  $ BsmtFinType1 : chr  "GLQ" "ALQ" "GLQ" "ALQ" ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinType2 : chr  "Unf" "Unf" "Unf" "Unf" ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : chr  "GasA" "GasA" "GasA" "GasA" ...
##  $ HeatingQC    : chr  "Ex" "Ex" "Ex" "Gd" ...
##  $ CentralAir   : chr  "Y" "Y" "Y" "Y" ...
##  $ Electrical   : chr  "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : chr  "Gd" "TA" "Gd" "Gd" ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : chr  "Typ" "Typ" "Typ" "Typ" ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : chr  NA "TA" "TA" "Gd" ...
##  $ GarageType   : chr  "Attchd" "Attchd" "Attchd" "Detchd" ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : chr  "RFn" "RFn" "RFn" "Unf" ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : chr  "TA" "TA" "TA" "TA" ...
##  $ GarageCond   : chr  "TA" "TA" "TA" "TA" ...
##  $ PavedDrive   : chr  "Y" "Y" "Y" "Y" ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : chr  NA NA NA NA ...
##  $ Fence        : chr  NA NA NA NA ...
##  $ MiscFeature  : chr  NA NA NA NA ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : chr  "WD" "WD" "WD" "WD" ...
##  $ SaleCondition: chr  "Normal" "Normal" "Normal" "Abnorml" ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...

The above output gives us the structure of the data set, including the number of observations and variables, the names of the variables, and the data types of the variables.

# Descriptive statistics
summary(train)
##        Id           MSSubClass      MSZoning          LotFrontage    
##  Min.   :   1.0   Min.   : 20.0   Length:1460        Min.   : 21.00  
##  1st Qu.: 365.8   1st Qu.: 20.0   Class :character   1st Qu.: 59.00  
##  Median : 730.5   Median : 50.0   Mode  :character   Median : 69.00  
##  Mean   : 730.5   Mean   : 56.9                      Mean   : 70.05  
##  3rd Qu.:1095.2   3rd Qu.: 70.0                      3rd Qu.: 80.00  
##  Max.   :1460.0   Max.   :190.0                      Max.   :313.00  
##                                                      NA's   :259     
##     LotArea          Street             Alley             LotShape        
##  Min.   :  1300   Length:1460        Length:1460        Length:1460       
##  1st Qu.:  7554   Class :character   Class :character   Class :character  
##  Median :  9478   Mode  :character   Mode  :character   Mode  :character  
##  Mean   : 10517                                                           
##  3rd Qu.: 11602                                                           
##  Max.   :215245                                                           
##                                                                           
##  LandContour         Utilities          LotConfig          LandSlope        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  Neighborhood        Condition1         Condition2          BldgType        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##   HouseStyle         OverallQual      OverallCond      YearBuilt   
##  Length:1460        Min.   : 1.000   Min.   :1.000   Min.   :1872  
##  Class :character   1st Qu.: 5.000   1st Qu.:5.000   1st Qu.:1954  
##  Mode  :character   Median : 6.000   Median :5.000   Median :1973  
##                     Mean   : 6.099   Mean   :5.575   Mean   :1971  
##                     3rd Qu.: 7.000   3rd Qu.:6.000   3rd Qu.:2000  
##                     Max.   :10.000   Max.   :9.000   Max.   :2010  
##                                                                    
##   YearRemodAdd   RoofStyle           RoofMatl         Exterior1st       
##  Min.   :1950   Length:1460        Length:1460        Length:1460       
##  1st Qu.:1967   Class :character   Class :character   Class :character  
##  Median :1994   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :1985                                                           
##  3rd Qu.:2004                                                           
##  Max.   :2010                                                           
##                                                                         
##  Exterior2nd         MasVnrType          MasVnrArea      ExterQual        
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median :   0.0   Mode  :character  
##                                        Mean   : 103.7                     
##                                        3rd Qu.: 166.0                     
##                                        Max.   :1600.0                     
##                                        NA's   :8                          
##   ExterCond          Foundation          BsmtQual           BsmtCond        
##  Length:1460        Length:1460        Length:1460        Length:1460       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##                                                                             
##  BsmtExposure       BsmtFinType1         BsmtFinSF1     BsmtFinType2      
##  Length:1460        Length:1460        Min.   :   0.0   Length:1460       
##  Class :character   Class :character   1st Qu.:   0.0   Class :character  
##  Mode  :character   Mode  :character   Median : 383.5   Mode  :character  
##                                        Mean   : 443.6                     
##                                        3rd Qu.: 712.2                     
##                                        Max.   :5644.0                     
##                                                                           
##    BsmtFinSF2        BsmtUnfSF       TotalBsmtSF       Heating         
##  Min.   :   0.00   Min.   :   0.0   Min.   :   0.0   Length:1460       
##  1st Qu.:   0.00   1st Qu.: 223.0   1st Qu.: 795.8   Class :character  
##  Median :   0.00   Median : 477.5   Median : 991.5   Mode  :character  
##  Mean   :  46.55   Mean   : 567.2   Mean   :1057.4                     
##  3rd Qu.:   0.00   3rd Qu.: 808.0   3rd Qu.:1298.2                     
##  Max.   :1474.00   Max.   :2336.0   Max.   :6110.0                     
##                                                                        
##   HeatingQC          CentralAir         Electrical          X1stFlrSF   
##  Length:1460        Length:1460        Length:1460        Min.   : 334  
##  Class :character   Class :character   Class :character   1st Qu.: 882  
##  Mode  :character   Mode  :character   Mode  :character   Median :1087  
##                                                           Mean   :1163  
##                                                           3rd Qu.:1391  
##                                                           Max.   :4692  
##                                                                         
##    X2ndFlrSF     LowQualFinSF       GrLivArea     BsmtFullBath   
##  Min.   :   0   Min.   :  0.000   Min.   : 334   Min.   :0.0000  
##  1st Qu.:   0   1st Qu.:  0.000   1st Qu.:1130   1st Qu.:0.0000  
##  Median :   0   Median :  0.000   Median :1464   Median :0.0000  
##  Mean   : 347   Mean   :  5.845   Mean   :1515   Mean   :0.4253  
##  3rd Qu.: 728   3rd Qu.:  0.000   3rd Qu.:1777   3rd Qu.:1.0000  
##  Max.   :2065   Max.   :572.000   Max.   :5642   Max.   :3.0000  
##                                                                  
##   BsmtHalfBath        FullBath        HalfBath       BedroomAbvGr  
##  Min.   :0.00000   Min.   :0.000   Min.   :0.0000   Min.   :0.000  
##  1st Qu.:0.00000   1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:2.000  
##  Median :0.00000   Median :2.000   Median :0.0000   Median :3.000  
##  Mean   :0.05753   Mean   :1.565   Mean   :0.3829   Mean   :2.866  
##  3rd Qu.:0.00000   3rd Qu.:2.000   3rd Qu.:1.0000   3rd Qu.:3.000  
##  Max.   :2.00000   Max.   :3.000   Max.   :2.0000   Max.   :8.000  
##                                                                    
##   KitchenAbvGr   KitchenQual         TotRmsAbvGrd     Functional       
##  Min.   :0.000   Length:1460        Min.   : 2.000   Length:1460       
##  1st Qu.:1.000   Class :character   1st Qu.: 5.000   Class :character  
##  Median :1.000   Mode  :character   Median : 6.000   Mode  :character  
##  Mean   :1.047                      Mean   : 6.518                     
##  3rd Qu.:1.000                      3rd Qu.: 7.000                     
##  Max.   :3.000                      Max.   :14.000                     
##                                                                        
##    Fireplaces    FireplaceQu         GarageType         GarageYrBlt  
##  Min.   :0.000   Length:1460        Length:1460        Min.   :1900  
##  1st Qu.:0.000   Class :character   Class :character   1st Qu.:1961  
##  Median :1.000   Mode  :character   Mode  :character   Median :1980  
##  Mean   :0.613                                         Mean   :1979  
##  3rd Qu.:1.000                                         3rd Qu.:2002  
##  Max.   :3.000                                         Max.   :2010  
##                                                        NA's   :81    
##  GarageFinish         GarageCars      GarageArea      GarageQual       
##  Length:1460        Min.   :0.000   Min.   :   0.0   Length:1460       
##  Class :character   1st Qu.:1.000   1st Qu.: 334.5   Class :character  
##  Mode  :character   Median :2.000   Median : 480.0   Mode  :character  
##                     Mean   :1.767   Mean   : 473.0                     
##                     3rd Qu.:2.000   3rd Qu.: 576.0                     
##                     Max.   :4.000   Max.   :1418.0                     
##                                                                        
##   GarageCond         PavedDrive          WoodDeckSF      OpenPorchSF    
##  Length:1460        Length:1460        Min.   :  0.00   Min.   :  0.00  
##  Class :character   Class :character   1st Qu.:  0.00   1st Qu.:  0.00  
##  Mode  :character   Mode  :character   Median :  0.00   Median : 25.00  
##                                        Mean   : 94.24   Mean   : 46.66  
##                                        3rd Qu.:168.00   3rd Qu.: 68.00  
##                                        Max.   :857.00   Max.   :547.00  
##                                                                         
##  EnclosedPorch      X3SsnPorch      ScreenPorch        PoolArea      
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.00   Min.   :  0.000  
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000  
##  Median :  0.00   Median :  0.00   Median :  0.00   Median :  0.000  
##  Mean   : 21.95   Mean   :  3.41   Mean   : 15.06   Mean   :  2.759  
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000  
##  Max.   :552.00   Max.   :508.00   Max.   :480.00   Max.   :738.000  
##                                                                      
##     PoolQC             Fence           MiscFeature           MiscVal        
##  Length:1460        Length:1460        Length:1460        Min.   :    0.00  
##  Class :character   Class :character   Class :character   1st Qu.:    0.00  
##  Mode  :character   Mode  :character   Mode  :character   Median :    0.00  
##                                                           Mean   :   43.49  
##                                                           3rd Qu.:    0.00  
##                                                           Max.   :15500.00  
##                                                                             
##      MoSold           YrSold       SaleType         SaleCondition     
##  Min.   : 1.000   Min.   :2006   Length:1460        Length:1460       
##  1st Qu.: 5.000   1st Qu.:2007   Class :character   Class :character  
##  Median : 6.000   Median :2008   Mode  :character   Mode  :character  
##  Mean   : 6.322   Mean   :2008                                        
##  3rd Qu.: 8.000   3rd Qu.:2009                                        
##  Max.   :12.000   Max.   :2010                                        
##                                                                       
##    SalePrice     
##  Min.   : 34900  
##  1st Qu.:129975  
##  Median :163000  
##  Mean   :180921  
##  3rd Qu.:214000  
##  Max.   :755000  
## 

The above output gives us the minimum, quartiles, median, mean, and maximum of each numeric column, along with the number of missing values where there are any. For the character columns it shows only the length, class, and mode, not the category counts. We will further analyze these distributions using descriptive plots.
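summary() does not report standard deviations or category counts. A minimal sketch of how both could be obtained in base R follows, using a small made-up data frame (`houses` is hypothetical and only stands in for train):

```r
# Illustrative data frame (a stand-in for train, not real data)
houses <- data.frame(
  area = c(850, 1200, 950, 2100),
  price = c(120000, 180000, 135000, 310000),
  zone = c("RL", "RM", "RL", "RL")
)

# Standard deviation of every numeric column
numeric_cols <- houses[sapply(houses, is.numeric)]
col_sd <- sapply(numeric_cols, sd, na.rm = TRUE)
col_sd

# Counts per category for one character column (NA shown if present)
zone_counts <- table(houses$zone, useNA = "ifany")
zone_counts
```

The same two calls could be applied to train directly, for example with `train$MSZoning` in place of `houses$zone`.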

Density Plots

Earlier, I plotted the density of the numeric columns to determine an appropriate variable to use as X. I will repeat that here to help get an overview of the data as a whole.

# Plot the density of numeric columns
train |>
  keep(is.numeric) |>
  gather(key = "variable", value = "value") |>  
  ggplot(aes(x = value)) + 
  geom_histogram(aes(y = after_stat(density)), bins = 20, fill = '#4E79A7', color = 'black') + 
  stat_density(geom = "line", color = "red") +
  facet_wrap(~ variable, scales = 'free') +
  theme(strip.text = element_text(size = 5))
## Warning: Removed 348 rows containing non-finite outside the scale range
## (`stat_bin()`).
## Warning: Removed 348 rows containing non-finite outside the scale range
## (`stat_density()`).

We see many right-skewed variables. This is logical since many of these variables measure the size of some part of the house: sizes cannot fall below zero, and a few very large houses stretch the right tail. The Id column is only a row identifier; it carries no information and should not be treated as a numeric variable. The year columns (such as YearBuilt and YrSold) are stored as integers but may be better treated as factors.

Boxplots

Similar to above, I will create boxplots for the numeric columns to get a better understanding of the data.

# Create boxplots
train |>
  keep(is.numeric) |>
  gather(key = "variable", value = "value") |>  
  ggplot(aes(x = variable, y = value)) + 
  geom_boxplot() +
  facet_wrap(~ variable, scales = 'free') +
  theme(strip.text = element_text(size = 5))
## Warning: Removed 348 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

The above plots show us the distribution of the numeric columns in the data set. Many of the variables have a large number of outliers, which is common in real-world data sets.

Distribution of the Categorical Variables

Next, I will plot the distribution of the categorical variables in the data set.

# Plot the distribution of categorical variables
train |>
  select_if(~ is.character(.)) |>
  gather(key = "variable", value = "value") |>
  ggplot(aes(x = value)) +
  geom_bar(fill = "#4E79A7") +
  facet_wrap(~ variable, scales = 'free') +
  theme(strip.text = element_text(size = 5)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

We see all different types of distributions among the categorical variables. Many of the variables seem to have one predominant category, while others have a more even distribution.

Missing Data

We now check for missing data and visualize the values by plotting the missing data count per column in descending order.

missing_data <- train %>%
  select_if(~ any(is.na(.))) %>%
  summarise_all(~ sum(is.na(.))) %>%
  gather(key = "variable", value = "missing_count") %>%
  arrange(desc(missing_count))

# Plot the missing data pattern
missing_data %>%
  ggplot(aes(x = reorder(variable, missing_count), y = missing_count)) +
  geom_bar(stat = "identity", fill = "#4E79A7") +
  labs(title = "Missing Data Pattern", x = "Variable", y = "Missing Count") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  coord_flip()

As the plot shows, the columns vary widely in how much data is missing. These columns will need to be treated before we can proceed with the analysis. Luckily, the columns I chose as the X and Y variables do not have any missing data.
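The same per-column tally can be sketched in base R with colSums(is.na(...)). The data frame below is made up for illustration (`houses` and its column names are hypothetical), so the counts are not those of train.

```r
# Illustrative data frame with some missing values (not real data)
houses <- data.frame(
  lot_frontage = c(65, NA, 68, NA),
  alley = c(NA, NA, "Grvl", NA),
  sale_price = c(208500, 181500, 223500, 140000)
)

# Number of NAs per column, keeping only columns with at least one NA
missing_count <- colSums(is.na(houses))
missing_count <- sort(missing_count[missing_count > 0], decreasing = TRUE)
missing_count

# Share of rows missing in each of those columns
round(missing_count / nrow(houses), 2)
```

Looking at the share of rows rather than the raw count can help when deciding whether a column should be imputed or dropped.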

Pairplot

Next, I will create a pairplot of the numeric variables to visualize the relationships between the variables. Since there are too many variables to plot and still be able to interpret the plot, I will only plot 6 numeric variables with the highest correlation with the SalePrice variable.

# Find the 6 numeric variables with the highest correlation with SalePrice
# (pairwise complete observations, so columns with NAs are not silently dropped)
correlation_matrix <- cor(train[, sapply(train, is.numeric)], use = "pairwise.complete.obs")
correlation_with_saleprice <- correlation_matrix["SalePrice", ]
# Position 1 is SalePrice itself (correlation 1), so take positions 2 to 7
top_correlated_variables <- names(sort(correlation_with_saleprice, decreasing = TRUE)[2:7])

# Create a pairplot of the top correlated variables plus the SalePrice variable
train |>
  dplyr::select(all_of(c(top_correlated_variables, "SalePrice"))) |>
  ggpairs()

The pairplot shows the relationships among SalePrice and the 6 numeric variables most highly correlated with it. Each variable has a positive, roughly linear relationship with SalePrice. Several of the predictors are also strongly related to each other. This is expected, since many of them measure the size of the house, but it can lead to multicollinearity in a model.
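One caveat when ranking correlations this way: by default cor() returns NA for any column that contains missing values, and sort() drops NA entries, so such columns silently fall out of the ranking. The sketch below shows the behavior on made-up vectors (all names are illustrative) and how `use = "pairwise.complete.obs"` keeps those columns in play.

```r
# Illustrative data: column b has one missing value
a <- c(1, 2, 3, 4, 5)
b <- c(2, 4, NA, 8, 10)
target <- c(1.1, 2.3, 2.9, 4.2, 5.1)
df <- data.frame(a, b, target)

# Default behavior: the correlation involving b is NA
default_cor <- cor(df)["target", ]
default_cor

# sort() drops the NA, so b disappears from the ranking
names(sort(default_cor, decreasing = TRUE))

# Pairwise complete observations: b is computed from its 4 complete pairs
pairwise_cor <- cor(df, use = "pairwise.complete.obs")["target", ]
pairwise_cor
```

Whether pairwise deletion is appropriate depends on how much data is missing, so heavily incomplete columns deserve a separate look.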

Scatterplot of X and Y

Next, I will plot the scatterplot of X, TotalBsmtSF, and Y, SalePrice.

# Plot the scatterplot of X and Y with a linear regression line
ggplot(train, aes(x = TotalBsmtSF, y = SalePrice)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Scatterplot of TotalBsmtSF and SalePrice", x = "TotalBsmtSF", y = "SalePrice")
## `geom_smooth()` using formula = 'y ~ x'

We see a clear positive linear relationship between TotalBsmtSF and SalePrice. The regression line drifts further from the data points as TotalBsmtSF increases, where observations are sparse. This illustrates the risk of relying on the fitted line near or beyond the edge of the observed data.

CI for the Difference in the Mean of the Variables

I am unsure what exactly is meant by the difference in the mean of the variables. I will simply calculate the difference in means between TotalBsmtSF and SalePrice and calculate the 95% confidence interval for this difference. But since these variables are on different scales, I am not sure how meaningful this analysis will be.

Afterwards, I will calculate the difference in the mean of TotalBsmtSF between houses above and below the median SalePrice, along with its 95% confidence interval. This gives a better understanding of the relationship between the two variables and seems like a more likely interpretation of the assignment.

# Calculate the 95% confidence interval for the difference in the mean of the variables
X_mean <- mean(X)
Y_mean <- mean(Y)
X_sd <- sd(X)
Y_sd <- sd(Y)
n <- length(X)
m <- length(Y)
# Calculate the standard error
SE <- sqrt((X_sd^2 / n) + (Y_sd^2 / m))
# Calculate the margin of error
ME <- qt(0.975, df = n + m - 2) * SE
# Calculate the confidence interval
CI <- c((X_mean - Y_mean) - ME, (X_mean - Y_mean) + ME)
CI
## [1] -183940.5 -175787.0

The 95% Confidence Interval for the difference in means is between -183940.5 and -175787.0. This means that we are 95% confident that the difference in means between TotalBsmtSF and SalePrice is between -183940.5 and -175787.0.

I will double check my work using the t-test function in R.

# Perform t-test
t_test_result <- t.test(X, Y)
# Calculate the confidence interval for the difference in means
ci <- t_test_result$conf.int
# Print the confidence interval
ci
## [1] -183942.2 -175785.3
## attr(,"conf.level")
## [1] 0.95

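The t-test function gives nearly the same interval. The small difference arises because t.test uses the Welch-Satterthwaite degrees of freedom by default, whereas my manual calculation used n + m - 2.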
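
The Welch interval can also be computed by hand. The following is a minimal sketch on simulated data, not the housing variables: the helper welch_ci and the vectors a and b are illustrative names. It uses the Welch-Satterthwaite degrees of freedom, which is what t.test applies by default (var.equal = FALSE), so the two results agree.

```r
# Welch confidence interval for a difference in means, computed by hand.
# `welch_ci`, `a`, and `b` are illustrative names, not objects from the analysis.
welch_ci <- function(x, y, conf_level = 0.95) {
  vx <- var(x) / length(x)  # squared standard error of mean(x)
  vy <- var(y) / length(y)  # squared standard error of mean(y)
  se <- sqrt(vx + vy)
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))
  me <- qt(1 - (1 - conf_level) / 2, df) * se
  c(lower = mean(x) - mean(y) - me, upper = mean(x) - mean(y) + me)
}

set.seed(1)
a <- rnorm(200, mean = 1000, sd = 400)      # simulated small-scale variable
b <- rnorm(200, mean = 180000, sd = 80000)  # simulated large-scale variable

manual_ci <- welch_ci(a, b)
builtin_ci <- as.numeric(t.test(a, b)$conf.int)
manual_ci
builtin_ci
```

When one variable has a much larger variance than the other, as here, the Welch degrees of freedom are close to the size of one sample minus one and not to n + m - 2, which slightly widens the interval.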

As mentioned above, I will calculate the difference in means of the TotalBsmtSF when above and below the median of SalePrice.

# Calculate the difference in means of TotalBsmtSF when above and below the median of SalePrice
median_saleprice <- median(Y)
# Calculate the 95% confidence interval for the difference in the mean of TotalBsmtSF when above and below the median of SalePrice
X_below_median <- X[Y <= median_saleprice]
X_above_median <- X[Y > median_saleprice]
n_below <- length(X_below_median)
n_above <- length(X_above_median)
mean_below <- mean(X_below_median)
mean_above <- mean(X_above_median)
sd_below <- sd(X_below_median)
sd_above <- sd(X_above_median)
# Calculate the standard error
SE <- sqrt((sd_below^2 / n_below) + (sd_above^2 / n_above))
# Calculate the margin of error
ME <- qt(0.975, df = n_below + n_above - 2) * SE
# Calculate the confidence interval
CI <- c((mean_below - mean_above) - ME, (mean_below - mean_above) + ME)
CI
## [1] -424.9548 -343.9242
# Observed difference in means
mean_below - mean_above
## [1] -384.4395

Again, I will double check my work using the t-test function in R.

# Perform t-test
t_test_result <- t.test(X_below_median, X_above_median)
# Calculate the confidence interval for the difference in means
ci <- t_test_result$conf.int
# Print the confidence interval
ci
## [1] -424.9554 -343.9236
## attr(,"conf.level")
## [1] 0.95

The 95% confidence interval for the difference in mean TotalBsmtSF (houses at or below the median SalePrice minus houses above it) is between -424.96 and -343.92. In other words, we are 95% confident that houses priced above the median have, on average, between about 344 and 425 more square feet of basement than houses at or below the median. The interval is centered on the observed difference of -384.44.

Correlation Matrix

Next, I will derive a correlation matrix for two of the quantitative variables I selected and test the hypothesis that the correlation between these variables is 0. I will also calculate the 99% confidence interval for the correlation.

# Derive a correlation matrix for two of the quantitative variables you selected
correlation_matrix <- cor(train[, c("TotalBsmtSF", "SalePrice")])
correlation_matrix
##             TotalBsmtSF SalePrice
## TotalBsmtSF   1.0000000 0.6135806
## SalePrice     0.6135806 1.0000000

The correlation between TotalBsmtSF and SalePrice is 0.61, which indicates a strong positive linear relationship between the two variables. This reinforces what we saw in the scatterplot.

# Calculate correlation coefficient
correlation_coefficient <- cor(X, Y)
# Sample size
n <- length(X)
# Degrees of freedom
df <- n - 2
# Manual calculation of t-statistic
t_statistic <- correlation_coefficient * sqrt(df) / sqrt(1 - correlation_coefficient^2)
# Manual calculation of p-value
p_value <- 2 * pt(-abs(t_statistic), df)

# Calculate 99% confidence interval using Fisher transformation to get the z-score
r_transform <- 0.5 * log((1 + correlation_coefficient) / (1 - correlation_coefficient))
CI_lower <- tanh(r_transform - qnorm(0.995) / sqrt(n - 3))
CI_upper <- tanh(r_transform + qnorm(0.995) / sqrt(n - 3))

# Print the results
cat("Correlation Coefficient:", correlation_coefficient, "\n")
## Correlation Coefficient: 0.6135806
cat("t-statistic:", t_statistic, "\n")
## t-statistic: 29.67055
cat("p-value:", p_value, "\n")
## p-value: 9.484229e-152
cat("99% Confidence Interval:", CI_lower, "-", CI_upper, "\n")
## 99% Confidence Interval: 0.5697562 - 0.6539251

Here, I verified that the correlation coefficient is 0.61, which is the same as the correlation matrix. The t-statistic is 29.67, and the p-value is extremely low, which means we reject the null hypothesis that the correlation between TotalBsmtSF and SalePrice is 0. The 99% confidence interval for the correlation is between 0.57 and 0.65, which means we are 99% confident that the correlation between TotalBsmtSF and SalePrice is between 0.57 and 0.65. This fits our observed correlation of 0.61.

Now, I double check my work using the cor.test function in R.

# Calculate the correlation coefficient
correlation <- cor(X, Y)

# Perform the hypothesis test and obtain the 99% confidence interval
test_result <- cor.test(X, Y, method = "pearson", conf.level = 0.99)

# Print the test result
print(test_result)
## 
##  Pearson's product-moment correlation
## 
## data:  X and Y
## t = 29.671, df = 1458, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 99 percent confidence interval:
##  0.5697562 0.6539251
## sample estimates:
##       cor 
## 0.6135806

We achieved the same results using the cor.test function in R. The p-value only appears to differ: the printed summary reports any p-value below machine epsilon (about 2.2e-16) as "< 2.2e-16", while test_result$p.value stores the full-precision value, which should agree with the manual calculation.
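
The difference between the stored and the printed p-value can be checked directly. This is a minimal sketch on simulated data; u and v are illustrative vectors, not the housing variables.

```r
# Simulated data; `u` and `v` are illustrative, not the housing variables.
set.seed(1)
u <- rnorm(500)
v <- u + rnorm(500)

res <- cor.test(u, v)

# Manual two-sided p-value from the t statistic
r <- cor(u, v)
df <- length(u) - 2
t_stat <- r * sqrt(df) / sqrt(1 - r^2)
p_manual <- 2 * pt(-abs(t_stat), df)

# The stored p-value is the full-precision number
res$p.value

# Printing goes through format.pval(), which reports anything below
# eps (default .Machine$double.eps, about 2.2e-16) as "< eps"
p_printed <- format.pval(res$p.value, digits = 2)
p_printed
```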

I will repeat the above analysis for each of the 6 variables most correlated with the SalePrice variable.

# Calculate the correlation matrix for the top correlated variables with SalePrice
correlation_matrix_top <- cor(train[, c(top_correlated_variables, "SalePrice")])
correlation_matrix_top
##             OverallQual GrLivArea GarageCars GarageArea TotalBsmtSF X1stFlrSF
## OverallQual   1.0000000 0.5930074  0.6006707  0.5620218   0.5378085 0.4762238
## GrLivArea     0.5930074 1.0000000  0.4672474  0.4689975   0.4548682 0.5660240
## GarageCars    0.6006707 0.4672474  1.0000000  0.8824754   0.4345848 0.4393168
## GarageArea    0.5620218 0.4689975  0.8824754  1.0000000   0.4866655 0.4897817
## TotalBsmtSF   0.5378085 0.4548682  0.4345848  0.4866655   1.0000000 0.8195300
## X1stFlrSF     0.4762238 0.5660240  0.4393168  0.4897817   0.8195300 1.0000000
## SalePrice     0.7909816 0.7086245  0.6404092  0.6234314   0.6135806 0.6058522
##             SalePrice
## OverallQual 0.7909816
## GrLivArea   0.7086245
## GarageCars  0.6404092
## GarageArea  0.6234314
## TotalBsmtSF 0.6135806
## X1stFlrSF   0.6058522
## SalePrice   1.0000000

The correlation matrix shows the correlation between the top 6 variables and the SalePrice variable. We see that all the variables have a positive correlation with the SalePrice variable, which is expected since they are the top correlated variables. We also see relatively strong correlations with each other as noted above. Overall, the matrix reinforces what we saw in the pairplot.

# Create an empty dataframe to store the results
results_df <- data.frame(variable = character(), observed_correlation = numeric(),
                         CI_lower = numeric(), CI_upper = numeric(), stringsAsFactors = FALSE)

# Create a dataframe to store the correlation coefficients for the top correlated variables
correlation_coefficients <- cor(train[, c(top_correlated_variables, "SalePrice")])

# Iterate over each variable and perform hypothesis testing
for (variable in top_correlated_variables) {
  # Perform Pearson correlation test
  test_result <- cor.test(train[[variable]], train$SalePrice, method = "pearson", conf.level = 0.99)
  
  # Extract information from test result
  observed_correlation <- correlation_coefficients[variable, "SalePrice"]
  CI_lower <- test_result$conf.int[1]
  CI_upper <- test_result$conf.int[2]
  
  # Append results to dataframe
  results_df <- rbind(results_df, data.frame(variable = variable,
                                              observed_correlation = observed_correlation,
                                              CI_lower = CI_lower,
                                              CI_upper = CI_upper))
}

# Print the results dataframe
print(results_df)
##      variable observed_correlation  CI_lower  CI_upper
## 1 OverallQual            0.7909816 0.7643382 0.8149288
## 2   GrLivArea            0.7086245 0.6733974 0.7406408
## 3  GarageCars            0.6404092 0.5988712 0.6785107
## 4  GarageArea            0.6234314 0.5804338 0.6629623
## 5 TotalBsmtSF            0.6135806 0.5697562 0.6539251
## 6   X1stFlrSF            0.6058522 0.5613896 0.6468270

The results dataframe shows the observed correlation between each of the top 6 variables and the SalePrice variable, as well as the 99% confidence interval for the correlation. We see that all the variables have a strong positive correlation with the SalePrice variable, and the confidence intervals are relatively tight. This reinforces what we saw in the pairplot and the correlation matrix.

3-Linear Algebra and Correlation.

Invert your correlation matrix. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct principle components analysis (research this!) and interpret. Discuss.

Answer Question #3

Invert the Correlation Matrix

First, I will invert the correlation matrix to obtain the precision matrix. The precision matrix contains variance inflation factors on the diagonal.

# Invert the correlation matrix
precision_matrix <- solve(correlation_matrix)
precision_matrix
##             TotalBsmtSF  SalePrice
## TotalBsmtSF   1.6038006 -0.9840609
## SalePrice    -0.9840609  1.6038006

Multiply the Correlation Matrix by the Precision Matrix

Next, I will multiply the correlation matrix by the precision matrix.

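# Multiply the correlation matrix by the precision matrix
# (%*% is the matrix product; * would multiply element by element)
round(correlation_matrix %*% precision_matrix, 10)
##             TotalBsmtSF SalePrice
## TotalBsmtSF           1         0
## SalePrice             0         1

The product is the 2 x 2 identity matrix, as expected for a matrix multiplied by its inverse.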

Multiply the Precision Matrix by the Correlation Matrix

Finally, I will multiply the precision matrix by the correlation matrix.

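# Multiply the precision matrix by the correlation matrix
# (%*% is the matrix product; * would multiply element by element)
round(precision_matrix %*% correlation_matrix, 10)
##             TotalBsmtSF SalePrice
## TotalBsmtSF           1         0
## SalePrice             0         1

Multiplying in this order gives the identity matrix, since a matrix commutes with its inverse.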
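
The distinction between R's two multiplication operators matters for this step. Below is a minimal sketch using a 2 x 2 matrix rebuilt from the reported correlation of 0.6135806. The names R, P, elementwise, matprod, and vif are illustrative.

```r
# A 2 x 2 correlation matrix built from the reported correlation
r <- 0.6135806
R <- matrix(c(1, r, r, 1), nrow = 2,
            dimnames = list(c("TotalBsmtSF", "SalePrice"),
                            c("TotalBsmtSF", "SalePrice")))
P <- solve(R)          # precision matrix (inverse of R)

elementwise <- R * P   # multiplies matching entries (Hadamard product)
matprod <- R %*% P     # matrix product: the identity, up to rounding

# With two variables, each diagonal entry of P is the
# variance inflation factor 1 / (1 - r^2)
vif <- 1 / (1 - r^2)

elementwise
round(matprod, 10)
vif
```

Only the %*% result is the identity matrix. The element-wise product has 1 / (1 - r^2) on its diagonal and 1 - 1 / (1 - r^2) off the diagonal.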

Principal Components Analysis

Next, I will conduct principal components analysis (PCA) on the correlation matrix. PCA is a technique used to reduce the dimensionality of the data by transforming the data into a new coordinate system. It does this by finding the principal components, which are the directions in which the data varies the most. I will use the prcomp function in R to perform PCA.

# Perform PCA
pca_result <- prcomp(train[, c("TotalBsmtSF", "SalePrice")], scale = TRUE)
# Print the PCA result
summary(pca_result)
## Importance of components:
##                           PC1    PC2
## Standard deviation     1.2703 0.6216
## Proportion of Variance 0.8068 0.1932
## Cumulative Proportion  0.8068 1.0000
pca_result
## Standard deviations (1, .., p=2):
## [1] 1.2702679 0.6216265
## 
## Rotation (n x k) = (2 x 2):
##                   PC1        PC2
## TotalBsmtSF 0.7071068 -0.7071068
## SalePrice   0.7071068  0.7071068

The summary shows the standard deviation of each principal component, the proportion of variance it explains, and the cumulative proportion. The first principal component explains 80.68% of the variance and the second explains the remaining 19.32%. The rotation matrix shows how the original variables load on each component. The first component weights TotalBsmtSF and SalePrice equally (0.7071 each), so it measures how the two variables rise and fall together. The second component is orthogonal to the first and contrasts the two variables. With only two standardized variables this structure is fixed by the correlation r = 0.6136: the component variances are 1 + r and 1 - r, whose square roots are the standard deviations 1.2703 and 0.6216 reported above, and every loading has magnitude 1/sqrt(2).
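
For two standardized variables, the PCA output can be reproduced from the correlation alone. This is a small sketch assuming only the reported correlation of 0.6135806. R and eig are illustrative names.

```r
# For two standardized variables the PCA is determined entirely by r.
r <- 0.6135806
R <- matrix(c(1, r, r, 1), nrow = 2)
eig <- eigen(R)

eig$values          # component variances: 1 + r and 1 - r
sqrt(eig$values)    # the "Standard deviation" row of the PCA summary
eig$values / 2      # proportion of variance per component
abs(eig$vectors)    # every loading has magnitude 1 / sqrt(2)
```

The signs of the eigenvectors are arbitrary, which is why the sketch compares absolute values of the loadings.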

Now I will perform PCA on all the numeric columns of the dataframe, using complete rows only. I will remove any binary variables in the dataset since they will not contribute to the PCA.

Before continuing with the PCA analysis, I will impute missing values for the numeric columns. This is to help build the models later on. I will utilize the MICE package to impute the missing values using the mice function. Before doing that, though, I will split the data into training and testing sets to avoid data leakage.

Train/Test Split

Earlier, I noticed some columns with a majority of their values missing. I will remove all the columns with more than 50% missing values before imputation and will not use any of these columns in the modelling either. I will also remove all the character columns from the data, since PCA only works with numeric data (and this simplifies the imputation process).

# Remove columns with more than 50% missing values
missing_values <- sapply(train, function(x) sum(is.na(x)))
cols_to_remove <- names(missing_values[missing_values > 0.5 * nrow(train)]) 
train <- train[, !colnames(train) %in% cols_to_remove]
eval <- eval[, !colnames(eval) %in% cols_to_remove]

# Remove all the character columns
train <- train[, sapply(train, is.numeric)]
eval <- eval[, sapply(eval, is.numeric)]
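# Split the data into training and testing sets
set.seed(1125)
train_index <- createDataPartition(train$SalePrice, p = 0.8, list = FALSE, times = 1)
# Extract the test rows before overwriting train; otherwise the negative index
# would be applied to the already-reduced training set
test <- train[-train_index, ]
train <- train[train_index, ]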
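
A quick sanity check after any split is to confirm that the two subsets share no rows and together cover the whole dataset. The sketch below uses base R's sample() in place of createDataPartition to stay self-contained. The data frame dat and the other names are hypothetical.

```r
# Hypothetical data frame standing in for the housing data
set.seed(1125)
dat <- data.frame(id = 1:100, y = rnorm(100))

idx <- sample(nrow(dat), size = 80)  # row numbers chosen for training

# Take both subsets from the same untouched data frame
train_part <- dat[idx, ]
test_part  <- dat[-idx, ]

# Sanity checks: no shared rows, and every row is used exactly once
shared_ids <- intersect(train_part$id, test_part$id)
n_total <- nrow(train_part) + nrow(test_part)
length(shared_ids)
n_total
```

The order matters when the original object is overwritten: if dat were replaced by its training subset first, the negative index would select rows from the reduced frame, and the "test" rows would all be training rows.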

Data Imputation

The MICE package provides a flexible and easy-to-use method for imputing missing values in a dataset. I will use the mice function to impute the missing values in the numeric columns of the training set. Since the target variable should not be used for imputation, I will exclude the SalePrice variable from the imputation process.

# Remove the SalePrice variable from the training and testing sets
train_no_target <- train[, !colnames(train) %in% c("SalePrice")]
test_no_target <- test[, !colnames(test) %in% c("SalePrice")]

# Combine for imputation
combined_data <- rbind(train_no_target, test_no_target, eval)

# Create a data type variable to indicate whether the data is from the training, testing, or evaluation set
data_type <- c(rep("train", nrow(train)), 
               rep("test", nrow(test)),
               rep("eval", nrow(eval)))

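# Function to impute missing values
impute_func <- function(data, data_type) {
    # Rows outside the training set are still imputed, but they are ignored
    # when the imputation models are fitted
    ignore_rows <- data_type != "train"
    ini <- mice(data, maxit = 0, ignore = ignore_rows)
    meth <- ini$method
    # Pass ignore here as well; the dry run above does not carry it over
    imputed_object <- mice(data, method = meth, m = 1, maxit = 30, seed = 1125,
                           ignore = ignore_rows, print = FALSE)
    imputed_data <- complete(imputed_object, 1)
    print(meth)
    
    return(list(imputed_object = imputed_object, imputed_data = imputed_data))
}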

# Call function
results <- impute_func(combined_data, data_type)
## Warning: Number of logged events: 416
##            Id    MSSubClass   LotFrontage       LotArea   OverallQual 
##            ""            ""         "pmm"            ""            "" 
##   OverallCond     YearBuilt  YearRemodAdd    MasVnrArea    BsmtFinSF1 
##            ""            ""            ""         "pmm"         "pmm" 
##    BsmtFinSF2     BsmtUnfSF   TotalBsmtSF     X1stFlrSF     X2ndFlrSF 
##         "pmm"         "pmm"         "pmm"            ""            "" 
##  LowQualFinSF     GrLivArea  BsmtFullBath  BsmtHalfBath      FullBath 
##            ""            ""         "pmm"         "pmm"            "" 
##      HalfBath  BedroomAbvGr  KitchenAbvGr  TotRmsAbvGrd    Fireplaces 
##            ""            ""            ""            ""            "" 
##   GarageYrBlt    GarageCars    GarageArea    WoodDeckSF   OpenPorchSF 
##         "pmm"         "pmm"         "pmm"            ""            "" 
## EnclosedPorch    X3SsnPorch   ScreenPorch      PoolArea       MiscVal 
##            ""            ""            ""            ""            "" 
##        MoSold        YrSold 
##            ""            ""
# Reintegrate the target variable
reintegrate_targets <- function(imputed_data, original_data, target_vars) {
  target_data <- original_data[, target_vars, drop = FALSE]
  cbind(imputed_data, target_data)
}

# Combine original data
full_combined_data <- rbind(train, test)

# Reintegrate target
imputed_data_with_targets <- reintegrate_targets(results$imputed_data[data_type != "eval", ], full_combined_data, "SalePrice")

# Split the data back into training, testing, and evaluation sets
train_imputed <- imputed_data_with_targets[data_type == "train", ]
test_imputed <- imputed_data_with_targets[data_type == "test", ]
eval_imputed <- results$imputed_data[data_type == "eval", ]

# Remove the extra columns
cols_to_remove <- c(".imp", ".id")
train_imputed <- train_imputed[, !colnames(train_imputed) %in% cols_to_remove]
test_imputed <- test_imputed[, !colnames(test_imputed) %in% cols_to_remove]
eval_imputed <- eval_imputed[, !colnames(eval_imputed) %in% cols_to_remove]

To confirm that the imputation was successful, I will check the dimensions of the imputed dataframes and compare them to the original dataframes.

dim(train_imputed)
## [1] 1169   38
dim(test_imputed)
## [1] 229  38
dim(eval_imputed)
## [1] 1459   37
dim(train)
## [1] 1169   38
dim(test)
## [1] 229  38
dim(eval)
## [1] 1459   37

The dimensions of the imputed dataframes match the dimensions of the original dataframes, which indicates that the imputation was successful.

Next I will check if there are any missing values in the imputed dataframes to ensure that the imputation was successful.

# Check for missing values in the imputed dataframes
sum(is.na(train_imputed))
## [1] 0
sum(is.na(test_imputed))
## [1] 0
sum(is.na(eval_imputed))
## [1] 0

There are now no missing values in any of the imputed dataframes, which confirms that the imputation was successful. It is important to note that the dataframes all had the character columns removed along with columns with more than 50% missing values before the imputation process.

Principal Components Analysis

pca_result_all <- train_imputed %>%
  select_if(function(col) length(unique(col)) > 2) %>%  # Exclude binary variables
  prcomp(scale = TRUE)  # Perform PCA

# Summary of PCA result
summary(pca_result_all)
## Importance of components:
##                           PC1     PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.8105 1.81751 1.59373 1.43613 1.24698 1.09429 1.09332
## Proportion of Variance 0.2079 0.08693 0.06684 0.05428 0.04092 0.03151 0.03146
## Cumulative Proportion  0.2079 0.29480 0.36164 0.41591 0.45683 0.48835 0.51980
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     1.07343 1.05978 1.04906 1.04087 1.01899 1.00310 0.99072
## Proportion of Variance 0.03032 0.02956 0.02896 0.02851 0.02732 0.02648 0.02583
## Cumulative Proportion  0.55013 0.57968 0.60864 0.63715 0.66448 0.69096 0.71679
##                           PC15    PC16    PC17    PC18    PC19    PC20    PC21
## Standard deviation     0.96551 0.96057 0.92801 0.91377 0.90517 0.89065 0.84954
## Proportion of Variance 0.02453 0.02428 0.02266 0.02197 0.02156 0.02088 0.01899
## Cumulative Proportion  0.74132 0.76560 0.78826 0.81024 0.83180 0.85267 0.87167
##                          PC22    PC23    PC24    PC25   PC26    PC27    PC28
## Standard deviation     0.8246 0.78998 0.75063 0.70955 0.6434 0.62427 0.55898
## Proportion of Variance 0.0179 0.01642 0.01483 0.01325 0.0109 0.01026 0.00822
## Cumulative Proportion  0.8896 0.90598 0.92081 0.93406 0.9450 0.95521 0.96344
##                           PC29    PC30    PC31    PC32    PC33    PC34    PC35
## Standard deviation     0.52646 0.51266 0.44767 0.44632 0.36959 0.35559 0.31314
## Proportion of Variance 0.00729 0.00692 0.00527 0.00524 0.00359 0.00333 0.00258
## Cumulative Proportion  0.97073 0.97765 0.98292 0.98816 0.99176 0.99508 0.99766
##                           PC36      PC37      PC38
## Standard deviation     0.29797 1.248e-15 4.226e-16
## Proportion of Variance 0.00234 0.000e+00 0.000e+00
## Cumulative Proportion  1.00000 1.000e+00 1.000e+00

The summary of the PCA result shows that 18 principal components capture 80% of the variance, 23 capture 90%, 27 capture 95%, and 33 capture 99%. This means that we can reduce the data from 38 variables to 18, 23, 27, or 33 principal components while retaining the respective share of variance. The last two components have essentially zero variance, which reflects exact linear dependencies among the columns; the full regression model later reports TotalBsmtSF and GrLivArea as not estimable for the same reason. Note that train_imputed still contains Id and SalePrice, so both are part of this decomposition and would need to be dropped before the components are used as predictors of SalePrice. I will store the PCA components for each of the above thresholds for potential use in later modeling.

# Store the PCA components for each threshold
pca_components_80 <- pca_result_all$x[, 1:18]
pca_components_90 <- pca_result_all$x[, 1:23]
pca_components_95 <- pca_result_all$x[, 1:27]
pca_components_99 <- pca_result_all$x[, 1:33]
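
The component counts can also be derived programmatically, without reading them off the summary table. This is a minimal sketch in which n_components_for is a hypothetical helper and the built-in mtcars data stands in for the imputed training set.

```r
# Smallest number of components whose cumulative variance reaches a threshold.
# `n_components_for` is a hypothetical helper; mtcars stands in for the data.
n_components_for <- function(pca, threshold) {
  cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)
  which(cum_var >= threshold)[1]
}

pca_demo <- prcomp(mtcars, scale. = TRUE)
thresholds <- c(0.80, 0.90, 0.95, 0.99)
k <- sapply(thresholds, n_components_for, pca = pca_demo)
names(k) <- paste0(thresholds * 100, "%")
k

# Scores for the 80% threshold, analogous to pca_components_80
scores_80 <- pca_demo$x[, seq_len(k[["80%"]]), drop = FALSE]
```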

4-Calculus-Based Probability & Statistics.

Many times, it makes sense to fit a closed form distribution to data. For your variable that is skewed to the right, shift it so that the minimum value is above zero. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R-devel/library/MASS/html/fitdistr.html ). Find the optimal value of \(\lambda\) for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, \(\lambda\))). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

Answer Question #4

Fit an Exponential Probability Density Function

First, I will shift the variable that is skewed to the right so that the minimum value is above zero. I have saved the TotalBsmtSF variable as X.

# Shift the variable so that the minimum value is above zero
X_shifted <- X - min(X) + 0.1

Next, I will fit an exponential probability density function to the shifted variable using the fitdistr function from the MASS package. I will find the optimal value of \(\lambda\) for this distribution.

# Fit an exponential probability density function to the shifted variable
fit <- fitdistr(X_shifted, densfun = "exponential")
# Optimal value of lambda
lambda <- fit$estimate
lambda
##         rate 
## 0.0009456001
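
For the exponential distribution, the maximum likelihood estimate of the rate has a closed form: the reciprocal of the sample mean. The sketch below checks this numerically on simulated data. The vector x and the other names are illustrative, and optimize is used only as an independent check.

```r
# Simulated positive data; `x` is illustrative, not the housing variable.
set.seed(1125)
x <- rexp(500, rate = 0.001)

# Closed-form maximum likelihood estimate of the exponential rate
rate_closed <- 1 / mean(x)

# Numerical check: maximize the log-likelihood n * log(rate) - rate * sum(x)
loglik <- function(rate) length(x) * log(rate) - rate * sum(x)
rate_numeric <- optimize(loglik, interval = c(1e-6, 0.1),
                         maximum = TRUE, tol = 1e-12)$maximum

rate_closed
rate_numeric
```

Applied to the fit above, 1 / 0.0009456001 is about 1057.5, which is the mean of the shifted variable.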

Generate Samples from the Exponential Distribution and Plot Histograms

Next, I will take 1000 samples from this exponential distribution using the optimal value of \(\lambda\) and plot a histogram to compare it with a histogram of the original variable. I will add vertical lines for the mean and mode of both the original variable and the exponential distribution. This will aid in visualizing the differences between the two distributions.

# Take 1000 samples from the exponential distribution
set.seed(1125)
samples <- rexp(1000, lambda)

# Calculate mean and mode of the original variable
mean_original <- mean(X_shifted)
mode_original <- density(X_shifted)$x[which.max(density(X_shifted)$y)]

# Calculate mean and mode of the exponential distribution
mean_exponential <- 1 / lambda  # Mean of exponential distribution is 1 / lambda
mode_exponential <- 0  # Mode of exponential distribution is always 0

# Plot a histogram of the original variable and the exponential distribution
ggplot() +
  geom_histogram(aes(X_shifted, fill = "Original Variable"), bins = 30, alpha = 0.7) +
  geom_histogram(aes(samples, fill = "Exponential Distribution"), bins = 30, alpha = 0.7) +
  geom_vline(xintercept = mean_original, linetype = "dashed", color = "blue", linewidth = 1) +  # Mean line (original)
  geom_vline(xintercept = mode_original, linetype = "dashed", color = "red", linewidth = 1) +   # Mode line (original)
  geom_vline(xintercept = mean_exponential, linetype = "dotted", color = "green", linewidth = 1) +  # Mean line (exponential)
  geom_vline(xintercept = mode_exponential, linetype = "dotted", color = "purple", linewidth = 1) +  # Mode line (exponential)
  labs(title = "Histogram of Original Variable and Exponential Distribution", x = "Value", y = "Frequency") +
  # Name the colours so that each legend label is tied to the correct histogram;
  # unnamed values plus a separate labels vector would swap the two labels
  scale_fill_manual(values = c("Original Variable" = "#4E79A7",
                               "Exponential Distribution" = "#F28E2B"))

The above plot shows the histogram of the original variable and of the exponential samples. The blue dashed line marks the mean of the original variable, the red dashed line its mode, the green dotted line the mean of the exponential distribution, and the purple dotted line its mode. The two means coincide, which is expected because the maximum likelihood estimate of the rate is the reciprocal of the sample mean. The modes are very different: the exponential density is always highest at 0 and decreases from there, whereas the original variable peaks well above 0.

Based on the above plots, it seems like the exponential distribution does not fit the original variable very well. While the mean stays the same, the overall distribution is very different.

Find the 5th and 95th Percentiles Using the Exponential CDF

Next, I will find the 5th and 95th percentiles of the fitted exponential distribution by inverting its cumulative distribution function (CDF), which is what the qexp function computes.

# Find the 5th and 95th percentiles using the exponential CDF
percentile_5 <- qexp(0.05, rate = lambda)
percentile_95 <- qexp(0.95, rate = lambda)
percentile_5
## [1] 54.24417
percentile_95
## [1] 3168.075

The 5th percentile of the exponential distribution is 54.24 and the 95th percentile is 3168.08. This means that 90% of the exponential distribution falls between 54.24 and 3168.08.
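
The exponential CDF is F(x) = 1 - exp(-lambda * x), so its inverse has the closed form Q(p) = -log(1 - p) / lambda. The following sketch evaluates that formula with the rate reported above. The names exp_quantile and lambda_hat are illustrative.

```r
lambda_hat <- 0.0009456001  # rate reported by fitdistr above

# Inverse CDF of the exponential: Q(p) = -log(1 - p) / lambda
exp_quantile <- function(p, rate) -log(1 - p) / rate

q05 <- exp_quantile(0.05, lambda_hat)
q95 <- exp_quantile(0.95, lambda_hat)
c(q05, q95)

# pexp() is the CDF, so it maps the quantiles back to 0.05 and 0.95
pexp(c(q05, q95), rate = lambda_hat)
```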

Generate a 95% Confidence Interval from the Empirical Data

Next, I will generate a 95% confidence interval from the empirical data, assuming normality. For this section, I will use the original variable X, not the shifted variable X_shifted. The choice has almost no effect on the interval, since the shift barely moved the data.

# Calculate the standard error
SE <- sd(X) / sqrt(length(X))
# Calculate the margin of error
ME <- qt(0.975, df = length(X) - 1) * SE
# Calculate the confidence interval
CI <- c(mean(X) - ME, mean(X) + ME)
CI
## [1] 1034.908 1079.951

The 95% confidence interval for the mean of the original variable X is between 1034.91 and 1079.95. This means that we are 95% confident that the true mean of X lies in this range. As before, I will double check my work using the t-test function in R.

# Generate a 95% confidence interval assuming normality
conf_interval <- t.test(X)$conf.int
# Print the confidence interval
cat("95% Confidence Interval (Normality assumption): [", conf_interval[1], ",", conf_interval[2], "]\n")
## 95% Confidence Interval (Normality assumption): [ 1034.908 , 1079.951 ]

The results are the same as before.

Empirical 5th and 95th Percentiles of the Data

Finally, I will calculate the empirical 5th and 95th percentiles of the original variable X.

# Calculate the 5th and 95th percentiles for variable "X"
percentile_5th <- quantile(X, probs = 0.05)
percentile_95th <- quantile(X, probs = 0.95)
percentile_5th
##    5% 
## 519.3
percentile_95th
##  95% 
## 1753

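The empirical 5th percentile of the original variable X is 519.3 and the 95th percentile is 1753, so 90% of the data falls between 519.3 and 1753. The fitted exponential distribution puts the same percentiles at 54.24 and 3168.08, a far wider range, which is further evidence that the exponential distribution is a poor fit for this variable. The 95% confidence interval of 1034.91 to 1079.95 is much narrower than either range because it describes uncertainty about the mean, not the spread of individual values.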

5-Modeling.

Build some type of regression model and submit your model to the competition board. Provide your complete model summary and results with analysis. Report your Kaggle.com user name and score.

Answer Question #5

Build a Regression Model

For this question, I will build a linear regression model to predict the SalePrice variable using the TotalBsmtSF variable. These were selected in the previous sections as the X and Y variables. I will then build a regression model using all the numeric variables, a regression model using the top correlated variables, and regression models using the PCA components. I will predict the SalePrice variable against the testing set and calculate the RMSE and R-squared values for each model.

# Build a linear regression model
model <- lm(SalePrice ~ TotalBsmtSF, train_imputed)
# Print the model summary
summary(model)
## 
## Call:
## lm(formula = SalePrice ~ TotalBsmtSF, data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -176177  -40006  -14862   35293  403093 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 51726.570   4876.258   10.61   <2e-16 ***
## TotalBsmtSF   122.823      4.299   28.57   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 61630 on 1167 degrees of freedom
## Multiple R-squared:  0.4116, Adjusted R-squared:  0.4111 
## F-statistic: 816.4 on 1 and 1167 DF,  p-value: < 2.2e-16
# Build a regression model using all the numeric variables
model_all <- lm(SalePrice ~ ., train_imputed)
# Print the model summary
summary(model_all)
## 
## Call:
## lm(formula = SalePrice ~ ., data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -138346  -15955   -1147   14172  217521 
## 
## Coefficients: (2 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -4.300e+05  1.338e+06  -0.321  0.74806    
## Id            -2.309e-02  2.069e+00  -0.011  0.99110    
## MSSubClass    -1.382e+02  2.680e+01  -5.158 2.95e-07 ***
## LotFrontage    5.356e+01  4.771e+01   1.123  0.26185    
## LotArea        4.872e-01  9.231e-02   5.278 1.57e-07 ***
## OverallQual    1.530e+04  1.147e+03  13.344  < 2e-16 ***
## OverallCond    4.706e+03  9.553e+02   4.926 9.64e-07 ***
## YearBuilt      3.544e+02  6.718e+01   5.275 1.59e-07 ***
## YearRemodAdd   1.584e+02  6.469e+01   2.449  0.01446 *  
## MasVnrArea     3.315e+01  5.704e+00   5.813 7.98e-09 ***
## BsmtFinSF1     4.155e+01  4.535e+00   9.162  < 2e-16 ***
## BsmtFinSF2     1.827e+01  6.642e+00   2.751  0.00603 ** 
## BsmtUnfSF      2.097e+01  3.958e+00   5.298 1.41e-07 ***
## TotalBsmtSF           NA         NA      NA       NA    
## X1stFlrSF      7.026e+01  5.631e+00  12.477  < 2e-16 ***
## X2ndFlrSF      7.863e+01  4.748e+00  16.560  < 2e-16 ***
## LowQualFinSF   2.244e+01  1.749e+01   1.283  0.19974    
## GrLivArea             NA         NA      NA       NA    
## BsmtFullBath   1.210e+03  2.471e+03   0.489  0.62461    
## BsmtHalfBath  -8.845e+02  3.792e+03  -0.233  0.81559    
## FullBath      -3.003e+03  2.662e+03  -1.128  0.25948    
## HalfBath      -6.784e+03  2.535e+03  -2.676  0.00755 ** 
## BedroomAbvGr  -1.375e+04  1.625e+03  -8.460  < 2e-16 ***
## KitchenAbvGr  -1.119e+04  4.906e+03  -2.280  0.02279 *  
## TotRmsAbvGrd   2.969e+03  1.194e+03   2.488  0.01300 *  
## Fireplaces     2.713e+01  1.688e+03   0.016  0.98718    
## GarageYrBlt   -7.529e+00  7.052e+01  -0.107  0.91500    
## GarageCars     2.039e+03  2.724e+03   0.749  0.45425    
## GarageArea     1.344e+01  9.504e+00   1.414  0.15761    
## WoodDeckSF     1.631e+01  7.579e+00   2.152  0.03158 *  
## OpenPorchSF    1.838e+01  1.433e+01   1.283  0.19982    
## EnclosedPorch -1.422e+01  1.634e+01  -0.870  0.38453    
## X3SsnPorch    -4.598e-01  2.754e+01  -0.017  0.98668    
## ScreenPorch    2.071e+01  1.617e+01   1.280  0.20067    
## PoolArea       5.610e+01  2.089e+01   2.685  0.00735 ** 
## MiscVal       -1.505e+00  1.581e+00  -0.952  0.34139    
## MoSold        -1.331e+02  3.276e+02  -0.406  0.68454    
## YrSold        -3.106e+02  6.663e+02  -0.466  0.64118    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29400 on 1133 degrees of freedom
## Multiple R-squared:   0.87,  Adjusted R-squared:  0.866 
## F-statistic: 216.7 on 35 and 1133 DF,  p-value: < 2.2e-16
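
The two NA rows in the summary above (TotalBsmtSF and GrLivArea) are what lm() reports when a predictor is an exact linear combination of other predictors. A plausible reason, assumed here rather than verified, is that TotalBsmtSF equals BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF and GrLivArea equals X1stFlrSF + X2ndFlrSF + LowQualFinSF. The sketch below reproduces the behaviour on made-up data; the names toy, part_a, part_b, and total are hypothetical.

```r
# Toy data (hypothetical): 'total' is the exact sum of 'part_a' and 'part_b'.
set.seed(7)
toy <- data.frame(part_a = rnorm(50, 100, 20),
                  part_b = rnorm(50, 50, 10))
toy$total <- toy$part_a + toy$part_b
toy$y <- 3 * toy$part_a + 2 * toy$part_b + rnorm(50)

# lm() cannot estimate a separate effect for 'total', so its coefficient is NA.
fit <- lm(y ~ part_a + part_b + total, data = toy)
coef(fit)

# alias() reports which terms are linear combinations of the others.
alias(fit)
```

Dropping the aliased column before fitting gives the same fitted values without the NA rows.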
# Build a regression model using the top correlated variables
model_top <- lm(SalePrice ~ ., train_imputed[, c(top_correlated_variables, "SalePrice")])
# Print the model summary
summary(model_top)
## 
## Call:
## lm(formula = SalePrice ~ ., data = train_imputed[, c(top_correlated_variables, 
##     "SalePrice")])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -146424  -19747   -1269   17147  250127 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.138e+05  5.012e+03 -22.707  < 2e-16 ***
## OverallQual  2.252e+04  1.109e+03  20.302  < 2e-16 ***
## GrLivArea    5.560e+01  2.737e+00  20.314  < 2e-16 ***
## GarageCars   4.061e+03  3.114e+03   1.304 0.192471    
## GarageArea   3.974e+01  1.061e+01   3.747 0.000187 ***
## TotalBsmtSF  3.657e+01  4.378e+00   8.352  < 2e-16 ***
## X1stFlrSF    7.743e+00  5.014e+00   1.544 0.122807    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 35180 on 1162 degrees of freedom
## Multiple R-squared:  0.8091, Adjusted R-squared:  0.8081 
## F-statistic: 820.8 on 6 and 1162 DF,  p-value: < 2.2e-16
# Build a regression model using the PCA components that explain 80% of the variance
pca_80 <- cbind(SalePrice = train_imputed$SalePrice, pca_components_80)

model_pca_80 <- lm(SalePrice ~ ., as.data.frame(pca_80))
# Print the model summary
summary(model_pca_80)
## 
## Call:
## lm(formula = SalePrice ~ ., data = as.data.frame(pca_80))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -111363  -16440   -1587   14109  221191 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 181183.60     848.23 213.603  < 2e-16 ***
## PC1          25951.12     301.93  85.950  < 2e-16 ***
## PC2           1583.71     466.90   3.392 0.000717 ***
## PC3          -4335.87     532.46  -8.143 9.94e-16 ***
## PC4           5096.21     590.89   8.625  < 2e-16 ***
## PC5          -6243.57     680.52  -9.175  < 2e-16 ***
## PC6           -307.08     775.47  -0.396 0.692185    
## PC7          -3608.17     776.16  -4.649 3.73e-06 ***
## PC8          -7004.98     790.54  -8.861  < 2e-16 ***
## PC9           4050.73     800.72   5.059 4.91e-07 ***
## PC10          2065.12     808.90   2.553 0.010809 *  
## PC11           839.59     815.27   1.030 0.303305    
## PC12          4047.54     832.78   4.860 1.33e-06 ***
## PC13          1534.67     845.97   1.814 0.069923 .  
## PC14          1425.41     856.54   1.664 0.096356 .  
## PC15          1616.89     878.90   1.840 0.066073 .  
## PC16         -3068.96     883.42  -3.474 0.000532 ***
## PC17            64.63     914.42   0.071 0.943661    
## PC18          2135.54     928.67   2.300 0.021651 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29000 on 1150 degrees of freedom
## Multiple R-squared:  0.8716, Adjusted R-squared:  0.8696 
## F-statistic: 433.8 on 18 and 1150 DF,  p-value: < 2.2e-16
# Build a regression model using the PCA components that explain 90% of the variance
pca_90 <- cbind(SalePrice = train_imputed$SalePrice, pca_components_90)

model_pca_90 <- lm(SalePrice ~ ., as.data.frame(pca_90))
# Print the model summary
summary(model_pca_90)
## 
## Call:
## lm(formula = SalePrice ~ ., data = as.data.frame(pca_90))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -115735  -15300   -1922   13700  204129 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 181183.60     823.39 220.047  < 2e-16 ***
## PC1          25951.12     293.09  88.542  < 2e-16 ***
## PC2           1583.71     453.22   3.494 0.000493 ***
## PC3          -4335.87     516.86  -8.389  < 2e-16 ***
## PC4           5096.21     573.58   8.885  < 2e-16 ***
## PC5          -6243.57     660.59  -9.452  < 2e-16 ***
## PC6           -307.08     752.76  -0.408 0.683397    
## PC7          -3608.17     753.43  -4.789 1.90e-06 ***
## PC8          -7004.98     767.39  -9.128  < 2e-16 ***
## PC9           4050.73     777.27   5.211 2.22e-07 ***
## PC10          2065.12     785.22   2.630 0.008653 ** 
## PC11           839.59     791.40   1.061 0.288959    
## PC12          4047.54     808.39   5.007 6.40e-07 ***
## PC13          1534.67     821.20   1.869 0.061902 .  
## PC14          1425.41     831.46   1.714 0.086736 .  
## PC15          1616.89     853.16   1.895 0.058321 .  
## PC16         -3068.96     857.55  -3.579 0.000360 ***
## PC17            64.63     887.64   0.073 0.941965    
## PC18          2135.54     901.47   2.369 0.018004 *  
## PC19          1480.49     910.04   1.627 0.104046    
## PC20          1952.95     924.87   2.112 0.034937 *  
## PC21         -5620.63     969.63  -5.797 8.74e-09 ***
## PC22         -5760.86     998.91  -5.767 1.04e-08 ***
## PC23          1260.73    1042.74   1.209 0.226889    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 28150 on 1145 degrees of freedom
## Multiple R-squared:  0.8796, Adjusted R-squared:  0.8771 
## F-statistic: 363.5 on 23 and 1145 DF,  p-value: < 2.2e-16
# Build a regression model using the PCA components that explain 95% of the variance
pca_95 <- cbind(SalePrice = train_imputed$SalePrice, pca_components_95)

model_pca_95 <- lm(SalePrice ~ ., as.data.frame(pca_95))
# Print the model summary
summary(model_pca_95)
## 
## Call:
## lm(formula = SalePrice ~ ., data = as.data.frame(pca_95))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -110355  -13723    -989   12948  162820 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 181183.60     716.46 252.886  < 2e-16 ***
## PC1          25951.12     255.03 101.756  < 2e-16 ***
## PC2           1583.71     394.37   4.016 6.31e-05 ***
## PC3          -4335.87     449.75  -9.641  < 2e-16 ***
## PC4           5096.21     499.10  10.211  < 2e-16 ***
## PC5          -6243.57     574.81 -10.862  < 2e-16 ***
## PC6           -307.08     655.01  -0.469 0.639291    
## PC7          -3608.17     655.59  -5.504 4.59e-08 ***
## PC8          -7004.98     667.74 -10.491  < 2e-16 ***
## PC9           4050.73     676.34   5.989 2.82e-09 ***
## PC10          2065.12     683.25   3.022 0.002563 ** 
## PC11           839.59     688.63   1.219 0.223010    
## PC12          4047.54     703.41   5.754 1.12e-08 ***
## PC13          1534.67     714.56   2.148 0.031946 *  
## PC14          1425.41     723.49   1.970 0.049058 *  
## PC15          1616.89     742.37   2.178 0.029610 *  
## PC16         -3068.96     746.19  -4.113 4.19e-05 ***
## PC17            64.63     772.37   0.084 0.933323    
## PC18          2135.54     784.41   2.722 0.006578 ** 
## PC19          1480.49     791.87   1.870 0.061791 .  
## PC20          1952.95     804.77   2.427 0.015390 *  
## PC21         -5620.63     843.71  -6.662 4.20e-11 ***
## PC22         -5760.86     869.19  -6.628 5.24e-11 ***
## PC23          1260.73     907.33   1.389 0.164953    
## PC24          -252.87     954.89  -0.265 0.791197    
## PC25          3477.78    1010.17   3.443 0.000597 ***
## PC26         20663.82    1113.96  18.550  < 2e-16 ***
## PC27          4481.66    1148.17   3.903 0.000100 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24500 on 1141 degrees of freedom
## Multiple R-squared:  0.9091, Adjusted R-squared:  0.907 
## F-statistic: 422.8 on 27 and 1141 DF,  p-value: < 2.2e-16
# Build a regression model using the PCA components that explain 99% of the variance
pca_99 <- cbind(SalePrice = train_imputed$SalePrice, pca_components_99)

model_pca_99 <- lm(SalePrice ~ ., as.data.frame(pca_99))
# Print the model summary
summary(model_pca_99)
## 
## Call:
## lm(formula = SalePrice ~ ., data = as.data.frame(pca_99))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -83702 -10816     82  10788 140491 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 181183.60     559.70 323.716  < 2e-16 ***
## PC1          25951.12     199.23 130.257  < 2e-16 ***
## PC2           1583.71     308.08   5.141 3.22e-07 ***
## PC3          -4335.87     351.34 -12.341  < 2e-16 ***
## PC4           5096.21     389.89  13.071  < 2e-16 ***
## PC5          -6243.57     449.04 -13.904  < 2e-16 ***
## PC6           -307.08     511.69  -0.600 0.548542    
## PC7          -3608.17     512.14  -7.045 3.20e-12 ***
## PC8          -7004.98     521.63 -13.429  < 2e-16 ***
## PC9           4050.73     528.35   7.667 3.78e-14 ***
## PC10          2065.12     533.75   3.869 0.000115 ***
## PC11           839.59     537.95   1.561 0.118869    
## PC12          4047.54     549.51   7.366 3.38e-13 ***
## PC13          1534.67     558.21   2.749 0.006067 ** 
## PC14          1425.41     565.18   2.522 0.011804 *  
## PC15          1616.89     579.94   2.788 0.005391 ** 
## PC16         -3068.96     582.92  -5.265 1.68e-07 ***
## PC17            64.63     603.38   0.107 0.914711    
## PC18          2135.54     612.78   3.485 0.000511 ***
## PC19          1480.49     618.60   2.393 0.016860 *  
## PC20          1952.95     628.68   3.106 0.001941 ** 
## PC21         -5620.63     659.11  -8.528  < 2e-16 ***
## PC22         -5760.86     679.01  -8.484  < 2e-16 ***
## PC23          1260.73     708.80   1.779 0.075560 .  
## PC24          -252.87     745.95  -0.339 0.734680    
## PC25          3477.78     789.14   4.407 1.15e-05 ***
## PC26         20663.82     870.22  23.745  < 2e-16 ***
## PC27          4481.66     896.94   4.997 6.75e-07 ***
## PC28          6428.26    1001.72   6.417 2.03e-10 ***
## PC29         -2626.69    1063.59  -2.470 0.013671 *  
## PC30        -17238.56    1092.23 -15.783  < 2e-16 ***
## PC31        -10383.98    1250.79  -8.302 2.88e-16 ***
## PC32          8577.79    1254.58   6.837 1.32e-11 ***
## PC33         27212.41    1515.03  17.962  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19140 on 1135 degrees of freedom
## Multiple R-squared:  0.9448, Adjusted R-squared:  0.9432 
## F-statistic:   589 on 33 and 1135 DF,  p-value: < 2.2e-16

Evaluate all of the above models against the testing set

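First I need principal component scores for the testing set so that the PCA models can be evaluated on it. The goal is to use the same components as on the training set. Note that the prcomp() call on test_imputed fits a new PCA on the testing set (including its SalePrice column) rather than applying the training rotation, so the test components are not guaranteed to line up with the training components. Keep this in mind when reading the RMSE of the PCA models.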
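
One way to reuse training components on new data is the predict() method for prcomp objects, which applies the training centering, scaling, and rotation to newdata. The following is a minimal sketch on the built-in mtcars data, not the housing data; the names pca_train, train_x, and test_x are hypothetical, and the target is deliberately left out of the PCA.

```r
# Illustrative only: mtcars stands in for the housing data.
set.seed(42)
predictors <- mtcars[, c("disp", "hp", "wt", "qsec", "drat")]
train_idx <- sample(nrow(predictors), 24)
train_x <- predictors[train_idx, ]
test_x  <- predictors[-train_idx, ]

# Fit the PCA on the training predictors only (target excluded).
pca_train <- prcomp(train_x, center = TRUE, scale. = TRUE)

# Project both sets using the training centering, scaling, and rotation.
train_scores <- predict(pca_train, newdata = train_x)
test_scores  <- predict(pca_train, newdata = test_x)

# Fit on the first two training components, then predict on the projected test set.
train_df <- data.frame(mpg = mtcars$mpg[train_idx], train_scores[, 1:2])
test_df  <- data.frame(test_scores[, 1:2])
fit <- lm(mpg ~ ., data = train_df)
pred <- predict(fit, newdata = test_df)
```

Because the test scores come from the training rotation, PC1 in test_df means the same thing as PC1 in train_df, which a separately fitted PCA does not guarantee.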

# Perform PCA on the testing set
pca_test <- prcomp(test_imputed, scale. = TRUE)
# Add the PCA components to the testing set
test_imputed <- cbind(test_imputed, pca_test$x)
# Predict SalePrice with the single-variable (TotalBsmtSF) model
predictions <- predict(model, test_imputed)
# Calculate the RMSE
rmse_model1 <- sqrt(mean((test_imputed$SalePrice - predictions)^2))

# Predict with the model using all the numeric variables
predictions <- predict(model_all, test_imputed)
# Calculate the RMSE
rmse_model_all <- sqrt(mean((test_imputed$SalePrice - predictions)^2))

# Predict with the model using the top correlated variables
predictions <- predict(model_top, test_imputed)
# Calculate the RMSE
rmse_model_top <- sqrt(mean((test_imputed$SalePrice - predictions)^2))

# Predict with the PCA 80 model
predictions <- predict(model_pca_80, test_imputed)
# Calculate the RMSE
rmse_model_pca_80 <- sqrt(mean((test_imputed$SalePrice - predictions)^2))

# Predict with the PCA 90 model
predictions <- predict(model_pca_90, test_imputed)
# Calculate the RMSE
rmse_model_pca_90 <- sqrt(mean((test_imputed$SalePrice - predictions)^2))

# Predict with the PCA 95 model
predictions <- predict(model_pca_95, test_imputed)
# Calculate the RMSE
rmse_model_pca_95 <- sqrt(mean((test_imputed$SalePrice - predictions)^2))

# Predict with the PCA 99 model
predictions <- predict(model_pca_99, test_imputed)
# Calculate the RMSE
rmse_model_pca_99 <- sqrt(mean((test_imputed$SalePrice - predictions)^2))
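
The repeated predict-then-RMSE pattern above could also be written once as a helper and applied over a named list of models. This is a sketch under assumptions, not the code used for the results in this report: the helper rmse() and the two mtcars models are hypothetical stand-ins.

```r
# Root mean squared error between observed and predicted values.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Illustrative split of a built-in dataset.
set.seed(1)
idx <- sample(nrow(mtcars), 24)
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# Named list of candidate models fitted on the training rows.
models <- list(
  wt_only = lm(mpg ~ wt, data = train),
  wt_hp   = lm(mpg ~ wt + hp, data = train)
)

# Test RMSE plus in-sample R-squared and AIC for every model.
results <- data.frame(
  Model     = names(models),
  RMSE      = sapply(models, function(m) rmse(test$mpg, predict(m, newdata = test))),
  R_squared = sapply(models, function(m) summary(m)$r.squared),
  AIC       = sapply(models, AIC),
  row.names = NULL
)
results
```

Keeping the models in one list means a new model only has to be added in one place to appear in the comparison table.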

# Combine each model's RMSE, R-squared, and AIC into one data frame
model_results <- data.frame(Model = c("TotalBsmtSF", "All Numeric Variables", "Top Correlated Variables", "PCA 80", "PCA 90", "PCA 95", "PCA 99"),
                             RMSE = c(rmse_model1, rmse_model_all, rmse_model_top, rmse_model_pca_80, rmse_model_pca_90, rmse_model_pca_95, rmse_model_pca_99),
                             R_squared = c(summary(model)$r.squared, summary(model_all)$r.squared, summary(model_top)$r.squared, summary(model_pca_80)$r.squared, summary(model_pca_90)$r.squared, summary(model_pca_95)$r.squared, summary(model_pca_99)$r.squared),
                             AIC = c(AIC(model), AIC(model_all), AIC(model_top), AIC(model_pca_80), AIC(model_pca_90), AIC(model_pca_95), AIC(model_pca_99)))

model_results
##                      Model      RMSE R_squared      AIC
## 1              TotalBsmtSF  61202.28 0.4116120 29107.16
## 2    All Numeric Variables  27467.64 0.8700424 27409.76
## 3 Top Correlated Variables  33713.83 0.8090931 27801.33
## 4                   PCA 80 147295.33 0.8716175 27361.50
## 5                   PCA 90 147595.97 0.8795522 27296.92
## 6                   PCA 95 148919.64 0.9091219 26975.62
## 7                   PCA 99 150560.87 0.9448317 26404.14

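The table above shows that including more than the PCA components covering the top 80% of the variance does not improve test performance. R-squared and AIC keep improving as components are added, but both are computed on the training data, while the test RMSE rises from 147,295.33 (PCA 80) to 150,560.87 (PCA 99). This allows us to eliminate the extra PCA models. The model with the lowest RMSE was the model using all the numeric variables (27,467.64), followed by the model using the top correlated variables (33,713.83), which needs only six predictors. The lowest AIC (26,404.14) and the highest R-squared (0.9448) both belong to the PCA 99 model, but every PCA model has a test RMSE of roughly 147,000 to 151,000, far worse than its training fit suggests. I will work on improving three of these models (all numeric variables, top correlated variables, and PCA) and see which we can improve the most.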

We start by applying backward elimination to the model that uses all the numeric variables. This simplifies what is likely the most complicated model, the one with the most variables. Because it can drop redundant predictors, it may also reduce multicollinearity and with it the need for PCA.
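
As a minimal sketch of how backward elimination with step() behaves, the example below runs it on the built-in mtcars data rather than the housing data; the names full and reduced are hypothetical. It also shows why the AIC printed by step() differs from the value returned by AIC(): step() uses extractAIC(), which for a linear model omits the constant n * (1 + log(2 * pi)) + 2. With n = 1169 that constant is about 3319.48, which matches the gap between 27409.76 in the results table and the 24090.28 at the start of the step() trace.

```r
# Full model on an illustrative built-in dataset.
full <- lm(mpg ~ ., data = mtcars)

# Backward elimination by AIC; trace = 0 suppresses the step-by-step printout.
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)

# AIC() and extractAIC() differ by a constant that depends only on n.
n <- nrow(mtcars)
offset <- n * (1 + log(2 * pi)) + 2
aic_full <- AIC(full)
step_aic_full <- extractAIC(full)[2]
all.equal(aic_full - step_aic_full, offset)
```

Since the offset is the same for every model fitted to the same rows, both versions of AIC rank the candidate models identically.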

# Perform backwards elimination on the model using all the numeric variables
model_all_backwards <- step(model_all)
## Start:  AIC=24090.28
## SalePrice ~ Id + MSSubClass + LotFrontage + LotArea + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtFinSF2 + BsmtUnfSF + TotalBsmtSF + X1stFlrSF + X2ndFlrSF + 
##     LowQualFinSF + GrLivArea + BsmtFullBath + BsmtHalfBath + 
##     FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + 
##     Fireplaces + GarageYrBlt + GarageCars + GarageArea + WoodDeckSF + 
##     OpenPorchSF + EnclosedPorch + X3SsnPorch + ScreenPorch + 
##     PoolArea + MiscVal + MoSold + YrSold
## 
## 
## Step:  AIC=24090.28
## SalePrice ~ Id + MSSubClass + LotFrontage + LotArea + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtFinSF2 + BsmtUnfSF + TotalBsmtSF + X1stFlrSF + X2ndFlrSF + 
##     LowQualFinSF + BsmtFullBath + BsmtHalfBath + FullBath + HalfBath + 
##     BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces + 
##     GarageYrBlt + GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + 
##     EnclosedPorch + X3SsnPorch + ScreenPorch + PoolArea + MiscVal + 
##     MoSold + YrSold
## 
## 
## Step:  AIC=24090.28
## SalePrice ~ Id + MSSubClass + LotFrontage + LotArea + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + 
##     BsmtFullBath + BsmtHalfBath + FullBath + HalfBath + BedroomAbvGr + 
##     KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageYrBlt + 
##     GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch + 
##     X3SsnPorch + ScreenPorch + PoolArea + MiscVal + MoSold + 
##     YrSold
## 
##                 Df  Sum of Sq        RSS   AIC
## - Id             1 1.0760e+05 9.7911e+11 24088
## - Fireplaces     1 2.2312e+05 9.7911e+11 24088
## - X3SsnPorch     1 2.4093e+05 9.7911e+11 24088
## - GarageYrBlt    1 9.8503e+06 9.7912e+11 24088
## - BsmtHalfBath   1 4.7025e+07 9.7916e+11 24088
## - MoSold         1 1.4271e+08 9.7925e+11 24089
## - YrSold         1 1.8780e+08 9.7930e+11 24089
## - BsmtFullBath   1 2.0703e+08 9.7932e+11 24089
## - GarageCars     1 4.8430e+08 9.7959e+11 24089
## - EnclosedPorch  1 6.5397e+08 9.7976e+11 24089
## - MiscVal        1 7.8292e+08 9.7989e+11 24089
## - LotFrontage    1 1.0891e+09 9.8020e+11 24090
## - FullBath       1 1.0999e+09 9.8021e+11 24090
## - ScreenPorch    1 1.4167e+09 9.8053e+11 24090
## - OpenPorchSF    1 1.4221e+09 9.8053e+11 24090
## - LowQualFinSF   1 1.4226e+09 9.8053e+11 24090
## <none>                        9.7911e+11 24090
## - GarageArea     1 1.7280e+09 9.8084e+11 24090
## - WoodDeckSF     1 4.0035e+09 9.8311e+11 24093
## - KitchenAbvGr   1 4.4923e+09 9.8360e+11 24094
## - YearRemodAdd   1 5.1843e+09 9.8429e+11 24095
## - TotRmsAbvGrd   1 5.3478e+09 9.8446e+11 24095
## - HalfBath       1 6.1897e+09 9.8530e+11 24096
## - PoolArea       1 6.2311e+09 9.8534e+11 24096
## - BsmtFinSF2     1 6.5422e+09 9.8565e+11 24096
## - OverallCond    1 2.0970e+10 1.0001e+12 24113
## - MSSubClass     1 2.2988e+10 1.0021e+12 24115
## - YearBuilt      1 2.4047e+10 1.0032e+12 24117
## - LotArea        1 2.4070e+10 1.0032e+12 24117
## - BsmtUnfSF      1 2.4254e+10 1.0034e+12 24117
## - MasVnrArea     1 2.9199e+10 1.0083e+12 24123
## - BedroomAbvGr   1 6.1854e+10 1.0410e+12 24160
## - BsmtFinSF1     1 7.2542e+10 1.0517e+12 24172
## - X1stFlrSF      1 1.3453e+11 1.1136e+12 24239
## - OverallQual    1 1.5387e+11 1.1330e+12 24259
## - X2ndFlrSF      1 2.3698e+11 1.2161e+12 24342
## 
## Step:  AIC=24088.28
## SalePrice ~ MSSubClass + LotFrontage + LotArea + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + 
##     BsmtFullBath + BsmtHalfBath + FullBath + HalfBath + BedroomAbvGr + 
##     KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageYrBlt + 
##     GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch + 
##     X3SsnPorch + ScreenPorch + PoolArea + MiscVal + MoSold + 
##     YrSold
## 
##                 Df  Sum of Sq        RSS   AIC
## - X3SsnPorch     1 2.2817e+05 9.7911e+11 24086
## - Fireplaces     1 2.2972e+05 9.7911e+11 24086
## - GarageYrBlt    1 9.9144e+06 9.7912e+11 24086
## - BsmtHalfBath   1 4.7023e+07 9.7916e+11 24086
## - MoSold         1 1.4315e+08 9.7925e+11 24087
## - YrSold         1 1.8791e+08 9.7930e+11 24087
## - BsmtFullBath   1 2.0693e+08 9.7932e+11 24087
## - GarageCars     1 4.8419e+08 9.7959e+11 24087
## - EnclosedPorch  1 6.5389e+08 9.7976e+11 24087
## - MiscVal        1 7.8291e+08 9.7989e+11 24087
## - LotFrontage    1 1.0908e+09 9.8020e+11 24088
## - FullBath       1 1.1002e+09 9.8021e+11 24088
## - ScreenPorch    1 1.4169e+09 9.8053e+11 24088
## - OpenPorchSF    1 1.4225e+09 9.8053e+11 24088
## - LowQualFinSF   1 1.4285e+09 9.8054e+11 24088
## <none>                        9.7911e+11 24088
## - GarageArea     1 1.7281e+09 9.8084e+11 24088
## - WoodDeckSF     1 4.0133e+09 9.8312e+11 24091
## - KitchenAbvGr   1 4.5008e+09 9.8361e+11 24092
## - YearRemodAdd   1 5.1884e+09 9.8430e+11 24093
## - TotRmsAbvGrd   1 5.3483e+09 9.8446e+11 24093
## - HalfBath       1 6.2064e+09 9.8532e+11 24094
## - PoolArea       1 6.2525e+09 9.8536e+11 24094
## - BsmtFinSF2     1 6.5558e+09 9.8567e+11 24094
## - OverallCond    1 2.0995e+10 1.0001e+12 24111
## - MSSubClass     1 2.2997e+10 1.0021e+12 24113
## - YearBuilt      1 2.4050e+10 1.0032e+12 24115
## - LotArea        1 2.4098e+10 1.0032e+12 24115
## - BsmtUnfSF      1 2.4271e+10 1.0034e+12 24115
## - MasVnrArea     1 2.9284e+10 1.0084e+12 24121
## - BedroomAbvGr   1 6.1931e+10 1.0410e+12 24158
## - BsmtFinSF1     1 7.2623e+10 1.0517e+12 24170
## - X1stFlrSF      1 1.3503e+11 1.1141e+12 24237
## - OverallQual    1 1.5439e+11 1.1335e+12 24257
## - X2ndFlrSF      1 2.3702e+11 1.2161e+12 24340
## 
## Step:  AIC=24086.28
## SalePrice ~ MSSubClass + LotFrontage + LotArea + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + 
##     BsmtFullBath + BsmtHalfBath + FullBath + HalfBath + BedroomAbvGr + 
##     KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageYrBlt + 
##     GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch + 
##     ScreenPorch + PoolArea + MiscVal + MoSold + YrSold
## 
##                 Df  Sum of Sq        RSS   AIC
## - Fireplaces     1 2.3774e+05 9.7911e+11 24084
## - GarageYrBlt    1 9.9313e+06 9.7912e+11 24084
## - BsmtHalfBath   1 4.7199e+07 9.7916e+11 24084
## - MoSold         1 1.4381e+08 9.7925e+11 24085
## - YrSold         1 1.8865e+08 9.7930e+11 24085
## - BsmtFullBath   1 2.0681e+08 9.7932e+11 24085
## - GarageCars     1 4.8397e+08 9.7959e+11 24085
## - EnclosedPorch  1 6.5409e+08 9.7976e+11 24085
## - MiscVal        1 7.8306e+08 9.7989e+11 24085
## - LotFrontage    1 1.0978e+09 9.8021e+11 24086
## - FullBath       1 1.1024e+09 9.8021e+11 24086
## - ScreenPorch    1 1.4207e+09 9.8053e+11 24086
## - OpenPorchSF    1 1.4257e+09 9.8054e+11 24086
## - LowQualFinSF   1 1.4283e+09 9.8054e+11 24086
## <none>                        9.7911e+11 24086
## - GarageArea     1 1.7304e+09 9.8084e+11 24086
## - WoodDeckSF     1 4.0290e+09 9.8314e+11 24089
## - KitchenAbvGr   1 4.5025e+09 9.8361e+11 24090
## - YearRemodAdd   1 5.1882e+09 9.8430e+11 24091
## - TotRmsAbvGrd   1 5.3556e+09 9.8447e+11 24091
## - HalfBath       1 6.2082e+09 9.8532e+11 24092
## - PoolArea       1 6.2534e+09 9.8536e+11 24092
## - BsmtFinSF2     1 6.5683e+09 9.8568e+11 24092
## - OverallCond    1 2.1005e+10 1.0001e+12 24109
## - MSSubClass     1 2.3000e+10 1.0021e+12 24111
## - YearBuilt      1 2.4056e+10 1.0032e+12 24113
## - LotArea        1 2.4104e+10 1.0032e+12 24113
## - BsmtUnfSF      1 2.4271e+10 1.0034e+12 24113
## - MasVnrArea     1 2.9286e+10 1.0084e+12 24119
## - BedroomAbvGr   1 6.2029e+10 1.0411e+12 24156
## - BsmtFinSF1     1 7.2626e+10 1.0517e+12 24168
## - X1stFlrSF      1 1.3534e+11 1.1144e+12 24236
## - OverallQual    1 1.5441e+11 1.1335e+12 24256
## - X2ndFlrSF      1 2.3711e+11 1.2162e+12 24338
## 
## Step:  AIC=24084.28
## SalePrice ~ MSSubClass + LotFrontage + LotArea + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + 
##     BsmtFullBath + BsmtHalfBath + FullBath + HalfBath + BedroomAbvGr + 
##     KitchenAbvGr + TotRmsAbvGrd + GarageYrBlt + GarageCars + 
##     GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch + ScreenPorch + 
##     PoolArea + MiscVal + MoSold + YrSold
## 
##                 Df  Sum of Sq        RSS   AIC
## - GarageYrBlt    1 1.0335e+07 9.7912e+11 24082
## - BsmtHalfBath   1 4.7089e+07 9.7916e+11 24082
## - MoSold         1 1.4360e+08 9.7925e+11 24083
## - YrSold         1 1.8850e+08 9.7930e+11 24083
## - BsmtFullBath   1 2.0696e+08 9.7932e+11 24083
## - GarageCars     1 4.9098e+08 9.7960e+11 24083
## - EnclosedPorch  1 6.5503e+08 9.7977e+11 24083
## - MiscVal        1 7.8282e+08 9.7989e+11 24083
## - LotFrontage    1 1.1005e+09 9.8021e+11 24084
## - FullBath       1 1.1045e+09 9.8022e+11 24084
## - OpenPorchSF    1 1.4281e+09 9.8054e+11 24084
## - LowQualFinSF   1 1.4281e+09 9.8054e+11 24084
## - ScreenPorch    1 1.4491e+09 9.8056e+11 24084
## <none>                        9.7911e+11 24084
## - GarageArea     1 1.7401e+09 9.8085e+11 24084
## - WoodDeckSF     1 4.0549e+09 9.8317e+11 24087
## - KitchenAbvGr   1 4.5869e+09 9.8370e+11 24088
## - YearRemodAdd   1 5.2048e+09 9.8432e+11 24089
## - TotRmsAbvGrd   1 5.3564e+09 9.8447e+11 24089
## - HalfBath       1 6.2229e+09 9.8533e+11 24090
## - PoolArea       1 6.2539e+09 9.8536e+11 24090
## - BsmtFinSF2     1 6.5733e+09 9.8568e+11 24090
## - OverallCond    1 2.1011e+10 1.0001e+12 24107
## - MSSubClass     1 2.3030e+10 1.0021e+12 24110
## - YearBuilt      1 2.4057e+10 1.0032e+12 24111
## - BsmtUnfSF      1 2.4368e+10 1.0035e+12 24111
## - LotArea        1 2.4463e+10 1.0036e+12 24111
## - MasVnrArea     1 2.9287e+10 1.0084e+12 24117
## - BedroomAbvGr   1 6.2722e+10 1.0418e+12 24155
## - BsmtFinSF1     1 7.2653e+10 1.0518e+12 24166
## - X1stFlrSF      1 1.4354e+11 1.1226e+12 24242
## - OverallQual    1 1.5753e+11 1.1366e+12 24257
## - X2ndFlrSF      1 2.4040e+11 1.2195e+12 24339
## 
## Step:  AIC=24082.29
## SalePrice ~ MSSubClass + LotFrontage + LotArea + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + 
##     BsmtFullBath + BsmtHalfBath + FullBath + HalfBath + BedroomAbvGr + 
##     KitchenAbvGr + TotRmsAbvGrd + GarageCars + GarageArea + WoodDeckSF + 
##     OpenPorchSF + EnclosedPorch + ScreenPorch + PoolArea + MiscVal + 
##     MoSold + YrSold
## 
##                 Df  Sum of Sq        RSS   AIC
## - BsmtHalfBath   1 4.6204e+07 9.7917e+11 24080
## - MoSold         1 1.4295e+08 9.7926e+11 24081
## - YrSold         1 1.8891e+08 9.7931e+11 24081
## - BsmtFullBath   1 2.0933e+08 9.7933e+11 24081
## - GarageCars     1 4.8691e+08 9.7961e+11 24081
## - EnclosedPorch  1 6.5544e+08 9.7978e+11 24081
## - MiscVal        1 7.8220e+08 9.7990e+11 24081
## - FullBath       1 1.1168e+09 9.8024e+11 24082
## - LotFrontage    1 1.1313e+09 9.8025e+11 24082
## - LowQualFinSF   1 1.4204e+09 9.8054e+11 24082
## - OpenPorchSF    1 1.4252e+09 9.8055e+11 24082
## - ScreenPorch    1 1.4565e+09 9.8058e+11 24082
## <none>                        9.7912e+11 24082
## - GarageArea     1 1.8013e+09 9.8092e+11 24082
## - WoodDeckSF     1 4.0523e+09 9.8317e+11 24085
## - KitchenAbvGr   1 4.5766e+09 9.8370e+11 24086
## - TotRmsAbvGrd   1 5.3475e+09 9.8447e+11 24087
## - YearRemodAdd   1 5.3901e+09 9.8451e+11 24087
## - HalfBath       1 6.2129e+09 9.8533e+11 24088
## - PoolArea       1 6.2820e+09 9.8540e+11 24088
## - BsmtFinSF2     1 6.5733e+09 9.8569e+11 24088
## - OverallCond    1 2.1183e+10 1.0003e+12 24105
## - MSSubClass     1 2.3113e+10 1.0022e+12 24108
## - BsmtUnfSF      1 2.4366e+10 1.0035e+12 24109
## - LotArea        1 2.4520e+10 1.0036e+12 24109
## - MasVnrArea     1 2.9422e+10 1.0085e+12 24115
## - YearBuilt      1 3.2040e+10 1.0112e+12 24118
## - BedroomAbvGr   1 6.2713e+10 1.0418e+12 24153
## - BsmtFinSF1     1 7.2706e+10 1.0518e+12 24164
## - X1stFlrSF      1 1.4483e+11 1.1240e+12 24242
## - OverallQual    1 1.5752e+11 1.1366e+12 24255
## - X2ndFlrSF      1 2.4108e+11 1.2202e+12 24338
## 
## Step:  AIC=24080.35
## SalePrice ~ MSSubClass + LotFrontage + LotArea + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + 
##     BsmtFullBath + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + 
##     TotRmsAbvGrd + GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + 
##     EnclosedPorch + ScreenPorch + PoolArea + MiscVal + MoSold + 
##     YrSold
## 
##                 Df  Sum of Sq        RSS   AIC
## - MoSold         1 1.4456e+08 9.7931e+11 24079
## - YrSold         1 1.8021e+08 9.7935e+11 24079
## - BsmtFullBath   1 2.9603e+08 9.7946e+11 24079
## - GarageCars     1 4.8616e+08 9.7965e+11 24079
## - EnclosedPorch  1 6.6299e+08 9.7983e+11 24079
## - MiscVal        1 7.7039e+08 9.7994e+11 24079
## - FullBath       1 1.0886e+09 9.8026e+11 24080
## - LotFrontage    1 1.1102e+09 9.8028e+11 24080
## - LowQualFinSF   1 1.4231e+09 9.8059e+11 24080
## - OpenPorchSF    1 1.4278e+09 9.8059e+11 24080
## - ScreenPorch    1 1.4521e+09 9.8062e+11 24080
## <none>                        9.7917e+11 24080
## - GarageArea     1 1.8071e+09 9.8097e+11 24081
## - WoodDeckSF     1 4.0239e+09 9.8319e+11 24083
## - KitchenAbvGr   1 4.5612e+09 9.8373e+11 24084
## - YearRemodAdd   1 5.3545e+09 9.8452e+11 24085
## - TotRmsAbvGrd   1 5.3805e+09 9.8455e+11 24085
## - HalfBath       1 6.1830e+09 9.8535e+11 24086
## - PoolArea       1 6.2674e+09 9.8543e+11 24086
## - BsmtFinSF2     1 6.5314e+09 9.8570e+11 24086
## - OverallCond    1 2.1178e+10 1.0003e+12 24103
## - MSSubClass     1 2.3375e+10 1.0025e+12 24106
## - BsmtUnfSF      1 2.4384e+10 1.0036e+12 24107
## - LotArea        1 2.4475e+10 1.0036e+12 24107
## - MasVnrArea     1 2.9385e+10 1.0086e+12 24113
## - YearBuilt      1 3.1997e+10 1.0112e+12 24116
## - BedroomAbvGr   1 6.3496e+10 1.0427e+12 24152
## - BsmtFinSF1     1 7.3768e+10 1.0529e+12 24163
## - X1stFlrSF      1 1.4491e+11 1.1241e+12 24240
## - OverallQual    1 1.5791e+11 1.1371e+12 24253
## - X2ndFlrSF      1 2.4126e+11 1.2204e+12 24336
## 
## Step:  AIC=24078.52
## SalePrice ~ MSSubClass + LotFrontage + LotArea + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + 
##     BsmtFullBath + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + 
##     TotRmsAbvGrd + GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + 
##     EnclosedPorch + ScreenPorch + PoolArea + MiscVal + YrSold
## 
##                 Df  Sum of Sq        RSS   AIC
## - YrSold         1 1.4199e+08 9.7945e+11 24077
## - BsmtFullBath   1 3.0597e+08 9.7962e+11 24077
## - GarageCars     1 4.8538e+08 9.7980e+11 24077
## - EnclosedPorch  1 6.3805e+08 9.7995e+11 24077
## - MiscVal        1 7.6489e+08 9.8008e+11 24077
## - LotFrontage    1 1.0837e+09 9.8040e+11 24078
## - FullBath       1 1.0959e+09 9.8041e+11 24078
## - OpenPorchSF    1 1.3977e+09 9.8071e+11 24078
## - ScreenPorch    1 1.4338e+09 9.8075e+11 24078
## - LowQualFinSF   1 1.4473e+09 9.8076e+11 24078
## <none>                        9.7931e+11 24079
## - GarageArea     1 1.8358e+09 9.8115e+11 24079
## - WoodDeckSF     1 4.0024e+09 9.8331e+11 24081
## - KitchenAbvGr   1 4.6376e+09 9.8395e+11 24082
## - YearRemodAdd   1 5.3462e+09 9.8466e+11 24083
## - TotRmsAbvGrd   1 5.5200e+09 9.8483e+11 24083
## - HalfBath       1 6.1500e+09 9.8546e+11 24084
## - PoolArea       1 6.3592e+09 9.8567e+11 24084
## - BsmtFinSF2     1 6.5866e+09 9.8590e+11 24084
## - OverallCond    1 2.1147e+10 1.0005e+12 24102
## - MSSubClass     1 2.3338e+10 1.0026e+12 24104
## - BsmtUnfSF      1 2.4574e+10 1.0039e+12 24106
## - LotArea        1 2.4581e+10 1.0039e+12 24106
## - MasVnrArea     1 2.9398e+10 1.0087e+12 24111
## - YearBuilt      1 3.2019e+10 1.0113e+12 24114
## - BedroomAbvGr   1 6.3950e+10 1.0433e+12 24151
## - BsmtFinSF1     1 7.3922e+10 1.0532e+12 24162
## - X1stFlrSF      1 1.4479e+11 1.1241e+12 24238
## - OverallQual    1 1.5783e+11 1.1371e+12 24251
## - X2ndFlrSF      1 2.4112e+11 1.2204e+12 24334
## 
## Step:  AIC=24076.69
## SalePrice ~ MSSubClass + LotFrontage + LotArea + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + 
##     BsmtFullBath + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + 
##     TotRmsAbvGrd + GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + 
##     EnclosedPorch + ScreenPorch + PoolArea + MiscVal
## 
##                 Df  Sum of Sq        RSS   AIC
## - BsmtFullBath   1 2.7259e+08 9.7973e+11 24075
## - GarageCars     1 5.0704e+08 9.7996e+11 24075
## - EnclosedPorch  1 6.3891e+08 9.8009e+11 24076
## - MiscVal        1 7.6499e+08 9.8022e+11 24076
## - LotFrontage    1 1.0513e+09 9.8050e+11 24076
## - FullBath       1 1.1057e+09 9.8056e+11 24076
## - ScreenPorch    1 1.4087e+09 9.8086e+11 24076
## - OpenPorchSF    1 1.4253e+09 9.8088e+11 24076
## - LowQualFinSF   1 1.4702e+09 9.8092e+11 24076
## <none>                        9.7945e+11 24077
## - GarageArea     1 1.8249e+09 9.8128e+11 24077
## - WoodDeckSF     1 3.9721e+09 9.8343e+11 24079
## - KitchenAbvGr   1 4.7314e+09 9.8419e+11 24080
## - YearRemodAdd   1 5.2827e+09 9.8474e+11 24081
## - TotRmsAbvGrd   1 5.5575e+09 9.8501e+11 24081
## - HalfBath       1 6.1792e+09 9.8563e+11 24082
## - PoolArea       1 6.5283e+09 9.8598e+11 24083
## - BsmtFinSF2     1 6.5619e+09 9.8602e+11 24083
## - OverallCond    1 2.1120e+10 1.0006e+12 24100
## - MSSubClass     1 2.3263e+10 1.0027e+12 24102
## - LotArea        1 2.4653e+10 1.0041e+12 24104
## - BsmtUnfSF      1 2.4679e+10 1.0041e+12 24104
## - MasVnrArea     1 2.9408e+10 1.0089e+12 24109
## - YearBuilt      1 3.2013e+10 1.0115e+12 24112
## - BedroomAbvGr   1 6.3818e+10 1.0433e+12 24149
## - BsmtFinSF1     1 7.4470e+10 1.0539e+12 24160
## - X1stFlrSF      1 1.4483e+11 1.1243e+12 24236
## - OverallQual    1 1.5772e+11 1.1372e+12 24249
## - X2ndFlrSF      1 2.4108e+11 1.2205e+12 24332
## 
## Step:  AIC=24075.02
## SalePrice ~ MSSubClass + LotFrontage + LotArea + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + 
##     FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + 
##     GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch + 
##     ScreenPorch + PoolArea + MiscVal
## 
##                 Df  Sum of Sq        RSS   AIC
## - GarageCars     1 5.1100e+08 9.8024e+11 24074
## - EnclosedPorch  1 6.0998e+08 9.8034e+11 24074
## - MiscVal        1 8.1104e+08 9.8054e+11 24074
## - LotFrontage    1 1.0507e+09 9.8078e+11 24074
## - FullBath       1 1.2665e+09 9.8099e+11 24075
## - ScreenPorch    1 1.4168e+09 9.8114e+11 24075
## - LowQualFinSF   1 1.4619e+09 9.8119e+11 24075
## - OpenPorchSF    1 1.4728e+09 9.8120e+11 24075
## <none>                        9.7973e+11 24075
## - GarageArea     1 1.8496e+09 9.8158e+11 24075
## - WoodDeckSF     1 4.1210e+09 9.8385e+11 24078
## - KitchenAbvGr   1 4.6559e+09 9.8438e+11 24079
## - YearRemodAdd   1 5.5699e+09 9.8530e+11 24080
## - TotRmsAbvGrd   1 5.6451e+09 9.8537e+11 24080
## - HalfBath       1 6.3062e+09 9.8603e+11 24081
## - PoolArea       1 6.5583e+09 9.8628e+11 24081
## - BsmtFinSF2     1 7.4581e+09 9.8718e+11 24082
## - OverallCond    1 2.0915e+10 1.0006e+12 24098
## - MSSubClass     1 2.3001e+10 1.0027e+12 24100
## - BsmtUnfSF      1 2.4672e+10 1.0044e+12 24102
## - LotArea        1 2.4965e+10 1.0047e+12 24102
## - MasVnrArea     1 2.9163e+10 1.0089e+12 24107
## - YearBuilt      1 3.2275e+10 1.0120e+12 24111
## - BedroomAbvGr   1 6.3859e+10 1.0436e+12 24147
## - BsmtFinSF1     1 9.4538e+10 1.0743e+12 24181
## - X1stFlrSF      1 1.4455e+11 1.1243e+12 24234
## - OverallQual    1 1.5758e+11 1.1373e+12 24247
## - X2ndFlrSF      1 2.4082e+11 1.2205e+12 24330
## 
## Step:  AIC=24073.63
## SalePrice ~ MSSubClass + LotFrontage + LotArea + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + 
##     FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + 
##     GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch + ScreenPorch + 
##     PoolArea + MiscVal
## 
##                 Df  Sum of Sq        RSS   AIC
## - EnclosedPorch  1 6.3871e+08 9.8088e+11 24072
## - MiscVal        1 8.6575e+08 9.8110e+11 24073
## - LotFrontage    1 1.0372e+09 9.8127e+11 24073
## - FullBath       1 1.1803e+09 9.8142e+11 24073
## - LowQualFinSF   1 1.3410e+09 9.8158e+11 24073
## - OpenPorchSF    1 1.3559e+09 9.8159e+11 24073
## - ScreenPorch    1 1.4518e+09 9.8169e+11 24073
## <none>                        9.8024e+11 24074
## - WoodDeckSF     1 4.1121e+09 9.8435e+11 24077
## - KitchenAbvGr   1 4.6725e+09 9.8491e+11 24077
## - YearRemodAdd   1 5.6542e+09 9.8589e+11 24078
## - TotRmsAbvGrd   1 5.7904e+09 9.8603e+11 24079
## - HalfBath       1 6.1375e+09 9.8637e+11 24079
## - PoolArea       1 6.5401e+09 9.8678e+11 24079
## - BsmtFinSF2     1 7.2690e+09 9.8751e+11 24080
## - GarageArea     1 1.0144e+10 9.9038e+11 24084
## - OverallCond    1 2.0750e+10 1.0010e+12 24096
## - MSSubClass     1 2.2828e+10 1.0031e+12 24099
## - BsmtUnfSF      1 2.4506e+10 1.0047e+12 24101
## - LotArea        1 2.5329e+10 1.0056e+12 24101
## - MasVnrArea     1 2.9067e+10 1.0093e+12 24106
## - YearBuilt      1 3.3344e+10 1.0136e+12 24111
## - BedroomAbvGr   1 6.3998e+10 1.0442e+12 24146
## - BsmtFinSF1     1 9.4043e+10 1.0743e+12 24179
## - X1stFlrSF      1 1.4524e+11 1.1255e+12 24233
## - OverallQual    1 1.6090e+11 1.1411e+12 24249
## - X2ndFlrSF      1 2.4078e+11 1.2210e+12 24328
## 
## Step:  AIC=24072.39
## SalePrice ~ MSSubClass + LotFrontage + LotArea + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + 
##     FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + 
##     GarageArea + WoodDeckSF + OpenPorchSF + ScreenPorch + PoolArea + 
##     MiscVal
## 
##                Df  Sum of Sq        RSS   AIC
## - MiscVal       1 8.9586e+08 9.8177e+11 24072
## - LotFrontage   1 9.7591e+08 9.8185e+11 24072
## - FullBath      1 1.1984e+09 9.8207e+11 24072
## - LowQualFinSF  1 1.3741e+09 9.8225e+11 24072
## - OpenPorchSF   1 1.4508e+09 9.8233e+11 24072
## <none>                       9.8088e+11 24072
## - ScreenPorch   1 1.7835e+09 9.8266e+11 24073
## - WoodDeckSF    1 4.4151e+09 9.8529e+11 24076
## - KitchenAbvGr  1 4.6344e+09 9.8551e+11 24076
## - YearRemodAdd  1 5.4613e+09 9.8634e+11 24077
## - TotRmsAbvGrd  1 5.9966e+09 9.8687e+11 24078
## - HalfBath      1 6.0607e+09 9.8694e+11 24078
## - PoolArea      1 6.2921e+09 9.8717e+11 24078
## - BsmtFinSF2    1 7.0914e+09 9.8797e+11 24079
## - GarageArea    1 1.0016e+10 9.9089e+11 24082
## - OverallCond   1 2.1989e+10 1.0029e+12 24096
## - MSSubClass    1 2.2738e+10 1.0036e+12 24097
## - BsmtUnfSF     1 2.4262e+10 1.0051e+12 24099
## - LotArea       1 2.5707e+10 1.0066e+12 24101
## - MasVnrArea    1 2.9339e+10 1.0102e+12 24105
## - YearBuilt     1 4.1019e+10 1.0219e+12 24118
## - BedroomAbvGr  1 6.3794e+10 1.0447e+12 24144
## - BsmtFinSF1    1 9.3655e+10 1.0745e+12 24177
## - X1stFlrSF     1 1.4497e+11 1.1258e+12 24232
## - OverallQual   1 1.6033e+11 1.1412e+12 24247
## - X2ndFlrSF     1 2.4020e+11 1.2211e+12 24326
## 
## Step:  AIC=24071.45
## SalePrice ~ MSSubClass + LotFrontage + LotArea + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + 
##     FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + 
##     GarageArea + WoodDeckSF + OpenPorchSF + ScreenPorch + PoolArea
## 
##                Df  Sum of Sq        RSS   AIC
## - LotFrontage   1 9.6947e+08 9.8274e+11 24071
## - FullBath      1 1.1728e+09 9.8294e+11 24071
## - LowQualFinSF  1 1.3945e+09 9.8317e+11 24071
## - OpenPorchSF   1 1.4823e+09 9.8325e+11 24071
## <none>                       9.8177e+11 24072
## - ScreenPorch   1 1.6942e+09 9.8347e+11 24072
## - WoodDeckSF    1 4.4369e+09 9.8621e+11 24075
## - KitchenAbvGr  1 4.9784e+09 9.8675e+11 24075
## - YearRemodAdd  1 5.4975e+09 9.8727e+11 24076
## - TotRmsAbvGrd  1 5.7800e+09 9.8755e+11 24076
## - HalfBath      1 6.0322e+09 9.8780e+11 24077
## - PoolArea      1 6.1365e+09 9.8791e+11 24077
## - BsmtFinSF2    1 7.0270e+09 9.8880e+11 24078
## - GarageArea    1 1.0167e+10 9.9194e+11 24082
## - OverallCond   1 2.1471e+10 1.0032e+12 24095
## - MSSubClass    1 2.2454e+10 1.0042e+12 24096
## - BsmtUnfSF     1 2.4034e+10 1.0058e+12 24098
## - LotArea       1 2.5355e+10 1.0071e+12 24099
## - MasVnrArea    1 2.9541e+10 1.0113e+12 24104
## - YearBuilt     1 4.0633e+10 1.0224e+12 24117
## - BedroomAbvGr  1 6.3187e+10 1.0450e+12 24142
## - BsmtFinSF1    1 9.3139e+10 1.0749e+12 24175
## - X1stFlrSF     1 1.4613e+11 1.1279e+12 24232
## - OverallQual   1 1.6083e+11 1.1426e+12 24247
## - X2ndFlrSF     1 2.4004e+11 1.2218e+12 24325
## 
## Step:  AIC=24070.61
## SalePrice ~ MSSubClass + LotArea + OverallQual + OverallCond + 
##     YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 + 
##     BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + FullBath + 
##     HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + GarageArea + 
##     WoodDeckSF + OpenPorchSF + ScreenPorch + PoolArea
## 
##                Df  Sum of Sq        RSS   AIC
## - FullBath      1 1.2151e+09 9.8396e+11 24070
## - OpenPorchSF   1 1.5064e+09 9.8425e+11 24070
## - LowQualFinSF  1 1.5244e+09 9.8427e+11 24070
## - ScreenPorch   1 1.5677e+09 9.8431e+11 24071
## <none>                       9.8274e+11 24071
## - WoodDeckSF    1 4.1882e+09 9.8693e+11 24074
## - KitchenAbvGr  1 4.8071e+09 9.8755e+11 24074
## - YearRemodAdd  1 5.4456e+09 9.8819e+11 24075
## - TotRmsAbvGrd  1 5.8919e+09 9.8863e+11 24076
## - HalfBath      1 6.0559e+09 9.8880e+11 24076
## - PoolArea      1 6.5366e+09 9.8928e+11 24076
## - BsmtFinSF2    1 6.9364e+09 9.8968e+11 24077
## - GarageArea    1 1.1211e+10 9.9395e+11 24082
## - OverallCond   1 2.1911e+10 1.0047e+12 24094
## - BsmtUnfSF     1 2.3447e+10 1.0062e+12 24096
## - MasVnrArea    1 2.9602e+10 1.0123e+12 24103
## - MSSubClass    1 2.9644e+10 1.0124e+12 24103
## - LotArea       1 3.0055e+10 1.0128e+12 24104
## - YearBuilt     1 4.1017e+10 1.0238e+12 24116
## - BedroomAbvGr  1 6.2225e+10 1.0450e+12 24140
## - BsmtFinSF1    1 9.2492e+10 1.0752e+12 24174
## - X1stFlrSF     1 1.5305e+11 1.1358e+12 24238
## - OverallQual   1 1.6069e+11 1.1434e+12 24246
## - X2ndFlrSF     1 2.4324e+11 1.2260e+12 24327
## 
## Step:  AIC=24070.05
## SalePrice ~ MSSubClass + LotArea + OverallQual + OverallCond + 
##     YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 + 
##     BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + HalfBath + 
##     BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + GarageArea + 
##     WoodDeckSF + OpenPorchSF + ScreenPorch + PoolArea
## 
##                Df  Sum of Sq        RSS   AIC
## - OpenPorchSF   1 1.3346e+09 9.8529e+11 24070
## - LowQualFinSF  1 1.4675e+09 9.8542e+11 24070
## - ScreenPorch   1 1.6399e+09 9.8560e+11 24070
## <none>                       9.8396e+11 24070
## - WoodDeckSF    1 4.1964e+09 9.8815e+11 24073
## - HalfBath      1 4.8958e+09 9.8885e+11 24074
## - YearRemodAdd  1 4.9647e+09 9.8892e+11 24074
## - KitchenAbvGr  1 5.7542e+09 9.8971e+11 24075
## - TotRmsAbvGrd  1 5.8498e+09 9.8981e+11 24075
## - PoolArea      1 6.7169e+09 9.9067e+11 24076
## - BsmtFinSF2    1 7.2826e+09 9.9124e+11 24077
## - GarageArea    1 1.1378e+10 9.9533e+11 24082
## - OverallCond   1 2.2538e+10 1.0065e+12 24095
## - BsmtUnfSF     1 2.3626e+10 1.0076e+12 24096
## - LotArea       1 2.9575e+10 1.0135e+12 24103
## - MasVnrArea    1 3.0110e+10 1.0141e+12 24103
## - MSSubClass    1 3.0202e+10 1.0142e+12 24103
## - YearBuilt     1 4.0880e+10 1.0248e+12 24116
## - BedroomAbvGr  1 6.4365e+10 1.0483e+12 24142
## - BsmtFinSF1    1 9.5276e+10 1.0792e+12 24176
## - X1stFlrSF     1 1.5634e+11 1.1403e+12 24240
## - OverallQual   1 1.5951e+11 1.1435e+12 24244
## - X2ndFlrSF     1 2.7446e+11 1.2584e+12 24356
## 
## Step:  AIC=24069.64
## SalePrice ~ MSSubClass + LotArea + OverallQual + OverallCond + 
##     YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 + 
##     BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + HalfBath + 
##     BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + GarageArea + 
##     WoodDeckSF + ScreenPorch + PoolArea
## 
##                Df  Sum of Sq        RSS   AIC
## - LowQualFinSF  1 1.5645e+09 9.8686e+11 24070
## <none>                       9.8529e+11 24070
## - ScreenPorch   1 1.7452e+09 9.8704e+11 24070
## - WoodDeckSF    1 3.9819e+09 9.8927e+11 24072
## - HalfBath      1 4.7083e+09 9.9000e+11 24073
## - YearRemodAdd  1 5.1416e+09 9.9043e+11 24074
## - TotRmsAbvGrd  1 5.7281e+09 9.9102e+11 24074
## - KitchenAbvGr  1 6.0131e+09 9.9130e+11 24075
## - PoolArea      1 6.7659e+09 9.9206e+11 24076
## - BsmtFinSF2    1 7.3686e+09 9.9266e+11 24076
## - GarageArea    1 1.1643e+10 9.9693e+11 24081
## - OverallCond   1 2.2821e+10 1.0081e+12 24094
## - BsmtUnfSF     1 2.4661e+10 1.0100e+12 24097
## - LotArea       1 2.9603e+10 1.0149e+12 24102
## - MasVnrArea    1 2.9628e+10 1.0149e+12 24102
## - MSSubClass    1 3.0414e+10 1.0157e+12 24103
## - YearBuilt     1 4.1262e+10 1.0266e+12 24116
## - BedroomAbvGr  1 6.4487e+10 1.0498e+12 24142
## - BsmtFinSF1    1 9.6950e+10 1.0822e+12 24177
## - X1stFlrSF     1 1.5785e+11 1.1431e+12 24241
## - OverallQual   1 1.6187e+11 1.1472e+12 24246
## - X2ndFlrSF     1 2.8065e+11 1.2659e+12 24361
## 
## Step:  AIC=24069.49
## SalePrice ~ MSSubClass + LotArea + OverallQual + OverallCond + 
##     YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + BsmtFinSF2 + 
##     BsmtUnfSF + X1stFlrSF + X2ndFlrSF + HalfBath + BedroomAbvGr + 
##     KitchenAbvGr + TotRmsAbvGrd + GarageArea + WoodDeckSF + ScreenPorch + 
##     PoolArea
## 
##                Df  Sum of Sq        RSS   AIC
## <none>                       9.8686e+11 24070
## - ScreenPorch   1 1.7909e+09 9.8865e+11 24070
## - WoodDeckSF    1 3.9871e+09 9.9084e+11 24072
## - HalfBath      1 4.8047e+09 9.9166e+11 24073
## - YearRemodAdd  1 5.3353e+09 9.9219e+11 24074
## - KitchenAbvGr  1 6.8138e+09 9.9367e+11 24076
## - TotRmsAbvGrd  1 7.0366e+09 9.9389e+11 24076
## - PoolArea      1 7.2233e+09 9.9408e+11 24076
## - BsmtFinSF2    1 7.4082e+09 9.9426e+11 24076
## - GarageArea    1 1.1531e+10 9.9839e+11 24081
## - OverallCond   1 2.2117e+10 1.0090e+12 24093
## - BsmtUnfSF     1 2.5062e+10 1.0119e+12 24097
## - MasVnrArea    1 2.9075e+10 1.0159e+12 24101
## - MSSubClass    1 2.9449e+10 1.0163e+12 24102
## - LotArea       1 2.9586e+10 1.0164e+12 24102
## - YearBuilt     1 3.9745e+10 1.0266e+12 24114
## - BedroomAbvGr  1 6.4680e+10 1.0515e+12 24142
## - BsmtFinSF1    1 9.7510e+10 1.0844e+12 24178
## - X1stFlrSF     1 1.5643e+11 1.1433e+12 24240
## - OverallQual   1 1.6277e+11 1.1496e+12 24246
## - X2ndFlrSF     1 2.7921e+11 1.2661e+12 24359
# Print the model summary
summary(model_all_backwards)
## 
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotArea + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + HalfBath + 
##     BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + GarageArea + 
##     WoodDeckSF + ScreenPorch + PoolArea, data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -138320  -15519   -1104   14104  216346 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.033e+06  1.073e+05  -9.625  < 2e-16 ***
## MSSubClass   -1.441e+02  2.462e+01  -5.853 6.29e-09 ***
## LotArea       5.140e-01  8.762e-02   5.867 5.81e-09 ***
## OverallQual   1.532e+04  1.114e+03  13.760  < 2e-16 ***
## OverallCond   4.713e+03  9.291e+02   5.072 4.58e-07 ***
## YearBuilt     3.413e+02  5.020e+01   6.800 1.68e-11 ***
## YearRemodAdd  1.535e+02  6.163e+01   2.491 0.012868 *  
## MasVnrArea    3.276e+01  5.633e+00   5.816 7.82e-09 ***
## BsmtFinSF1    4.281e+01  4.019e+00  10.650  < 2e-16 ***
## BsmtFinSF2    1.877e+01  6.394e+00   2.936 0.003395 ** 
## BsmtUnfSF     2.107e+01  3.903e+00   5.399 8.12e-08 ***
## X1stFlrSF     6.942e+01  5.146e+00  13.490  < 2e-16 ***
## X2ndFlrSF     7.630e+01  4.234e+00  18.022  < 2e-16 ***
## HalfBath     -5.536e+03  2.342e+03  -2.364 0.018236 *  
## BedroomAbvGr -1.371e+04  1.581e+03  -8.674  < 2e-16 ***
## KitchenAbvGr -1.326e+04  4.710e+03  -2.815 0.004955 ** 
## TotRmsAbvGrd  3.327e+03  1.163e+03   2.861 0.004299 ** 
## GarageArea    1.992e+01  5.440e+00   3.663 0.000261 ***
## WoodDeckSF    1.596e+01  7.413e+00   2.154 0.031476 *  
## ScreenPorch   2.273e+01  1.575e+01   1.443 0.149192    
## PoolArea      5.952e+01  2.053e+01   2.899 0.003818 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29320 on 1148 degrees of freedom
## Multiple R-squared:  0.869,  Adjusted R-squared:  0.8667 
## F-statistic: 380.8 on 20 and 1148 DF,  p-value: < 2.2e-16
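
As a sanity check on the numbers above, the AIC printed in the elimination trace appears to follow the form that `extractAIC()` uses for linear models, and the adjusted R-squared follows the usual penalty formula. The values plugged in below are read off the output: n = 1169 observations (1148 residual degrees of freedom plus 21 estimated coefficients), k = 21 coefficients, p = 20 predictors, and the final-step RSS of 9.8686e+11. The results are approximate because the printed RSS and R-squared are rounded.

```latex
\begin{align*}
\mathrm{AIC} &= n \ln\!\left(\frac{\mathrm{RSS}}{n}\right) + 2k \\
             &= 1169 \,\ln\!\left(\frac{9.8686 \times 10^{11}}{1169}\right) + 2(21) \\
             &\approx 1169 \times 20.5539 + 42 \approx 24069.5 \\[6pt]
\bar{R}^2    &= 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1} \\
             &= 1 - (1 - 0.869)\,\frac{1168}{1148} \approx 0.8667
\end{align*}
```

Both agree with the reported `AIC=24069.49` and `Adjusted R-squared:  0.8667`.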

This summary is encouraging. The R-squared value is only very slightly lower, while the number of variables has been reduced from 37 to 20. This should help with both overfitting and multicollinearity. Next, I will check the VIF of the model to confirm that multicollinearity is not an issue.

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:purrr':
## 
##     some
# Check VIF of the model
vif(model_all_backwards)
##   MSSubClass      LotArea  OverallQual  OverallCond    YearBuilt YearRemodAdd 
##     1.492273     1.150386     3.185767     1.525783     3.108975     2.196403 
##   MasVnrArea   BsmtFinSF1   BsmtFinSF2    BsmtUnfSF    X1stFlrSF    X2ndFlrSF 
##     1.342669     4.108074     1.439863     4.034166     5.011759     4.686271 
##     HalfBath BedroomAbvGr KitchenAbvGr TotRmsAbvGrd   GarageArea   WoodDeckSF 
##     1.871522     2.297084     1.548946     4.786408     1.816705     1.165230 
##  ScreenPorch     PoolArea 
##     1.071343     1.042236

While much better than before, there are still some variables with VIFs higher than I would like, including one (X1stFlrSF, at 5.01) just barely above 5. However, when I attempted to remove some of these variables, the loss in R-squared was large. Since this model's goal is overall prediction, and not necessarily understanding the relationships between the variables, I will keep the model as is.
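
To make the VIF threshold more concrete, here is a minimal sketch of how a VIF can be computed by hand: regress each predictor on the others and take 1 / (1 - R²) of that auxiliary regression. It uses the built-in `mtcars` data as a stand-in (an assumption; it is not the housing data), and the choice of four predictors is arbitrary.

```r
# Stand-in data: mtcars ships with R. This is NOT the document's train_imputed.
predictors <- mtcars[, c("wt", "hp", "disp", "qsec")]

# For each predictor, regress it on the remaining predictors and
# convert the auxiliary R-squared into a variance inflation factor.
vif_manual <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  aux <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(aux)$r.squared)
})

print(round(vif_manual, 3))
```

Read in reverse, a VIF of 5.01 corresponds to an auxiliary R-squared of about 1 - 1/5.01 ≈ 0.80, meaning roughly 80% of that predictor's variance is explained by the other predictors.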

Next, I will perform a backward elimination manually to see if I reach similar results.

# Fit the full model on all numeric variables as the starting point for manual backward elimination
model_all_manual <- lm(SalePrice ~ ., data = train_imputed)
# Print the model summary
summary(model_all_manual)
## 
## Call:
## lm(formula = SalePrice ~ ., data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -138346  -15955   -1147   14172  217521 
## 
## Coefficients: (2 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -4.300e+05  1.338e+06  -0.321  0.74806    
## Id            -2.309e-02  2.069e+00  -0.011  0.99110    
## MSSubClass    -1.382e+02  2.680e+01  -5.158 2.95e-07 ***
## LotFrontage    5.356e+01  4.771e+01   1.123  0.26185    
## LotArea        4.872e-01  9.231e-02   5.278 1.57e-07 ***
## OverallQual    1.530e+04  1.147e+03  13.344  < 2e-16 ***
## OverallCond    4.706e+03  9.553e+02   4.926 9.64e-07 ***
## YearBuilt      3.544e+02  6.718e+01   5.275 1.59e-07 ***
## YearRemodAdd   1.584e+02  6.469e+01   2.449  0.01446 *  
## MasVnrArea     3.315e+01  5.704e+00   5.813 7.98e-09 ***
## BsmtFinSF1     4.155e+01  4.535e+00   9.162  < 2e-16 ***
## BsmtFinSF2     1.827e+01  6.642e+00   2.751  0.00603 ** 
## BsmtUnfSF      2.097e+01  3.958e+00   5.298 1.41e-07 ***
## TotalBsmtSF           NA         NA      NA       NA    
## X1stFlrSF      7.026e+01  5.631e+00  12.477  < 2e-16 ***
## X2ndFlrSF      7.863e+01  4.748e+00  16.560  < 2e-16 ***
## LowQualFinSF   2.244e+01  1.749e+01   1.283  0.19974    
## GrLivArea             NA         NA      NA       NA    
## BsmtFullBath   1.210e+03  2.471e+03   0.489  0.62461    
## BsmtHalfBath  -8.845e+02  3.792e+03  -0.233  0.81559    
## FullBath      -3.003e+03  2.662e+03  -1.128  0.25948    
## HalfBath      -6.784e+03  2.535e+03  -2.676  0.00755 ** 
## BedroomAbvGr  -1.375e+04  1.625e+03  -8.460  < 2e-16 ***
## KitchenAbvGr  -1.119e+04  4.906e+03  -2.280  0.02279 *  
## TotRmsAbvGrd   2.969e+03  1.194e+03   2.488  0.01300 *  
## Fireplaces     2.713e+01  1.688e+03   0.016  0.98718    
## GarageYrBlt   -7.529e+00  7.052e+01  -0.107  0.91500    
## GarageCars     2.039e+03  2.724e+03   0.749  0.45425    
## GarageArea     1.344e+01  9.504e+00   1.414  0.15761    
## WoodDeckSF     1.631e+01  7.579e+00   2.152  0.03158 *  
## OpenPorchSF    1.838e+01  1.433e+01   1.283  0.19982    
## EnclosedPorch -1.422e+01  1.634e+01  -0.870  0.38453    
## X3SsnPorch    -4.598e-01  2.754e+01  -0.017  0.98668    
## ScreenPorch    2.071e+01  1.617e+01   1.280  0.20067    
## PoolArea       5.610e+01  2.089e+01   2.685  0.00735 ** 
## MiscVal       -1.505e+00  1.581e+00  -0.952  0.34139    
## MoSold        -1.331e+02  3.276e+02  -0.406  0.68454    
## YrSold        -3.106e+02  6.663e+02  -0.466  0.64118    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29400 on 1133 degrees of freedom
## Multiple R-squared:   0.87,  Adjusted R-squared:  0.866 
## F-statistic: 216.7 on 35 and 1133 DF,  p-value: < 2.2e-16
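
The `NA` rows for TotalBsmtSF and GrLivArea, together with the note "2 not defined because of singularities", indicate that those columns are exact linear combinations of other predictors. As far as I am aware, in the Ames data TotalBsmtSF is the sum of the three basement area columns and GrLivArea is the sum of the first-floor, second-floor and low-quality finished areas, which would explain this. The sketch below reproduces the situation on simulated data (the column names `a`, `b`, `total` and `y` are hypothetical) and shows how `alias()` reports the dependency.

```r
# Simulated data in which one column is an exact sum of two others.
set.seed(1)
d <- data.frame(a = rnorm(50), b = rnorm(50))
d$total <- d$a + d$b                     # perfectly collinear with a and b
d$y <- 2 * d$a - d$b + rnorm(50)

fit <- lm(y ~ a + b + total, data = d)

# Coefficients that lm() could not estimate come back as NA.
aliased <- names(coef(fit))[is.na(coef(fit))]
print(aliased)

# alias() shows how each aliased term is built from the estimable ones.
print(alias(fit)$Complete)
```

Dropping an aliased term does not change the fitted values, which is why the residuals and R-squared stay the same after such a removal.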

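I first remove TotalBsmtSF and GrLivArea, whose coefficients are NA because they are perfectly collinear with other predictors. Removing them leaves the fit itself unchanged.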

# Remove variables with perfect multicollinearity
model_all_manual <- update(model_all_manual, . ~ . - TotalBsmtSF - GrLivArea)
# Print the model summary
summary(model_all_manual)
## 
## Call:
## lm(formula = SalePrice ~ Id + MSSubClass + LotFrontage + LotArea + 
##     OverallQual + OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + 
##     BsmtFinSF1 + BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + 
##     LowQualFinSF + BsmtFullBath + BsmtHalfBath + FullBath + HalfBath + 
##     BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + Fireplaces + 
##     GarageYrBlt + GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + 
##     EnclosedPorch + X3SsnPorch + ScreenPorch + PoolArea + MiscVal + 
##     MoSold + YrSold, data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -138346  -15955   -1147   14172  217521 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -4.300e+05  1.338e+06  -0.321  0.74806    
## Id            -2.309e-02  2.069e+00  -0.011  0.99110    
## MSSubClass    -1.382e+02  2.680e+01  -5.158 2.95e-07 ***
## LotFrontage    5.356e+01  4.771e+01   1.123  0.26185    
## LotArea        4.872e-01  9.231e-02   5.278 1.57e-07 ***
## OverallQual    1.530e+04  1.147e+03  13.344  < 2e-16 ***
## OverallCond    4.706e+03  9.553e+02   4.926 9.64e-07 ***
## YearBuilt      3.544e+02  6.718e+01   5.275 1.59e-07 ***
## YearRemodAdd   1.584e+02  6.469e+01   2.449  0.01446 *  
## MasVnrArea     3.315e+01  5.704e+00   5.813 7.98e-09 ***
## BsmtFinSF1     4.155e+01  4.535e+00   9.162  < 2e-16 ***
## BsmtFinSF2     1.827e+01  6.642e+00   2.751  0.00603 ** 
## BsmtUnfSF      2.097e+01  3.958e+00   5.298 1.41e-07 ***
## X1stFlrSF      7.026e+01  5.631e+00  12.477  < 2e-16 ***
## X2ndFlrSF      7.863e+01  4.748e+00  16.560  < 2e-16 ***
## LowQualFinSF   2.244e+01  1.749e+01   1.283  0.19974    
## BsmtFullBath   1.210e+03  2.471e+03   0.489  0.62461    
## BsmtHalfBath  -8.845e+02  3.792e+03  -0.233  0.81559    
## FullBath      -3.003e+03  2.662e+03  -1.128  0.25948    
## HalfBath      -6.784e+03  2.535e+03  -2.676  0.00755 ** 
## BedroomAbvGr  -1.375e+04  1.625e+03  -8.460  < 2e-16 ***
## KitchenAbvGr  -1.119e+04  4.906e+03  -2.280  0.02279 *  
## TotRmsAbvGrd   2.969e+03  1.194e+03   2.488  0.01300 *  
## Fireplaces     2.713e+01  1.688e+03   0.016  0.98718    
## GarageYrBlt   -7.529e+00  7.052e+01  -0.107  0.91500    
## GarageCars     2.039e+03  2.724e+03   0.749  0.45425    
## GarageArea     1.344e+01  9.504e+00   1.414  0.15761    
## WoodDeckSF     1.631e+01  7.579e+00   2.152  0.03158 *  
## OpenPorchSF    1.838e+01  1.433e+01   1.283  0.19982    
## EnclosedPorch -1.422e+01  1.634e+01  -0.870  0.38453    
## X3SsnPorch    -4.598e-01  2.754e+01  -0.017  0.98668    
## ScreenPorch    2.071e+01  1.617e+01   1.280  0.20067    
## PoolArea       5.610e+01  2.089e+01   2.685  0.00735 ** 
## MiscVal       -1.505e+00  1.581e+00  -0.952  0.34139    
## MoSold        -1.331e+02  3.276e+02  -0.406  0.68454    
## YrSold        -3.106e+02  6.663e+02  -0.466  0.64118    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29400 on 1133 degrees of freedom
## Multiple R-squared:   0.87,  Adjusted R-squared:  0.866 
## F-statistic: 216.7 on 35 and 1133 DF,  p-value: < 2.2e-16

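Next, I remove some variables. Id is only a row identifier and should have been dropped right away (its p-value is 0.991). I also remove LotArea at this step, although its p-value in the summary above is small (1.57e-07), so it does not actually meet the high-p-value criterion.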

# Remove Id and LotArea
model_all_manual <- update(model_all_manual, . ~ . - Id - LotArea)
# Print the model summary
summary(model_all_manual)
## 
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotFrontage + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + 
##     BsmtFullBath + BsmtHalfBath + FullBath + HalfBath + BedroomAbvGr + 
##     KitchenAbvGr + TotRmsAbvGrd + Fireplaces + GarageYrBlt + 
##     GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch + 
##     X3SsnPorch + ScreenPorch + PoolArea + MiscVal + MoSold + 
##     YrSold, data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -138646  -16210   -1108   14215  216784 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.505e+05  1.353e+06  -0.185  0.85316    
## MSSubClass    -1.343e+02  2.708e+01  -4.960 8.11e-07 ***
## LotFrontage    1.164e+02  4.670e+01   2.492  0.01285 *  
## OverallQual    1.479e+04  1.154e+03  12.819  < 2e-16 ***
## OverallCond    4.633e+03  9.653e+02   4.799 1.81e-06 ***
## YearBuilt      3.366e+02  6.785e+01   4.960 8.12e-07 ***
## YearRemodAdd   1.614e+02  6.540e+01   2.467  0.01376 *  
## MasVnrArea     3.336e+01  5.760e+00   5.791 9.03e-09 ***
## BsmtFinSF1     4.287e+01  4.577e+00   9.368  < 2e-16 ***
## BsmtFinSF2     2.116e+01  6.688e+00   3.164  0.00160 ** 
## BsmtUnfSF      2.194e+01  3.997e+00   5.488 5.01e-08 ***
## X1stFlrSF      7.112e+01  5.682e+00  12.517  < 2e-16 ***
## X2ndFlrSF      7.951e+01  4.798e+00  16.571  < 2e-16 ***
## LowQualFinSF   2.182e+01  1.766e+01   1.235  0.21706    
## BsmtFullBath   1.947e+03  2.495e+03   0.781  0.43521    
## BsmtHalfBath  -1.444e+02  3.832e+03  -0.038  0.96996    
## FullBath      -2.153e+03  2.687e+03  -0.801  0.42323    
## HalfBath      -6.923e+03  2.561e+03  -2.704  0.00696 ** 
## BedroomAbvGr  -1.376e+04  1.643e+03  -8.378  < 2e-16 ***
## KitchenAbvGr  -1.217e+04  4.952e+03  -2.458  0.01412 *  
## TotRmsAbvGrd   2.835e+03  1.207e+03   2.349  0.01899 *  
## Fireplaces     1.078e+03  1.695e+03   0.636  0.52507    
## GarageYrBlt   -1.503e+01  7.128e+01  -0.211  0.83303    
## GarageCars     2.497e+03  2.754e+03   0.907  0.36478    
## GarageArea     1.284e+01  9.609e+00   1.336  0.18171    
## WoodDeckSF     1.958e+01  7.632e+00   2.565  0.01044 *  
## OpenPorchSF    1.790e+01  1.449e+01   1.235  0.21702    
## EnclosedPorch -1.776e+01  1.651e+01  -1.075  0.28249    
## X3SsnPorch    -2.366e+00  2.782e+01  -0.085  0.93224    
## ScreenPorch    2.034e+01  1.636e+01   1.244  0.21383    
## PoolArea       5.262e+01  2.107e+01   2.497  0.01268 *  
## MiscVal       -1.132e+00  1.597e+00  -0.709  0.47860    
## MoSold        -1.808e+02  3.310e+02  -0.546  0.58497    
## YrSold        -3.777e+02  6.737e+02  -0.561  0.57519    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29730 on 1135 degrees of freedom
## Multiple R-squared:  0.8668, Adjusted R-squared:  0.863 
## F-statistic: 223.9 on 33 and 1135 DF,  p-value: < 2.2e-16
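
Picking removal candidates by eye from a long coefficient table is error-prone, so it can help to rank the terms by p-value programmatically. This is a minimal sketch on the built-in `mtcars` data (a stand-in, not the housing data); the object names `ranked`, `weakest` and `fit_reduced` are hypothetical.

```r
# Stand-in data: mtcars ships with R. This is NOT the document's train_imputed.
fit <- lm(mpg ~ ., data = mtcars)

# Coefficient table as a matrix; the intercept is never a removal candidate.
coefs <- summary(fit)$coefficients
pvals <- coefs[rownames(coefs) != "(Intercept)", "Pr(>|t|)"]

# Terms ordered from least to most significant.
ranked <- sort(pvals, decreasing = TRUE)
print(round(ranked, 4))

# Drop the single weakest term and refit.
weakest <- names(ranked)[1]
fit_reduced <- update(fit, as.formula(paste(". ~ . -", weakest)))
```

Removing one term at a time and re-ranking after each refit is the more conservative variant, since p-values shift once a correlated predictor leaves the model.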

Notably, the R-squared didn't change much (0.870 to 0.867). I will now remove the two variables with the next highest p-values, BsmtHalfBath and X3SsnPorch.

# Remove variables with high p-values
model_all_manual <- update(model_all_manual, . ~ . - BsmtHalfBath - X3SsnPorch)
# Print the model summary
summary(model_all_manual)
## 
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotFrontage + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + 
##     BsmtFullBath + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + 
##     TotRmsAbvGrd + Fireplaces + GarageYrBlt + GarageCars + GarageArea + 
##     WoodDeckSF + OpenPorchSF + EnclosedPorch + ScreenPorch + 
##     PoolArea + MiscVal + MoSold + YrSold, data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -138628  -16206   -1146   14263  216779 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.485e+05  1.350e+06  -0.184  0.85391    
## MSSubClass    -1.344e+02  2.699e+01  -4.981 7.30e-07 ***
## LotFrontage    1.159e+02  4.636e+01   2.499  0.01258 *  
## OverallQual    1.479e+04  1.152e+03  12.844  < 2e-16 ***
## OverallCond    4.627e+03  9.604e+02   4.818 1.64e-06 ***
## YearBuilt      3.365e+02  6.775e+01   4.967 7.84e-07 ***
## YearRemodAdd   1.612e+02  6.525e+01   2.470  0.01365 *  
## MasVnrArea     3.335e+01  5.746e+00   5.804 8.39e-09 ***
## BsmtFinSF1     4.285e+01  4.521e+00   9.478  < 2e-16 ***
## BsmtFinSF2     2.115e+01  6.634e+00   3.189  0.00147 ** 
## BsmtUnfSF      2.194e+01  3.993e+00   5.493 4.87e-08 ***
## X1stFlrSF      7.110e+01  5.670e+00  12.541  < 2e-16 ***
## X2ndFlrSF      7.951e+01  4.792e+00  16.591  < 2e-16 ***
## LowQualFinSF   2.181e+01  1.765e+01   1.236  0.21682    
## BsmtFullBath   1.974e+03  2.384e+03   0.828  0.40789    
## FullBath      -2.152e+03  2.676e+03  -0.804  0.42142    
## HalfBath      -6.922e+03  2.556e+03  -2.708  0.00688 ** 
## BedroomAbvGr  -1.376e+04  1.633e+03  -8.426  < 2e-16 ***
## KitchenAbvGr  -1.216e+04  4.945e+03  -2.458  0.01411 *  
## TotRmsAbvGrd   2.839e+03  1.205e+03   2.357  0.01859 *  
## Fireplaces     1.079e+03  1.693e+03   0.637  0.52407    
## GarageYrBlt   -1.501e+01  7.120e+01  -0.211  0.83306    
## GarageCars     2.492e+03  2.751e+03   0.906  0.36515    
## GarageArea     1.286e+01  9.597e+00   1.340  0.18041    
## WoodDeckSF     1.960e+01  7.605e+00   2.577  0.01009 *  
## OpenPorchSF    1.795e+01  1.447e+01   1.240  0.21508    
## EnclosedPorch -1.771e+01  1.648e+01  -1.074  0.28283    
## ScreenPorch    2.040e+01  1.633e+01   1.249  0.21185    
## PoolArea       5.262e+01  2.105e+01   2.499  0.01259 *  
## MiscVal       -1.131e+00  1.595e+00  -0.709  0.47856    
## MoSold        -1.820e+02  3.305e+02  -0.551  0.58183    
## YrSold        -3.784e+02  6.718e+02  -0.563  0.57331    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29700 on 1137 degrees of freedom
## Multiple R-squared:  0.8668, Adjusted R-squared:  0.8632 
## F-statistic: 238.8 on 31 and 1137 DF,  p-value: < 2.2e-16
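
Besides watching R-squared, the cost of dropping a pair of terms can be checked with a partial F-test between the nested models. This is a sketch on the built-in `mtcars` data (a stand-in, not the housing data); the predictors chosen are arbitrary.

```r
# Stand-in data: mtcars ships with R. This is NOT the document's train_imputed.
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
reduced <- update(full, . ~ . - qsec - drat)

# Partial F-test: do the two dropped terms jointly improve the fit?
cmp <- anova(reduced, full)
print(cmp)

# R-squared and adjusted R-squared side by side for the two models.
fit_stats <- rbind(
  full    = c(r2 = summary(full)$r.squared,    adj_r2 = summary(full)$adj.r.squared),
  reduced = c(r2 = summary(reduced)$r.squared, adj_r2 = summary(reduced)$adj.r.squared)
)
print(round(fit_stats, 4))
```

A large p-value in the second row of the `anova()` table suggests the dropped terms were not contributing jointly, which supports the removal more directly than an unchanged R-squared alone.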

Again, the R-squared barely changed. I will now remove GarageYrBlt, which has the highest remaining p-value (0.833), along with Fireplaces (0.524), which is also far from significant.

# Remove variables with high p-values
model_all_manual <- update(model_all_manual, . ~ . - GarageYrBlt - Fireplaces)
# Print the model summary
summary(model_all_manual)
## 
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotFrontage + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + 
##     BsmtFullBath + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + 
##     TotRmsAbvGrd + GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + 
##     EnclosedPorch + ScreenPorch + PoolArea + MiscVal + MoSold + 
##     YrSold, data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -138289  -16444   -1341   14260  216525 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -2.521e+05  1.348e+06  -0.187  0.85167    
## MSSubClass    -1.340e+02  2.692e+01  -4.976 7.49e-07 ***
## LotFrontage    1.194e+02  4.594e+01   2.598  0.00949 ** 
## OverallQual    1.489e+04  1.141e+03  13.044  < 2e-16 ***
## OverallCond    4.655e+03  9.571e+02   4.864 1.31e-06 ***
## YearBuilt      3.268e+02  5.795e+01   5.640 2.15e-08 ***
## YearRemodAdd   1.544e+02  6.330e+01   2.439  0.01486 *  
## MasVnrArea     3.345e+01  5.734e+00   5.834 7.04e-09 ***
## BsmtFinSF1     4.285e+01  4.517e+00   9.486  < 2e-16 ***
## BsmtFinSF2     2.107e+01  6.628e+00   3.179  0.00152 ** 
## BsmtUnfSF      2.178e+01  3.983e+00   5.468 5.59e-08 ***
## X1stFlrSF      7.211e+01  5.475e+00  13.171  < 2e-16 ***
## X2ndFlrSF      7.994e+01  4.749e+00  16.834  < 2e-16 ***
## LowQualFinSF   2.097e+01  1.749e+01   1.199  0.23069    
## BsmtFullBath   2.003e+03  2.382e+03   0.841  0.40074    
## FullBath      -2.232e+03  2.670e+03  -0.836  0.40351    
## HalfBath      -6.812e+03  2.550e+03  -2.672  0.00766 ** 
## BedroomAbvGr  -1.386e+04  1.623e+03  -8.540  < 2e-16 ***
## KitchenAbvGr  -1.252e+04  4.892e+03  -2.559  0.01062 *  
## TotRmsAbvGrd   2.833e+03  1.203e+03   2.354  0.01872 *  
## GarageCars     2.651e+03  2.733e+03   0.970  0.33212    
## GarageArea     1.159e+01  9.189e+00   1.262  0.20730    
## WoodDeckSF     1.982e+01  7.542e+00   2.628  0.00871 ** 
## OpenPorchSF    1.816e+01  1.445e+01   1.257  0.20902    
## EnclosedPorch -1.802e+01  1.646e+01  -1.094  0.27405    
## ScreenPorch    2.185e+01  1.618e+01   1.351  0.17697    
## PoolArea       5.285e+01  2.103e+01   2.514  0.01209 *  
## MiscVal       -1.106e+00  1.593e+00  -0.694  0.48782    
## MoSold        -1.764e+02  3.301e+02  -0.534  0.59329    
## YrSold        -3.755e+02  6.713e+02  -0.559  0.57602    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29680 on 1139 degrees of freedom
## Multiple R-squared:  0.8668, Adjusted R-squared:  0.8634 
## F-statistic: 255.6 on 29 and 1139 DF,  p-value: < 2.2e-16

Again, the r-squared didn’t change much, even though the model now has fewer variables. I will now remove the next two highest p-values, which are MoSold and YrSold.

# Remove variables with high p-values
model_all_manual <- update(model_all_manual, . ~ . - MoSold - YrSold)
# Print the model summary
summary(model_all_manual)
## 
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotFrontage + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + 
##     BsmtFullBath + FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + 
##     TotRmsAbvGrd + GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + 
##     EnclosedPorch + ScreenPorch + PoolArea + MiscVal, data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -137305  -15921   -1231   14168  216965 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.004e+06  1.248e+05  -8.045 2.14e-15 ***
## MSSubClass    -1.335e+02  2.690e+01  -4.963 8.01e-07 ***
## LotFrontage    1.177e+02  4.585e+01   2.567  0.01038 *  
## OverallQual    1.485e+04  1.139e+03  13.036  < 2e-16 ***
## OverallCond    4.646e+03  9.564e+02   4.858 1.35e-06 ***
## YearBuilt      3.269e+02  5.792e+01   5.644 2.10e-08 ***
## YearRemodAdd   1.529e+02  6.320e+01   2.420  0.01568 *  
## MasVnrArea     3.347e+01  5.730e+00   5.841 6.76e-09 ***
## BsmtFinSF1     4.302e+01  4.506e+00   9.546  < 2e-16 ***
## BsmtFinSF2     2.113e+01  6.621e+00   3.192  0.00145 ** 
## BsmtUnfSF      2.192e+01  3.976e+00   5.512 4.38e-08 ***
## X1stFlrSF      7.200e+01  5.466e+00  13.171  < 2e-16 ***
## X2ndFlrSF      7.986e+01  4.744e+00  16.835  < 2e-16 ***
## LowQualFinSF   2.141e+01  1.747e+01   1.226  0.22043    
## BsmtFullBath   1.932e+03  2.371e+03   0.815  0.41532    
## FullBath      -2.257e+03  2.668e+03  -0.846  0.39773    
## HalfBath      -6.806e+03  2.547e+03  -2.672  0.00766 ** 
## BedroomAbvGr  -1.388e+04  1.619e+03  -8.570  < 2e-16 ***
## KitchenAbvGr  -1.275e+04  4.877e+03  -2.615  0.00904 ** 
## TotRmsAbvGrd   2.887e+03  1.200e+03   2.406  0.01629 *  
## GarageCars     2.705e+03  2.729e+03   0.991  0.32178    
## GarageArea     1.167e+01  9.179e+00   1.271  0.20389    
## WoodDeckSF     1.969e+01  7.535e+00   2.614  0.00908 ** 
## OpenPorchSF    1.810e+01  1.443e+01   1.255  0.20985    
## EnclosedPorch -1.767e+01  1.644e+01  -1.075  0.28270    
## ScreenPorch    2.145e+01  1.616e+01   1.327  0.18463    
## PoolArea       5.405e+01  2.094e+01   2.581  0.00998 ** 
## MiscVal       -1.097e+00  1.592e+00  -0.689  0.49090    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29670 on 1141 degrees of freedom
## Multiple R-squared:  0.8667, Adjusted R-squared:  0.8636 
## F-statistic: 274.8 on 27 and 1141 DF,  p-value: < 2.2e-16

Again, the r-squared didn’t change much. I will now remove the next two highest p-values, which are MiscVal and BsmtFullBath.

# Remove variables with high p-values
model_all_manual <- update(model_all_manual, . ~ . - MiscVal - BsmtFullBath)
# Print the model summary
summary(model_all_manual)
## 
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotFrontage + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + 
##     FullBath + HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + 
##     GarageCars + GarageArea + WoodDeckSF + OpenPorchSF + EnclosedPorch + 
##     ScreenPorch + PoolArea, data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -137270  -16314   -1359   14160  217278 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -1.013e+06  1.237e+05  -8.190 6.94e-16 ***
## MSSubClass    -1.309e+02  2.677e+01  -4.891 1.15e-06 ***
## LotFrontage    1.177e+02  4.583e+01   2.568 0.010341 *  
## OverallQual    1.485e+04  1.138e+03  13.045  < 2e-16 ***
## OverallCond    4.552e+03  9.521e+02   4.781 1.97e-06 ***
## YearBuilt      3.267e+02  5.781e+01   5.651 2.02e-08 ***
## YearRemodAdd   1.583e+02  6.288e+01   2.517 0.011961 *  
## MasVnrArea     3.324e+01  5.715e+00   5.817 7.79e-09 ***
## BsmtFinSF1     4.445e+01  4.087e+00  10.876  < 2e-16 ***
## BsmtFinSF2     2.222e+01  6.464e+00   3.437 0.000609 ***
## BsmtUnfSF      2.184e+01  3.973e+00   5.497 4.75e-08 ***
## X1stFlrSF      7.196e+01  5.456e+00  13.189  < 2e-16 ***
## X2ndFlrSF      7.975e+01  4.741e+00  16.823  < 2e-16 ***
## LowQualFinSF   2.147e+01  1.746e+01   1.230 0.219040    
## FullBath      -2.508e+03  2.647e+03  -0.947 0.343687    
## HalfBath      -6.890e+03  2.544e+03  -2.708 0.006862 ** 
## BedroomAbvGr  -1.383e+04  1.617e+03  -8.551  < 2e-16 ***
## KitchenAbvGr  -1.288e+04  4.860e+03  -2.651 0.008136 ** 
## TotRmsAbvGrd   2.872e+03  1.197e+03   2.399 0.016581 *  
## GarageCars     2.798e+03  2.726e+03   1.027 0.304776    
## GarageArea     1.168e+01  9.174e+00   1.273 0.203273    
## WoodDeckSF     2.011e+01  7.515e+00   2.676 0.007560 ** 
## OpenPorchSF    1.869e+01  1.441e+01   1.297 0.194887    
## EnclosedPorch -1.741e+01  1.642e+01  -1.061 0.289119    
## ScreenPorch    2.106e+01  1.614e+01   1.305 0.192060    
## PoolArea       5.373e+01  2.092e+01   2.568 0.010355 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29650 on 1143 degrees of freedom
## Multiple R-squared:  0.8666, Adjusted R-squared:  0.8637 
## F-statistic:   297 on 25 and 1143 DF,  p-value: < 2.2e-16

Again, the r-squared didn’t change much. I will now remove the next two highest p-values, which are FullBath and GarageCars.

# Remove variables with high p-values
model_all_manual <- update(model_all_manual, . ~ . - FullBath - GarageCars)
# Print the model summary
summary(model_all_manual)
## 
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotFrontage + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + LowQualFinSF + 
##     HalfBath + BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + GarageArea + 
##     WoodDeckSF + OpenPorchSF + EnclosedPorch + ScreenPorch + 
##     PoolArea, data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -137411  -16451   -1250   13885  214724 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -9.847e+05  1.136e+05  -8.670  < 2e-16 ***
## MSSubClass    -1.309e+02  2.674e+01  -4.895 1.12e-06 ***
## LotFrontage    1.181e+02  4.582e+01   2.577 0.010088 *  
## OverallQual    1.489e+04  1.130e+03  13.177  < 2e-16 ***
## OverallCond    4.565e+03  9.502e+02   4.804 1.76e-06 ***
## YearBuilt      3.163e+02  5.477e+01   5.775 9.93e-09 ***
## YearRemodAdd   1.539e+02  6.250e+01   2.463 0.013938 *  
## MasVnrArea     3.336e+01  5.710e+00   5.843 6.69e-09 ***
## BsmtFinSF1     4.453e+01  4.063e+00  10.959  < 2e-16 ***
## BsmtFinSF2     2.218e+01  6.447e+00   3.440 0.000603 ***
## BsmtUnfSF      2.180e+01  3.971e+00   5.489 4.98e-08 ***
## X1stFlrSF      7.091e+01  5.267e+00  13.463  < 2e-16 ***
## X2ndFlrSF      7.804e+01  4.331e+00  18.020  < 2e-16 ***
## LowQualFinSF   1.977e+01  1.740e+01   1.136 0.256286    
## HalfBath      -5.944e+03  2.371e+03  -2.507 0.012304 *  
## BedroomAbvGr  -1.397e+04  1.610e+03  -8.682  < 2e-16 ***
## KitchenAbvGr  -1.362e+04  4.798e+03  -2.838 0.004618 ** 
## TotRmsAbvGrd   2.909e+03  1.196e+03   2.432 0.015160 *  
## GarageArea     1.927e+01  5.556e+00   3.468 0.000545 ***
## WoodDeckSF     2.010e+01  7.515e+00   2.675 0.007577 ** 
## OpenPorchSF    1.674e+01  1.434e+01   1.168 0.243134    
## EnclosedPorch -1.800e+01  1.641e+01  -1.097 0.273047    
## ScreenPorch    2.172e+01  1.613e+01   1.347 0.178236    
## PoolArea       5.415e+01  2.091e+01   2.589 0.009735 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29650 on 1145 degrees of freedom
## Multiple R-squared:  0.8664, Adjusted R-squared:  0.8637 
## F-statistic: 322.8 on 23 and 1145 DF,  p-value: < 2.2e-16

Again, the r-squared didn’t change much. I will now remove the next two highest p-values, which are EnclosedPorch and LowQualFinSF.

# Remove variables with high p-values
model_all_manual <- update(model_all_manual, . ~ . - EnclosedPorch - LowQualFinSF)
# Print the model summary
summary(model_all_manual)
## 
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotFrontage + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + HalfBath + 
##     BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + GarageArea + 
##     WoodDeckSF + OpenPorchSF + ScreenPorch + PoolArea, data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -137057  -16453   -1493   13910  213617 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.004e+06  1.087e+05  -9.241  < 2e-16 ***
## MSSubClass   -1.274e+02  2.661e+01  -4.788 1.91e-06 ***
## LotFrontage   1.193e+02  4.576e+01   2.606 0.009268 ** 
## OverallQual   1.481e+04  1.126e+03  13.160  < 2e-16 ***
## OverallCond   4.616e+03  9.409e+02   4.905 1.07e-06 ***
## YearBuilt     3.273e+02  5.075e+01   6.450 1.64e-10 ***
## YearRemodAdd  1.524e+02  6.238e+01   2.442 0.014745 *  
## MasVnrArea    3.326e+01  5.704e+00   5.831 7.14e-09 ***
## BsmtFinSF1    4.450e+01  4.062e+00  10.956  < 2e-16 ***
## BsmtFinSF2    2.192e+01  6.443e+00   3.403 0.000691 ***
## BsmtUnfSF     2.179e+01  3.968e+00   5.492 4.89e-08 ***
## X1stFlrSF     7.034e+01  5.251e+00  13.395  < 2e-16 ***
## X2ndFlrSF     7.718e+01  4.297e+00  17.960  < 2e-16 ***
## HalfBath     -5.930e+03  2.370e+03  -2.502 0.012481 *  
## BedroomAbvGr -1.397e+04  1.609e+03  -8.681  < 2e-16 ***
## KitchenAbvGr -1.421e+04  4.769e+03  -2.979 0.002949 ** 
## TotRmsAbvGrd  3.212e+03  1.177e+03   2.729 0.006440 ** 
## GarageArea    1.897e+01  5.554e+00   3.416 0.000658 ***
## WoodDeckSF    2.086e+01  7.488e+00   2.786 0.005425 ** 
## OpenPorchSF   1.804e+01  1.432e+01   1.260 0.207799    
## ScreenPorch   2.458e+01  1.596e+01   1.540 0.123767    
## PoolArea      5.399e+01  2.082e+01   2.593 0.009642 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29660 on 1147 degrees of freedom
## Multiple R-squared:  0.8661, Adjusted R-squared:  0.8636 
## F-statistic: 353.2 on 21 and 1147 DF,  p-value: < 2.2e-16

Again, the r-squared didn’t change much. I will now remove the next highest p-value, which is OpenPorchSF.

# Remove variables with high p-values
model_all_manual <- update(model_all_manual, . ~ . - OpenPorchSF)
# Print the model summary
summary(model_all_manual)
## 
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotFrontage + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + HalfBath + 
##     BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + GarageArea + 
##     WoodDeckSF + ScreenPorch + PoolArea, data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -137606  -16434   -1102   13637  214553 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.012e+06  1.085e+05  -9.328  < 2e-16 ***
## MSSubClass   -1.277e+02  2.661e+01  -4.798 1.82e-06 ***
## LotFrontage   1.199e+02  4.576e+01   2.619 0.008928 ** 
## OverallQual   1.490e+04  1.124e+03  13.259  < 2e-16 ***
## OverallCond   4.641e+03  9.410e+02   4.933 9.31e-07 ***
## YearBuilt     3.285e+02  5.075e+01   6.474 1.41e-10 ***
## YearRemodAdd  1.550e+02  6.236e+01   2.486 0.013053 *  
## MasVnrArea    3.294e+01  5.700e+00   5.780 9.62e-09 ***
## BsmtFinSF1    4.481e+01  4.056e+00  11.049  < 2e-16 ***
## BsmtFinSF2    2.204e+01  6.444e+00   3.419 0.000649 ***
## BsmtUnfSF     2.219e+01  3.956e+00   5.609 2.55e-08 ***
## X1stFlrSF     7.060e+01  5.249e+00  13.450  < 2e-16 ***
## X2ndFlrSF     7.768e+01  4.280e+00  18.148  < 2e-16 ***
## HalfBath     -5.817e+03  2.369e+03  -2.456 0.014211 *  
## BedroomAbvGr -1.399e+04  1.610e+03  -8.689  < 2e-16 ***
## KitchenAbvGr -1.450e+04  4.764e+03  -3.043 0.002393 ** 
## TotRmsAbvGrd  3.187e+03  1.177e+03   2.708 0.006878 ** 
## GarageArea    1.918e+01  5.553e+00   3.455 0.000571 ***
## WoodDeckSF    2.041e+01  7.481e+00   2.729 0.006457 ** 
## ScreenPorch   2.529e+01  1.595e+01   1.585 0.113174    
## PoolArea      5.424e+01  2.083e+01   2.604 0.009329 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29670 on 1148 degrees of freedom
## Multiple R-squared:  0.8659, Adjusted R-squared:  0.8636 
## F-statistic: 370.6 on 20 and 1148 DF,  p-value: < 2.2e-16

Again, the r-squared didn’t change much. I will now remove the next highest p-value, which is ScreenPorch.

# Remove variables with high p-values
model_all_manual <- update(model_all_manual, . ~ . - ScreenPorch)
# Print the model summary
summary(model_all_manual)
## 
## Call:
## lm(formula = SalePrice ~ MSSubClass + LotFrontage + OverallQual + 
##     OverallCond + YearBuilt + YearRemodAdd + MasVnrArea + BsmtFinSF1 + 
##     BsmtFinSF2 + BsmtUnfSF + X1stFlrSF + X2ndFlrSF + HalfBath + 
##     BedroomAbvGr + KitchenAbvGr + TotRmsAbvGrd + GarageArea + 
##     WoodDeckSF + PoolArea, data = train_imputed)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -138158  -15879   -1138   13677  218486 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -9.931e+05  1.079e+05  -9.203  < 2e-16 ***
## MSSubClass   -1.279e+02  2.663e+01  -4.802 1.78e-06 ***
## LotFrontage   1.165e+02  4.574e+01   2.546 0.011035 *  
## OverallQual   1.493e+04  1.124e+03  13.277  < 2e-16 ***
## OverallCond   4.726e+03  9.401e+02   5.027 5.76e-07 ***
## YearBuilt     3.240e+02  5.070e+01   6.390 2.40e-10 ***
## YearRemodAdd  1.497e+02  6.231e+01   2.402 0.016443 *  
## MasVnrArea    3.329e+01  5.699e+00   5.841 6.75e-09 ***
## BsmtFinSF1    4.501e+01  4.056e+00  11.095  < 2e-16 ***
## BsmtFinSF2    2.266e+01  6.436e+00   3.520 0.000448 ***
## BsmtUnfSF     2.226e+01  3.959e+00   5.624 2.35e-08 ***
## X1stFlrSF     7.110e+01  5.243e+00  13.560  < 2e-16 ***
## X2ndFlrSF     7.788e+01  4.281e+00  18.192  < 2e-16 ***
## HalfBath     -5.488e+03  2.361e+03  -2.324 0.020292 *  
## BedroomAbvGr -1.397e+04  1.611e+03  -8.670  < 2e-16 ***
## KitchenAbvGr -1.497e+04  4.758e+03  -3.145 0.001703 ** 
## TotRmsAbvGrd  3.178e+03  1.178e+03   2.699 0.007064 ** 
## GarageArea    1.945e+01  5.554e+00   3.501 0.000481 ***
## WoodDeckSF    1.906e+01  7.437e+00   2.563 0.010507 *  
## PoolArea      5.593e+01  2.082e+01   2.687 0.007314 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29690 on 1149 degrees of freedom
## Multiple R-squared:  0.8656, Adjusted R-squared:  0.8634 
## F-statistic: 389.5 on 19 and 1149 DF,  p-value: < 2.2e-16

All the predictors now have statistically significant p-values at the 0.05 level. Interestingly, only the last elimination made the manual model different from the backwards elimination model. Also notable: the r-squared for the reduced model (0.8656) is practically the same as that of the original model with all the numeric variables. Even though, as mentioned, this model’s purpose is prediction, a model with fewer variables is generally preferable because it is less prone to overfitting. I will keep this model as the final iteration of the model using all the numeric variables.
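
The manual procedure above (drop the least significant predictors, refit, repeat) can also be written as a loop. This is only a minimal sketch, not the code used for this report: `backward_by_pvalue` is a hypothetical helper, the 0.05 cutoff is an assumption, and it runs on the built-in mtcars data so it is self-contained. It drops one predictor per iteration and assumes numeric predictors, so that coefficient names match term names.

```r
# Hypothetical helper: repeatedly drop the predictor with the highest p-value
# until every remaining predictor is below `alpha` (assumed cutoff).
backward_by_pvalue <- function(model, alpha = 0.05) {
  repeat {
    coefs <- summary(model)$coefficients
    keep  <- rownames(coefs) != "(Intercept)"
    pvals <- setNames(coefs[keep, "Pr(>|t|)"], rownames(coefs)[keep])
    # Stop when nothing is left to drop or everything is significant
    if (length(pvals) == 0 || max(pvals) < alpha) break
    worst <- names(which.max(pvals))
    # Refit without the least significant predictor
    model <- update(model, as.formula(paste(". ~ . -", worst)))
  }
  model
}

# Illustration on built-in data (not the housing data)
fit_reduced <- backward_by_pvalue(lm(mpg ~ ., data = mtcars))
summary(fit_reduced)$coefficients
```

Dropping one variable at a time is slower than dropping two per step, but it avoids removing a predictor whose p-value would have changed once its neighbour was gone.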

Next, I will work on improving the PCA model.

First, let’s add a PCA model with just the first two components.

# Build a regression model using the first 2 PCA components
pca_2 <- cbind(SalePrice = train_imputed$SalePrice, pca_result_all$x[, 1:2])

model_pca_2 <- lm(SalePrice ~ ., as.data.frame(pca_2))
# Print the model summary
summary(model_pca_2)
## 
## Call:
## lm(formula = SalePrice ~ ., data = as.data.frame(pca_2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -127944  -20243   -3651   15330  271638 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 181183.6      980.7 184.743  < 2e-16 ***
## PC1          25951.1      349.1  74.337  < 2e-16 ***
## PC2           1583.7      539.8   2.934  0.00342 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33530 on 1166 degrees of freedom
## Multiple R-squared:  0.826,  Adjusted R-squared:  0.8257 
## F-statistic:  2767 on 2 and 1166 DF,  p-value: < 2.2e-16
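
For readers who want to reproduce the idea of regressing on the first two components, here is a small self-contained sketch on the built-in mtcars data. The object names (`pca`, `pcr_data`, `fit_pc2`) are illustrative and are not the objects used in this report. It also prints the cumulative share of predictor variance, which is what the ordering of the components reflects.

```r
# Principal component regression on built-in data, for illustration only
X   <- mtcars[, setdiff(names(mtcars), "mpg")]
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Cumulative share of predictor variance captured by the components
cum_var <- cumsum(pca$sdev^2) / sum(pca$sdev^2)

# Regress the response on the first two component scores
pcr_data <- data.frame(mpg = mtcars$mpg, pca$x[, 1:2])
fit_pc2  <- lm(mpg ~ PC1 + PC2, data = pcr_data)

round(cum_var[1:2], 3)
summary(fit_pc2)$r.squared
```

Components are ordered by how much predictor variance they capture, not by how strongly they relate to the response, so a late component can still carry a lot of predictive signal.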

Now I will build a PCA model using backwards elimination on all the principal components.

# Perform backwards elimination on the model using all the principal components
model_pca_backwards <- step(model_pca_99)
## Start:  AIC=23084.66
## SalePrice ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + 
##     PC10 + PC11 + PC12 + PC13 + PC14 + PC15 + PC16 + PC17 + PC18 + 
##     PC19 + PC20 + PC21 + PC22 + PC23 + PC24 + PC25 + PC26 + PC27 + 
##     PC28 + PC29 + PC30 + PC31 + PC32 + PC33
## 
##        Df  Sum of Sq        RSS   AIC
## - PC17  1 4.2023e+06 4.1565e+11 23083
## - PC24  1 4.2082e+07 4.1568e+11 23083
## - PC6   1 1.3189e+08 4.1577e+11 23083
## <none>               4.1564e+11 23085
## - PC11  1 8.9201e+08 4.1653e+11 23085
## - PC23  1 1.1586e+09 4.1680e+11 23086
## - PC19  1 2.0976e+09 4.1774e+11 23089
## - PC29  1 2.2335e+09 4.1788e+11 23089
## - PC14  1 2.3293e+09 4.1797e+11 23089
## - PC13  1 2.7680e+09 4.1841e+11 23090
## - PC15  1 2.8466e+09 4.1849e+11 23091
## - PC20  1 3.5338e+09 4.1918e+11 23093
## - PC18  1 4.4477e+09 4.2009e+11 23095
## - PC10  1 5.4819e+09 4.2112e+11 23098
## - PC25  1 7.1124e+09 4.2275e+11 23103
## - PC27  1 9.1426e+09 4.2478e+11 23108
## - PC2   1 9.6772e+09 4.2532e+11 23110
## - PC16  1 1.0151e+10 4.2579e+11 23111
## - PC28  1 1.5081e+10 4.3072e+11 23124
## - PC32  1 1.7119e+10 4.3276e+11 23130
## - PC7   1 1.8177e+10 4.3382e+11 23133
## - PC12  1 1.9868e+10 4.3551e+11 23137
## - PC9   1 2.1525e+10 4.3717e+11 23142
## - PC31  1 2.5240e+10 4.4088e+11 23152
## - PC22  1 2.6360e+10 4.4200e+11 23155
## - PC21  1 2.6631e+10 4.4227e+11 23155
## - PC3   1 5.5773e+10 4.7142e+11 23230
## - PC4   1 6.2564e+10 4.7821e+11 23247
## - PC8   1 6.6040e+10 4.8168e+11 23255
## - PC5   1 7.0799e+10 4.8644e+11 23267
## - PC30  1 9.1222e+10 5.0686e+11 23315
## - PC33  1 1.1814e+11 5.3379e+11 23375
## - PC26  1 2.0648e+11 6.2213e+11 23554
## - PC1   1 6.2134e+12 6.6290e+12 26320
## 
## Step:  AIC=23082.67
## SalePrice ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + 
##     PC10 + PC11 + PC12 + PC13 + PC14 + PC15 + PC16 + PC18 + PC19 + 
##     PC20 + PC21 + PC22 + PC23 + PC24 + PC25 + PC26 + PC27 + PC28 + 
##     PC29 + PC30 + PC31 + PC32 + PC33
## 
##        Df  Sum of Sq        RSS   AIC
## - PC24  1 4.2082e+07 4.1569e+11 23081
## - PC6   1 1.3189e+08 4.1578e+11 23081
## <none>               4.1565e+11 23083
## - PC11  1 8.9201e+08 4.1654e+11 23083
## - PC23  1 1.1586e+09 4.1681e+11 23084
## - PC19  1 2.0976e+09 4.1774e+11 23087
## - PC29  1 2.2335e+09 4.1788e+11 23087
## - PC14  1 2.3293e+09 4.1798e+11 23087
## - PC13  1 2.7680e+09 4.1841e+11 23088
## - PC15  1 2.8466e+09 4.1849e+11 23089
## - PC20  1 3.5338e+09 4.1918e+11 23091
## - PC18  1 4.4477e+09 4.2009e+11 23093
## - PC10  1 5.4819e+09 4.2113e+11 23096
## - PC25  1 7.1124e+09 4.2276e+11 23101
## - PC27  1 9.1426e+09 4.2479e+11 23106
## - PC2   1 9.6772e+09 4.2532e+11 23108
## - PC16  1 1.0151e+10 4.2580e+11 23109
## - PC28  1 1.5081e+10 4.3073e+11 23122
## - PC32  1 1.7119e+10 4.3277e+11 23128
## - PC7   1 1.8177e+10 4.3382e+11 23131
## - PC12  1 1.9868e+10 4.3551e+11 23135
## - PC9   1 2.1525e+10 4.3717e+11 23140
## - PC31  1 2.5240e+10 4.4089e+11 23150
## - PC22  1 2.6360e+10 4.4201e+11 23153
## - PC21  1 2.6631e+10 4.4228e+11 23153
## - PC3   1 5.5773e+10 4.7142e+11 23228
## - PC4   1 6.2564e+10 4.7821e+11 23245
## - PC8   1 6.6040e+10 4.8169e+11 23253
## - PC5   1 7.0799e+10 4.8645e+11 23265
## - PC30  1 9.1222e+10 5.0687e+11 23313
## - PC33  1 1.1814e+11 5.3379e+11 23373
## - PC26  1 2.0648e+11 6.2213e+11 23552
## - PC1   1 6.2134e+12 6.6290e+12 26318
## 
## Step:  AIC=23080.79
## SalePrice ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8 + PC9 + 
##     PC10 + PC11 + PC12 + PC13 + PC14 + PC15 + PC16 + PC18 + PC19 + 
##     PC20 + PC21 + PC22 + PC23 + PC25 + PC26 + PC27 + PC28 + PC29 + 
##     PC30 + PC31 + PC32 + PC33
## 
##        Df  Sum of Sq        RSS   AIC
## - PC6   1 1.3189e+08 4.1582e+11 23079
## <none>               4.1569e+11 23081
## - PC11  1 8.9201e+08 4.1658e+11 23081
## - PC23  1 1.1586e+09 4.1685e+11 23082
## - PC19  1 2.0976e+09 4.1779e+11 23085
## - PC29  1 2.2335e+09 4.1792e+11 23085
## - PC14  1 2.3293e+09 4.1802e+11 23085
## - PC13  1 2.7680e+09 4.1846e+11 23087
## - PC15  1 2.8466e+09 4.1854e+11 23087
## - PC20  1 3.5338e+09 4.1922e+11 23089
## - PC18  1 4.4477e+09 4.2014e+11 23091
## - PC10  1 5.4819e+09 4.2117e+11 23094
## - PC25  1 7.1124e+09 4.2280e+11 23099
## - PC27  1 9.1426e+09 4.2483e+11 23104
## - PC2   1 9.6772e+09 4.2537e+11 23106
## - PC16  1 1.0151e+10 4.2584e+11 23107
## - PC28  1 1.5081e+10 4.3077e+11 23120
## - PC32  1 1.7119e+10 4.3281e+11 23126
## - PC7   1 1.8177e+10 4.3387e+11 23129
## - PC12  1 1.9868e+10 4.3556e+11 23133
## - PC9   1 2.1525e+10 4.3721e+11 23138
## - PC31  1 2.5240e+10 4.4093e+11 23148
## - PC22  1 2.6360e+10 4.4205e+11 23151
## - PC21  1 2.6631e+10 4.4232e+11 23151
## - PC3   1 5.5773e+10 4.7146e+11 23226
## - PC4   1 6.2564e+10 4.7825e+11 23243
## - PC8   1 6.6040e+10 4.8173e+11 23251
## - PC5   1 7.0799e+10 4.8649e+11 23263
## - PC30  1 9.1222e+10 5.0691e+11 23311
## - PC33  1 1.1814e+11 5.3383e+11 23371
## - PC26  1 2.0648e+11 6.2217e+11 23550
## - PC1   1 6.2134e+12 6.6290e+12 26316
## 
## Step:  AIC=23079.16
## SalePrice ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC7 + PC8 + PC9 + PC10 + 
##     PC11 + PC12 + PC13 + PC14 + PC15 + PC16 + PC18 + PC19 + PC20 + 
##     PC21 + PC22 + PC23 + PC25 + PC26 + PC27 + PC28 + PC29 + PC30 + 
##     PC31 + PC32 + PC33
## 
##        Df  Sum of Sq        RSS   AIC
## <none>               4.1582e+11 23079
## - PC11  1 8.9201e+08 4.1671e+11 23080
## - PC23  1 1.1586e+09 4.1698e+11 23080
## - PC19  1 2.0976e+09 4.1792e+11 23083
## - PC29  1 2.2335e+09 4.1805e+11 23083
## - PC14  1 2.3293e+09 4.1815e+11 23084
## - PC13  1 2.7680e+09 4.1859e+11 23085
## - PC15  1 2.8466e+09 4.1867e+11 23085
## - PC20  1 3.5338e+09 4.1935e+11 23087
## - PC18  1 4.4477e+09 4.2027e+11 23090
## - PC10  1 5.4819e+09 4.2130e+11 23093
## - PC25  1 7.1124e+09 4.2293e+11 23097
## - PC27  1 9.1426e+09 4.2496e+11 23103
## - PC2   1 9.6772e+09 4.2550e+11 23104
## - PC16  1 1.0151e+10 4.2597e+11 23105
## - PC28  1 1.5081e+10 4.3090e+11 23119
## - PC32  1 1.7119e+10 4.3294e+11 23124
## - PC7   1 1.8177e+10 4.3400e+11 23127
## - PC12  1 1.9868e+10 4.3569e+11 23132
## - PC9   1 2.1525e+10 4.3735e+11 23136
## - PC31  1 2.5240e+10 4.4106e+11 23146
## - PC22  1 2.6360e+10 4.4218e+11 23149
## - PC21  1 2.6631e+10 4.4245e+11 23150
## - PC3   1 5.5773e+10 4.7159e+11 23224
## - PC4   1 6.2564e+10 4.7838e+11 23241
## - PC8   1 6.6040e+10 4.8186e+11 23250
## - PC5   1 7.0799e+10 4.8662e+11 23261
## - PC30  1 9.1222e+10 5.0704e+11 23309
## - PC33  1 1.1814e+11 5.3397e+11 23370
## - PC26  1 2.0648e+11 6.2230e+11 23549
## - PC1   1 6.2134e+12 6.6292e+12 26314
# Print the model summary
summary(model_pca_backwards)
## 
## Call:
## lm(formula = SalePrice ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC7 + 
##     PC8 + PC9 + PC10 + PC11 + PC12 + PC13 + PC14 + PC15 + PC16 + 
##     PC18 + PC19 + PC20 + PC21 + PC22 + PC23 + PC25 + PC26 + PC27 + 
##     PC28 + PC29 + PC30 + PC31 + PC32 + PC33, data = as.data.frame(pca_99))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -83749 -10669   -153  10814 139969 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 181183.6      559.1 324.074  < 2e-16 ***
## PC1          25951.1      199.0 130.401  < 2e-16 ***
## PC2           1583.7      307.7   5.146 3.13e-07 ***
## PC3          -4335.9      351.0 -12.355  < 2e-16 ***
## PC4           5096.2      389.5  13.085  < 2e-16 ***
## PC5          -6243.6      448.5 -13.920  < 2e-16 ***
## PC7          -3608.2      511.6  -7.053 3.03e-12 ***
## PC8          -7005.0      521.1 -13.444  < 2e-16 ***
## PC9           4050.7      527.8   7.675 3.54e-14 ***
## PC10          2065.1      533.2   3.873 0.000113 ***
## PC11           839.6      537.4   1.562 0.118462    
## PC12          4047.5      548.9   7.374 3.18e-13 ***
## PC13          1534.7      557.6   2.752 0.006012 ** 
## PC14          1425.4      564.6   2.525 0.011711 *  
## PC15          1616.9      579.3   2.791 0.005340 ** 
## PC16         -3069.0      582.3  -5.271 1.63e-07 ***
## PC18          2135.5      612.1   3.489 0.000504 ***
## PC19          1480.5      617.9   2.396 0.016739 *  
## PC20          1953.0      628.0   3.110 0.001918 ** 
## PC21         -5620.6      658.4  -8.537  < 2e-16 ***
## PC22         -5760.9      678.3  -8.494  < 2e-16 ***
## PC23          1260.7      708.0   1.781 0.075237 .  
## PC25          3477.8      788.3   4.412 1.12e-05 ***
## PC26         20663.8      869.3  23.772  < 2e-16 ***
## PC27          4481.7      896.0   5.002 6.56e-07 ***
## PC28          6428.3     1000.6   6.424 1.94e-10 ***
## PC29         -2626.7     1062.4  -2.472 0.013567 *  
## PC30        -17238.6     1091.0 -15.800  < 2e-16 ***
## PC31        -10384.0     1249.4  -8.311 2.67e-16 ***
## PC32          8577.8     1253.2   6.845 1.25e-11 ***
## PC33         27212.4     1513.4  17.981  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19120 on 1138 degrees of freedom
## Multiple R-squared:  0.9448, Adjusted R-squared:  0.9434 
## F-statistic: 649.4 on 30 and 1138 DF,  p-value: < 2.2e-16
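
The AIC values printed in the `step()` trace are on a different scale from those returned by `AIC()`. For a linear model, `step()` uses `extractAIC()`, which computes n * log(RSS / n) + 2 * (number of coefficients) and leaves out constant terms. As a rough check against the trace above, n = 1169, RSS = 4.1564e+11 and 34 coefficients give about 23084.65, in line with the printed 23084.66. The sketch below verifies the relationship on the built-in mtcars data; the model is arbitrary and only for illustration.

```r
# Compare the AIC scale used by step() with the one returned by AIC()
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

n   <- nobs(fit)
rss <- sum(residuals(fit)^2)
edf <- n - df.residual(fit)          # number of estimated coefficients

aic_step <- n * log(rss / n) + 2 * edf   # scale printed by step()
aic_full <- AIC(fit)                     # full log-likelihood scale

# Same value as extractAIC(), which step() relies on
all.equal(aic_step, unname(extractAIC(fit)[2]))

# The two scales differ by a constant that depends only on n
all.equal(aic_full - aic_step, n * (1 + log(2 * pi)) + 2)
```

Because the offset depends only on the sample size, both scales rank models fitted to the same data identically.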

Let’s remove the principal components with p-values above 0.01.

model_pca_backwards <- update(model_pca_backwards, . ~ . - PC11 - PC14 - PC19 - PC23 - PC29)
# Print the model summary
summary(model_pca_backwards)
## 
## Call:
## lm(formula = SalePrice ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC7 + 
##     PC8 + PC9 + PC10 + PC12 + PC13 + PC15 + PC16 + PC18 + PC20 + 
##     PC21 + PC22 + PC25 + PC26 + PC27 + PC28 + PC30 + PC31 + PC32 + 
##     PC33, data = as.data.frame(pca_99))
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -86159 -11004     89  10732 139446 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 181183.6      563.7 321.436  < 2e-16 ***
## PC1          25951.1      200.6 129.340  < 2e-16 ***
## PC2           1583.7      310.3   5.104 3.88e-07 ***
## PC3          -4335.9      353.8 -12.254  < 2e-16 ***
## PC4           5096.2      392.7  12.979  < 2e-16 ***
## PC5          -6243.6      452.2 -13.806  < 2e-16 ***
## PC7          -3608.2      515.8  -6.996 4.49e-12 ***
## PC8          -7005.0      525.3 -13.334  < 2e-16 ***
## PC9           4050.7      532.1   7.613 5.60e-14 ***
## PC10          2065.1      537.5   3.842 0.000129 ***
## PC12          4047.5      553.4   7.314 4.87e-13 ***
## PC13          1534.7      562.2   2.730 0.006432 ** 
## PC15          1616.9      584.1   2.768 0.005724 ** 
## PC16         -3069.0      587.1  -5.228 2.04e-07 ***
## PC18          2135.5      617.1   3.460 0.000559 ***
## PC20          1953.0      633.1   3.085 0.002088 ** 
## PC21         -5620.6      663.8  -8.468  < 2e-16 ***
## PC22         -5760.9      683.8  -8.424  < 2e-16 ***
## PC25          3477.8      794.7   4.376 1.32e-05 ***
## PC26         20663.8      876.4  23.578  < 2e-16 ***
## PC27          4481.7      903.3   4.961 8.06e-07 ***
## PC28          6428.3     1008.8   6.372 2.70e-10 ***
## PC30        -17238.6     1100.0 -15.672  < 2e-16 ***
## PC31        -10384.0     1259.7  -8.243 4.55e-16 ***
## PC32          8577.8     1263.5   6.789 1.81e-11 ***
## PC33         27212.4     1525.8  17.835  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19270 on 1143 degrees of freedom
## Multiple R-squared:  0.9437, Adjusted R-squared:  0.9424 
## F-statistic: 765.7 on 25 and 1143 DF,  p-value: < 2.2e-16
predictions <- predict(model_all_manual, test_imputed)
# Calculate the RMSE
rmse_model_all_manual <- sqrt(mean((test_imputed$SalePrice - predictions)^2))

predictions <- predict(model_pca_2, test_imputed)
# Calculate the RMSE
rmse_model_pca_2 <- sqrt(mean((test_imputed$SalePrice - predictions)^2))

predictions <- predict(model_pca_backwards, test_imputed)
# Calculate the RMSE
rmse_model_pca_backwards <- sqrt(mean((test_imputed$SalePrice - predictions)^2))
# Combine all the models rmse and r-squared into one dataframe
model_results_new <- data.frame(Model = c("Backwards Elimination", "Top Correlated Variables", "PCA 80", "PCA 2", "PCA Backwards"),
                             RMSE = c(rmse_model_all_manual, rmse_model_top, rmse_model_pca_80, rmse_model_pca_2, rmse_model_pca_backwards),
                             R_squared = c(summary(model_all_manual)$r.squared, summary(model_top)$r.squared, summary(model_pca_80)$r.squared, summary(model_pca_2)$r.squared, summary(model_pca_backwards)$r.squared),
                             F_statistic = c(summary(model_all_manual)$fstatistic[1], summary(model_top)$fstatistic[1], summary(model_pca_80)$fstatistic[1], summary(model_pca_2)$fstatistic[1], summary(model_pca_backwards)$fstatistic[1]),
                             AIC = c(AIC(model_all_manual), AIC(model_top), AIC(model_pca_80), AIC(model_pca_2), AIC(model_pca_backwards)),
                             BIC = c(BIC(model_all_manual), BIC(model_top), BIC(model_pca_80), BIC(model_pca_2), BIC(model_pca_backwards)))

model_results_new
##                      Model      RMSE R_squared F_statistic      AIC      BIC
## 1    Backwards Elimination  27736.90 0.8655953    389.4636 27417.09 27523.43
## 2 Top Correlated Variables  33713.83 0.8090931    820.7894 27801.33 27841.84
## 3                   PCA 80 147295.33 0.8716175    433.7558 27361.50 27462.78
## 4                    PCA 2 146536.96 0.8259851   2767.2872 27685.03 27705.28
## 5            PCA Backwards 150478.54 0.9436518    765.6635 26412.87 26549.60

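Since the primary goal of this model is prediction, the model with the lowest RMSE on the test set is the best model. That is the model built with backwards elimination on all the numeric variables, with a test RMSE of about 27,737. The backwards PCA model has a much higher R-squared (0.944 versus 0.866) and F-statistic than the backwards elimination model, along with the lowest AIC and BIC of all five models, but those statistics are computed on the training data, whereas the RMSE measures error on data the model has not seen. The three PCA models have test RMSEs of roughly 147,000 to 150,000, about five times that of the backwards elimination model, so they perform terribly for prediction. We will select the backwards elimination model as the final model.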
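The selection rule described above can be sketched as follows. This is a minimal illustration on simulated data rather than the housing data; the helper `rmse()`, the data frame `toy`, and the models `fit_small` and `fit_large` are hypothetical names introduced only for this example.

```r
# Hypothetical helper: root mean squared error between observed and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

set.seed(42)
# Simulated data standing in for the housing data (assumption, illustration only)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 3 + 2 * toy$x1 - toy$x2 + rnorm(n)

# Hold out 20% of the rows as a test set
test_idx  <- sample(n, size = 0.2 * n)
toy_train <- toy[-test_idx, ]
toy_test  <- toy[test_idx, ]

# Two candidate models fit on the training rows only
fits <- list(
  fit_small = lm(y ~ x1, data = toy_train),
  fit_large = lm(y ~ x1 + x2, data = toy_train)
)

# Test-set RMSE for each candidate; the smallest value wins
holdout_rmse <- sapply(fits, function(fit) {
  rmse(toy_test$y, predict(fit, newdata = toy_test))
})
best_model <- names(which.min(holdout_rmse))
```

In-sample statistics such as `summary(fit)$r.squared` or `AIC(fit)` can be tabulated alongside `holdout_rmse`, but only the latter reflects error on rows the models did not see during fitting.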

Predict the Evaluation Set using the Final Model

# Predict the SalePrice using the final model
predictions <- predict(model_all_manual, eval_imputed)
# Save the predictions to a CSV file
submission <- data.frame(Id = eval$Id, SalePrice = predictions)
write.csv(submission, file = "submission.csv", row.names = FALSE)
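# Add PCA to the eval set
# NOTE: this fits a separate PCA on the eval set, so these components may not
# match the training components that model_pca_backwards was fit on
pca_eval <- prcomp(eval_imputed, scale. = TRUE)
eval_imputed <- cbind(eval_imputed, pca_eval$x)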
# Predict using the PCA model
predictions <- predict(model_pca_backwards, eval_imputed)
# Save the predictions to a CSV file
submission <- data.frame(Id = eval$Id, SalePrice = predictions)
write.csv(submission, file = "submission_pca.csv", row.names = FALSE)
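One caveat about the `prcomp()` call on `eval_imputed` above: fitting a new PCA on the evaluation set produces its own centers, scales, and rotation, so its `PC` columns need not correspond to the components a model was trained on. A hedged alternative is to fit the PCA once on the training predictors and project new rows with `predict()`. The sketch below uses simulated data; `train_x`, `new_x`, and `pca_train` are hypothetical names, since the training PCA object is not shown in this section.

```r
set.seed(1)
# Simulated stand-ins for the training and evaluation predictors (assumption)
train_x <- data.frame(a = rnorm(100), b = rnorm(100), c = rnorm(100))
new_x   <- data.frame(a = rnorm(20),  b = rnorm(20),  c = rnorm(20))

# Fit the PCA once, on the training predictors only
pca_train <- prcomp(train_x, center = TRUE, scale. = TRUE)

# Project the new rows using the training centers, scales, and rotation
new_pcs <- predict(pca_train, newdata = new_x)

# Equivalent manual computation, to show what predict() does here
manual_pcs <- scale(as.matrix(new_x),
                    center = pca_train$center,
                    scale  = pca_train$scale) %*% pca_train$rotation

# For contrast: refitting on the new rows gives a different rotation
pca_refit <- prcomp(new_x, center = TRUE, scale. = TRUE)
```

The columns of `new_pcs` are named `PC1`, `PC2`, and so on, so they can be bound to the new data with `cbind()` before calling `predict()` on a model whose formula uses those component names.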

Kaggle Submission

My Kaggle user name is Shaya Engelman. The Kaggle score for the final model was 0.36307. That is not a great score, but considering that I did not include any of the categorical variables (neighborhood, for example, is obviously an important predictor of house price), it is not terrible. I would love to revisit this competition when I have more time to add categorical variables and improve my model.
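For context, the House Prices competition scores submissions by the RMSE between the logarithm of the predicted price and the logarithm of the observed price, so the score is on a different scale from the dollar RMSE values reported earlier. A minimal sketch of that metric, which could be applied to a test set to estimate the leaderboard score before submitting; `rmsle()` is a hypothetical helper name and the prices are made up for illustration.

```r
# Hypothetical helper: RMSE on the log scale
# Both vectors must be strictly positive, otherwise log() returns NaN or -Inf
rmsle <- function(actual, predicted) {
  sqrt(mean((log(predicted) - log(actual))^2))
}

# Made-up prices, for illustration only
actual_price    <- c(200000, 150000, 320000)
predicted_price <- c(210000, 140000, 300000)

score <- rmsle(actual_price, predicted_price)
```

Because a linear model on the raw price scale can produce negative predictions, it is worth checking `min(predictions)` before applying a log-based metric.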