1. Data Exploration
Variables Description
#vars.description <- read.table(file = "C:/Users/owner/OneDrive/Documents/R/Data605_final/data_description.txt", header = TRUE)
#vars.description
train.df <- read.csv("https://raw.githubusercontent.com/asmozo24/Data605_House-Prices/main/train.csv", stringsAsFactors=FALSE)
test.df <- read.csv("https://raw.githubusercontent.com/asmozo24/Data605_House-Prices/main/test.csv", stringsAsFactors=FALSE)
#View()
str(train.df)
## 'data.frame': 1460 obs. of 81 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ MSSubClass : int 60 20 60 70 60 50 20 60 50 190 ...
## $ MSZoning : chr "RL" "RL" "RL" "RL" ...
## $ LotFrontage : int 65 80 68 60 84 85 75 NA 51 50 ...
## $ LotArea : int 8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
## $ Street : chr "Pave" "Pave" "Pave" "Pave" ...
## $ Alley : chr NA NA NA NA ...
## $ LotShape : chr "Reg" "Reg" "IR1" "IR1" ...
## $ LandContour : chr "Lvl" "Lvl" "Lvl" "Lvl" ...
## $ Utilities : chr "AllPub" "AllPub" "AllPub" "AllPub" ...
## $ LotConfig : chr "Inside" "FR2" "Inside" "Corner" ...
## $ LandSlope : chr "Gtl" "Gtl" "Gtl" "Gtl" ...
## $ Neighborhood : chr "CollgCr" "Veenker" "CollgCr" "Crawfor" ...
## $ Condition1 : chr "Norm" "Feedr" "Norm" "Norm" ...
## $ Condition2 : chr "Norm" "Norm" "Norm" "Norm" ...
## $ BldgType : chr "1Fam" "1Fam" "1Fam" "1Fam" ...
## $ HouseStyle : chr "2Story" "1Story" "2Story" "2Story" ...
## $ OverallQual : int 7 6 7 7 8 5 8 7 7 5 ...
## $ OverallCond : int 5 8 5 5 5 5 5 6 5 6 ...
## $ YearBuilt : int 2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
## $ YearRemodAdd : int 2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
## $ RoofStyle : chr "Gable" "Gable" "Gable" "Gable" ...
## $ RoofMatl : chr "CompShg" "CompShg" "CompShg" "CompShg" ...
## $ Exterior1st : chr "VinylSd" "MetalSd" "VinylSd" "Wd Sdng" ...
## $ Exterior2nd : chr "VinylSd" "MetalSd" "VinylSd" "Wd Shng" ...
## $ MasVnrType : chr "BrkFace" "None" "BrkFace" "None" ...
## $ MasVnrArea : int 196 0 162 0 350 0 186 240 0 0 ...
## $ ExterQual : chr "Gd" "TA" "Gd" "TA" ...
## $ ExterCond : chr "TA" "TA" "TA" "TA" ...
## $ Foundation : chr "PConc" "CBlock" "PConc" "BrkTil" ...
## $ BsmtQual : chr "Gd" "Gd" "Gd" "TA" ...
## $ BsmtCond : chr "TA" "TA" "TA" "Gd" ...
## $ BsmtExposure : chr "No" "Gd" "Mn" "No" ...
## $ BsmtFinType1 : chr "GLQ" "ALQ" "GLQ" "ALQ" ...
## $ BsmtFinSF1 : int 706 978 486 216 655 732 1369 859 0 851 ...
## $ BsmtFinType2 : chr "Unf" "Unf" "Unf" "Unf" ...
## $ BsmtFinSF2 : int 0 0 0 0 0 0 0 32 0 0 ...
## $ BsmtUnfSF : int 150 284 434 540 490 64 317 216 952 140 ...
## $ TotalBsmtSF : int 856 1262 920 756 1145 796 1686 1107 952 991 ...
## $ Heating : chr "GasA" "GasA" "GasA" "GasA" ...
## $ HeatingQC : chr "Ex" "Ex" "Ex" "Gd" ...
## $ CentralAir : chr "Y" "Y" "Y" "Y" ...
## $ Electrical : chr "SBrkr" "SBrkr" "SBrkr" "SBrkr" ...
## $ X1stFlrSF : int 856 1262 920 961 1145 796 1694 1107 1022 1077 ...
## $ X2ndFlrSF : int 854 0 866 756 1053 566 0 983 752 0 ...
## $ LowQualFinSF : int 0 0 0 0 0 0 0 0 0 0 ...
## $ GrLivArea : int 1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
## $ BsmtFullBath : int 1 0 1 1 1 1 1 1 0 1 ...
## $ BsmtHalfBath : int 0 1 0 0 0 0 0 0 0 0 ...
## $ FullBath : int 2 2 2 1 2 1 2 2 2 1 ...
## $ HalfBath : int 1 0 1 0 1 1 0 1 0 0 ...
## $ BedroomAbvGr : int 3 3 3 3 4 1 3 3 2 2 ...
## $ KitchenAbvGr : int 1 1 1 1 1 1 1 1 2 2 ...
## $ KitchenQual : chr "Gd" "TA" "Gd" "Gd" ...
## $ TotRmsAbvGrd : int 8 6 6 7 9 5 7 7 8 5 ...
## $ Functional : chr "Typ" "Typ" "Typ" "Typ" ...
## $ Fireplaces : int 0 1 1 1 1 0 1 2 2 2 ...
## $ FireplaceQu : chr NA "TA" "TA" "Gd" ...
## $ GarageType : chr "Attchd" "Attchd" "Attchd" "Detchd" ...
## $ GarageYrBlt : int 2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
## $ GarageFinish : chr "RFn" "RFn" "RFn" "Unf" ...
## $ GarageCars : int 2 2 2 3 3 2 2 2 2 1 ...
## $ GarageArea : int 548 460 608 642 836 480 636 484 468 205 ...
## $ GarageQual : chr "TA" "TA" "TA" "TA" ...
## $ GarageCond : chr "TA" "TA" "TA" "TA" ...
## $ PavedDrive : chr "Y" "Y" "Y" "Y" ...
## $ WoodDeckSF : int 0 298 0 0 192 40 255 235 90 0 ...
## $ OpenPorchSF : int 61 0 42 35 84 30 57 204 0 4 ...
## $ EnclosedPorch: int 0 0 0 272 0 0 0 228 205 0 ...
## $ X3SsnPorch : int 0 0 0 0 0 320 0 0 0 0 ...
## $ ScreenPorch : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolArea : int 0 0 0 0 0 0 0 0 0 0 ...
## $ PoolQC : chr NA NA NA NA ...
## $ Fence : chr NA NA NA NA ...
## $ MiscFeature : chr NA NA NA NA ...
## $ MiscVal : int 0 0 0 0 0 700 0 350 0 0 ...
## $ MoSold : int 2 5 9 2 12 10 8 11 4 1 ...
## $ YrSold : int 2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
## $ SaleType : chr "WD" "WD" "WD" "WD" ...
## $ SaleCondition: chr "Normal" "Normal" "Normal" "Abnorml" ...
## $ SalePrice : int 208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
We have 1460 observations of 81 variables: the dataset has a large number of variables but is not large overall. We want to check the distribution of the response variable (SalePrice) and of three independent variables: YearBuilt (original construction date), LotArea (lot size in square feet), and OverallCond (rates the overall condition of the house, on a scale of 1 to 10). Based on real-life experience, we assume there is a relationship between the sale price and these independent variables.
Sample data of the new data frame, train.df1 (4 variables selected)
train.df1 <- dplyr::select (train.df, SalePrice, YearBuilt, LotArea, OverallCond)
head(train.df1)
## SalePrice YearBuilt LotArea OverallCond
## 1 208500 2003 8450 5
## 2 181500 1976 9600 8
## 3 223500 2001 11250 5
## 4 140000 1915 9550 5
## 5 250000 2000 14260 5
## 6 143000 1993 14115 5
Summary of train.df1
library(pastecs)
## Warning: package 'pastecs' was built under R version 4.0.5
## 
## Attaching package: 'pastecs'
## The following objects are masked from 'package:data.table':
## 
##     first, last
## The following object is masked from 'package:tidyr':
## 
##     extract
## The following objects are masked from 'package:dplyr':
## 
##     first, last
options(scipen=10)
options(digits=2)
stat.desc( train.df1, norm=FALSE, p=0.95) #basic=T,desc=F
##                  SalePrice   YearBuilt     LotArea OverallCond
## nbr.val            1460.00    1460.000     1460.00    1460.000
## nbr.null              0.00       0.000        0.00       0.000
## nbr.na                0.00       0.000        0.00       0.000
## min               34900.00    1872.000     1300.00       1.000
## max              755000.00    2010.000   215245.00       9.000
## range            720100.00     138.000   213945.00       8.000
## sum           264144946.00 2878051.000 15354569.00    8140.000
## median           163000.00    1973.000     9478.50       5.000
## mean             180921.20    1971.268    10516.83       5.575
## SE.mean            2079.11       0.790      261.22       0.029
## CI.mean.0.95       4078.35       1.551      512.41       0.057
## var          6311111264.30     912.215 99625649.65       1.238
## std.dev           79442.50      30.203     9981.26       1.113
## coef.var              0.44       0.015        0.95       0.200
Data distribution
#checking variable distribution
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.0.5
plot_H <- train.df1 %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~ key, scales = "free") +
  geom_histogram(bins = 25) +
  theme_wsj() +
  scale_colour_wsj("colors6")
plot_H

Data Cleaning: Fixing Variables with Missing Values
Let's see the distribution of missing values across the selected training variables.
## The total of missing values is : 0

Plotting the response variable against the independent variables.
We will add a new variable called HouseAge. Since there is no note about when this data was collected, we will pick 2015 as the reference year; this does not mean the sale prices reflect the 2015 market.
Scatter plots + fitted regression line
train.df2 <- train.df1 %>%
mutate(HouseAge = 2015 - YearBuilt)
#View(train.df2)
plots <- function(df, x, y) {
  ggplot(df, aes(x = x, y = y)) +
    geom_point() +
    geom_smooth(method = "lm", se = FALSE) +
    ggtitle("House Sale Price against Predictor") +
    xlab("x") + ylab("y") +
    # Customize the color, size, and face of the title and axis labels
    theme(
      plot.title = element_text(color = "red", size = 14, face = "bold.italic"),
      axis.title.x = element_text(color = "blue", size = 12, face = "bold"),
      axis.title.y = element_text(color = "#993333", size = 12, face = "bold"))
}
print(" y = House Sale Price (in dollars) Vs. x = Original construction date")
## [1] " y = House Sale Price (in dollars) Vs. x = Original construction date"
plots (train.df2, train.df2$YearBuilt,train.df2$SalePrice)
## `geom_smooth()` using formula 'y ~ x'

print(" y = House Sale Price (in dollars) Vs. x = House Age (in years) ")
## [1] " y = House Sale Price (in dollars) Vs. x = House Age (in years) "
plots (train.df2, train.df2$HouseAge,train.df2$SalePrice)
## `geom_smooth()` using formula 'y ~ x'

print(" y = House Sale Price (in dollars) Vs. x = Rates the overall condition of the house (scale 1 to 10)")
## [1] " y = House Sale Price (in dollars) Vs. x = Rates the overall condition of the house (scale 1 to 10)"
plots (train.df2, train.df2$OverallCond,train.df2$SalePrice)
## `geom_smooth()` using formula 'y ~ x'

print(" y = House Sale Price (in dollars) Vs. x = Lot size (in square feet)")
## [1] " y = House Sale Price (in dollars) Vs. x = Lot size (in square feet)"
plots (train.df2, train.df2$LotArea,train.df2$SalePrice)
## `geom_smooth()` using formula 'y ~ x'

The scatter plots show a weak linear relationship between the house sale price and the predictors, except for the original construction year (and, equivalently, house age). We suspect the relationship between sale price and house age may actually be negative exponential rather than linear. There is also a slight linear relationship between sale price and overall condition, and there are many outliers.
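One quick way to probe the exponential hypothesis is to refit the construction-year relationship on a log scale; a minimal sketch, assuming a log-linear form (this check is ours, not part of the original analysis):
# Sketch: if SalePrice grows roughly exponentially with YearBuilt, then
# log(SalePrice) should be approximately linear in YearBuilt.
log.fit <- lm(log(SalePrice) ~ YearBuilt, data = train.df2)
summary(log.fit)$r.squared  # compare against the untransformed fit's R-squared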
Correlation
library(corrr)
## Warning: package 'corrr' was built under R version 4.0.5
cor1 <- cor(train.df2, method = "pearson", use = "complete.obs")
kable(cor1)
|            | SalePrice| YearBuilt| LotArea| OverallCond| HouseAge|
|:-----------|---------:|---------:|-------:|-----------:|--------:|
|SalePrice   |      1.00|      0.52|    0.26|       -0.08|    -0.52|
|YearBuilt   |      0.52|      1.00|    0.01|       -0.38|    -1.00|
|LotArea     |      0.26|      0.01|    1.00|       -0.01|    -0.01|
|OverallCond |     -0.08|     -0.38|   -0.01|        1.00|     0.38|
|HouseAge    |     -0.52|     -1.00|   -0.01|        0.38|     1.00|
ggcorr(train.df2, label = TRUE , label_alpha =TRUE )

#corrplot(train_m, order="hclust", type="upper", tl.col="grey", tl.srt=45, addCoef.col = "yellow")
The correlation output does not show anything strong. As mentioned above, the most tangible relationship is between house sale price and construction year (or house age): correlation = 0.52, a moderate value. Not bad!
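Since corrr is already loaded, an equivalent way to pull only the correlations with SalePrice, as a sketch (correlate() and focus() are corrr functions):
# Sketch: correlations of each predictor with SalePrice via corrr.
train.df2 %>%
  correlate(method = "pearson", use = "complete.obs") %>%
  focus(SalePrice)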
Hypothesis Testing
Test the hypothesis that the correlation between each pairwise set of variables is 0, and provide an 80% confidence interval.
House Sale Price and Built Year
Null Hypothesis: the correlation between house sale price and built year is 0.
Alternative Hypothesis: the correlation between house sale price and built year is not 0.
cor.test(train.df2$YearBuilt, train.df2$SalePrice, method = "pearson" , conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: train.df2$YearBuilt and train.df2$SalePrice
## t = 23, df = 1458, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.50 0.55
## sample estimates:
## cor
## 0.52
The p-value < 2e-16 is far below the 0.2 threshold implied by the 80% confidence level, so the result is highly significant. Thus, we reject the null hypothesis: the true correlation is not equal to 0.
House Sale Price and Lot Size
Null Hypothesis: the correlation between house sale price and lot size is 0.
Alternative Hypothesis: the correlation between house sale price and lot size is not 0.
cor.test(train.df2$LotArea, train.df2$SalePrice, method = "pearson" , conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: train.df2$LotArea and train.df2$SalePrice
## t = 10, df = 1458, p-value <2e-16
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## 0.23 0.29
## sample estimates:
## cor
## 0.26
The p-value < 2e-16 is far below the 0.2 threshold, so we reject the null hypothesis: the true correlation is not equal to 0. However, the correlation is weak.
House Sale Price and Overall House Condition
Null Hypothesis: the correlation between house sale price and the overall house condition is 0.
Alternative Hypothesis: the correlation between house sale price and overall house condition is not 0.
cor.test(train.df2$OverallCond, train.df2$SalePrice, method = "pearson" , conf.level = 0.80)
##
## Pearson's product-moment correlation
##
## data: train.df2$OverallCond and train.df2$SalePrice
## t = -3, df = 1458, p-value = 0.003
## alternative hypothesis: true correlation is not equal to 0
## 80 percent confidence interval:
## -0.111 -0.044
## sample estimates:
## cor
## -0.078
The p-value = 0.003 is below the 0.2 threshold, so we reject the null hypothesis: the true correlation is not equal to 0. However, the correlation is very weak.
Familywise Error Rate (Alpha Inflation)
The familywise error rate (FWE or FWER) is the probability of making at least one false rejection in a series of hypothesis tests. We will not adjust for it here because the p-values found are far below any reasonably adjusted threshold.
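For completeness, a short sketch of what a Bonferroni adjustment would look like (the p-values are the upper bounds reported by the cor.test calls above; p.adjust() is base R):
# Sketch: Bonferroni-adjust the three pairwise p-values. At the 80% confidence
# level alpha = 0.20; the adjusted p-values remain far below it, so the
# conclusions survive the familywise correction.
p.vals <- c(YearBuilt = 2e-16, LotArea = 2e-16, OverallCond = 0.003)
p.adjust(p.vals, method = "bonferroni")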
Linear Algebra and Correlation.
Invert your correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.)
library(matlib)
## Warning: package 'matlib' was built under R version 4.0.5
library(Matrix)
library(matrixcalc)
##
## Attaching package: 'matrixcalc'
## The following object is masked from 'package:matlib':
##
## vec
(cor2 <- Ginv(cor1))
##       [,1]  [,2]   [,3]   [,4] [,5]
## [1,]  1.55 -0.89 -0.398 -0.215    0
## [2,] -0.89  1.67  0.213  0.561    0
## [3,] -0.40  0.21  1.102  0.055    0
## [4,] -0.21  0.56  0.055  1.194    0
## [5,]  0.00  0.00  0.000  0.000    0
Multiply the correlation matrix by the precision matrix.
(cor1 %*% cor2)
##                        [,1]          [,2]           [,3]           [,4] [,5]
## SalePrice    0.999999999675  0.0000000053  0.00000000102  0.00000000063    0
## YearBuilt    0.000000001686  1.0000000053  0.00000000197  0.00000000102    0
## LotArea      0.000000000041  0.0000000028  0.99999999839  0.00000000071    0
## OverallCond -0.000000000668 -0.0000000015 -0.00000000013  0.99999999800    0
## HouseAge    -0.000000001686 -1.0000000053 -0.00000000197 -0.00000000102    0
Multiply the precision matrix by the correlation matrix.
(cor2 %*% cor1)
## SalePrice YearBuilt LotArea OverallCond HouseAge
## [1,] 0.99999999968 0.0000000017 0.000000000041 -0.00000000067 -0.0000000017
## [2,] 0.00000000531 1.0000000053 0.000000002758 -0.00000000153 -1.0000000053
## [3,] 0.00000000102 0.0000000020 0.999999998394 -0.00000000013 -0.0000000020
## [4,] 0.00000000063 0.0000000010 0.000000000713 0.99999999800 -0.0000000010
## [5,] 0.00000000000 0.0000000000 0.000000000000 0.00000000000 0.0000000000
We do not see the identity matrix because cor1 is singular: HouseAge is a perfect linear function of YearBuilt (correlation -1), so the correlation matrix has no true inverse and Ginv() returns a generalized inverse instead, with a zero row and column. Rounding is not the issue.
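To confirm, a sketch that drops HouseAge (our choice of the collinear pair) so the remaining 4x4 correlation matrix is invertible with solve():
# Sketch: with HouseAge removed the matrix is non-singular, a true inverse
# exists, and the product recovers the identity matrix up to rounding.
cor.sub  <- cor1[1:4, 1:4]          # SalePrice, YearBuilt, LotArea, OverallCond
prec.sub <- solve(cor.sub)          # proper precision matrix
round(cor.sub %*% prec.sub, 10)     # identity matrix
diag(prec.sub)                      # variance inflation factors on the diagonal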
Conduct LU decomposition on the matrix.
lucor1 <- lu.decomposition( cor1 )
lucor1
## $L
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.000 0.00 0.000 0 0
## [2,] 0.523 1.00 0.000 0 0
## [3,] 0.264 -0.17 1.000 0 0
## [4,] -0.078 -0.46 -0.046 1 0
## [5,] -0.523 -1.00 0.000 0 1
##
## $U
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 0.52 0.26 -0.078 -0.52
## [2,] 0 0.73 -0.12 -0.335 -0.73
## [3,] 0 0.00 0.91 -0.042 0.00
## [4,] 0 0.00 0.00 0.837 0.00
## [5,] 0 0.00 0.00 0.000 0.00
Double-checking the LU decomposition
# all.equal() compares with a numeric tolerance; the element-wise == used
# previously triggered a warning and only tested the first element.
if (isTRUE(all.equal(lucor1$L %*% lucor1$U, cor1, check.attributes = FALSE))) {
  print("Good LU decomposition")
} else {
  print("LU decomposition needs a review")
}
## [1] "Good LU decomposition"
Modeling.
We have many missing values; some variables are nearly 100% missing. First, we will remove the variables with 45% or more missing values. Second, we will impute the variables with less than 45% missing values using the median.
These variables will be removed: PoolQC, MiscFeature, Alley, Fence, FireplaceQu.
These variables will be imputed: LotFrontage, GarageCond, GarageFinish, GarageQual, GarageType, GarageYrBlt, BsmtExposure, BsmtFinType2, BsmtCond, BsmtFinType1, BsmtQual, MasVnrArea, MasVnrType, Electrical.
Since multiple linear regression requires numeric inputs, we could transform the character variables into numeric ones; given the large number of variables, we will instead focus on the variables that already have a numeric data type (a sketch of one possible encoding follows).
We will use lm() to build our model and apply backward elimination (removing the least significant variables).
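If we did want to keep the character variables, a minimal sketch of one possible encoding (convert each to a factor, then to integer codes; this ignores any ordinal meaning and is not used in our final model):
# Sketch: integer-encode every character column of the training data.
train.enc <- train.df
chr.cols  <- sapply(train.enc, is.character)
train.enc[chr.cols] <- lapply(train.enc[chr.cols], function(x) as.integer(factor(x)))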
library(Hmisc)
## Warning: package 'Hmisc' was built under R version 4.0.5
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
train.df3 <- dplyr::select (train.df,-Id, -PoolQC, -MiscFeature ,-Alley, -Fence, -FireplaceQu)
# impute missing value by median
train.df3$LotFrontage <- impute( train.df3$LotFrontage , median)
train.df3$GarageCond <- impute( train.df3$GarageCond , median)
train.df3$GarageFinish <- impute( train.df3$GarageFinish , median)
train.df3$GarageQual <- impute( train.df3$GarageQual , median)
train.df3$GarageType <- impute( train.df3$GarageType , median)
train.df3$GarageYrBlt <- impute( train.df3$GarageYrBlt , median)
train.df3$BsmtExposure <- impute( train.df3$BsmtExposure , median)
## Warning in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]): argument
## is not numeric or logical: returning NA
train.df3$BsmtFinType2 <- impute( train.df3$BsmtFinType2 , median)
## Warning in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]): argument
## is not numeric or logical: returning NA
train.df3$BsmtCond <- impute( train.df3$BsmtCond , median)
train.df3$BsmtFinType1 <- impute( train.df3$BsmtFinType1 , median)
train.df3$BsmtQual <- impute( train.df3$BsmtQual , median)
train.df3$MasVnrArea <- impute( train.df3$MasVnrArea , median)
train.df3$MasVnrType <- impute( train.df3$MasVnrType , median)
## Warning in mean.default(sort(x, partial = half + 0L:1L)[half + 0L:1L]): argument
## is not numeric or logical: returning NA
train.df3$Electrical <- impute( train.df3$Electrical , median)
missing.values(train.df3)%>% kable()
## Warning: attributes are not identical across measure variables;
## they will be dropped
## `summarise()` has grouped output by 'variables'. You can override using the `.groups` argument.
|variables    | missing|
|:------------|-------:|
|BsmtExposure |      38|
|BsmtFinType2 |      38|
|MasVnrType   |       8|
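Note that median imputation only makes sense for numeric columns; the warnings above come from applying median() to character columns, which is why a few categorical variables still show missing values. A minimal sketch of mode imputation for those columns (impute_mode is a hypothetical helper, not an Hmisc function):
# Sketch: fill NAs in a character column with its most frequent level.
impute_mode <- function(x) {
  tab <- table(x)                            # level frequencies (NAs excluded)
  x[is.na(x)] <- names(tab)[which.max(tab)]  # fill NAs with the modal level
  x
}
# e.g. train.df3$BsmtExposure <- impute_mode(train.df3$BsmtExposure)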
class(train.df3$BsmtExposure)
## [1] "impute"
#data.class(train.df3$BsmtExposure)
# View(train.df3)
#train.df3[complete.cases(train.df3), ]
train.df4 <- na.omit(train.df3)
dim(train.df4)
## [1] 1413 75
missing.values(train.df4)%>% kable()
## Warning: attributes are not identical across measure variables;
## they will be dropped
## `summarise()` has grouped output by 'variables'. You can override using the `.groups` argument.
gg_miss_var(train.df4, show_pct = TRUE) + labs(y = "Missing Values in % to total record")+ theme_update()

# We could transform some of these variables into numeric; instead, we select the ones that are already numeric
train.df5 <- dplyr::select_if(train.df4, is.numeric)
lm1 <- lm(SalePrice ~., data = train.df5)
summary(lm1)
##
## Call:
## lm(formula = SalePrice ~ ., data = train.df5)
##
## Residuals:
## Min 1Q Median 3Q Max
## -479749 -16470 -2345 13831 302053
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 120649.949 1444577.512 0.08 0.93345
## MSSubClass -176.715 28.256 -6.25 0.00000000053 ***
## LotFrontage -72.718 52.305 -1.39 0.16467
## LotArea 0.433 0.103 4.21 0.00002706031 ***
## OverallQual 17391.550 1211.302 14.36 < 2e-16 ***
## OverallCond 4910.814 1055.495 4.65 0.00000359261 ***
## YearBuilt 282.119 69.490 4.06 0.00005186088 ***
## YearRemodAdd 166.463 70.430 2.36 0.01824 *
## MasVnrArea 30.792 6.019 5.12 0.00000035689 ***
## BsmtFinSF1 24.321 6.097 3.99 0.00006979028 ***
## BsmtFinSF2 14.110 8.206 1.72 0.08576 .
## BsmtUnfSF 15.219 5.940 2.56 0.01051 *
## TotalBsmtSF NA NA NA NA
## X1stFlrSF 44.820 6.894 6.50 0.00000000011 ***
## X2ndFlrSF 49.724 5.065 9.82 < 2e-16 ***
## LowQualFinSF 26.110 20.060 1.30 0.19328
## GrLivArea NA NA NA NA
## BsmtFullBath 9801.366 2632.104 3.72 0.00020 ***
## BsmtHalfBath 2242.247 4108.576 0.55 0.58533
## FullBath 1901.747 2908.400 0.65 0.51330
## HalfBath -2633.272 2708.924 -0.97 0.33118
## BedroomAbvGr -9261.867 1740.810 -5.32 0.00000012069 ***
## KitchenAbvGr -15531.065 5777.685 -2.69 0.00727 **
## TotRmsAbvGrd 5430.652 1262.501 4.30 0.00001816001 ***
## Fireplaces 4443.424 1810.613 2.45 0.01425 *
## GarageYrBlt 109.794 70.253 1.56 0.11832
## GarageCars 11075.196 2923.323 3.79 0.00016 ***
## GarageArea -3.324 10.069 -0.33 0.74135
## WoodDeckSF 22.148 8.109 2.73 0.00639 **
## OpenPorchSF -8.946 15.378 -0.58 0.56084
## EnclosedPorch 14.707 17.191 0.86 0.39240
## X3SsnPorch 24.730 31.899 0.78 0.43832
## ScreenPorch 55.364 17.258 3.21 0.00137 **
## PoolArea -30.882 23.906 -1.29 0.19664
## MiscVal -1.136 1.894 -0.60 0.54876
## MoSold -22.920 351.524 -0.07 0.94802
## YrSold -638.413 716.910 -0.89 0.37335
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34800 on 1378 degrees of freedom
## Multiple R-squared: 0.812, Adjusted R-squared: 0.807
## F-statistic: 174 on 34 and 1378 DF, p-value: <2e-16
#plot(lm1)
## Model 2 with only the significant variables
# train.df5 <- dplyr::select (train.df4,MSZoning ,LotArea , Street , LotConfig,LandSlope ,Condition1, Neighborhood, HouseStyle, OverallQual , OverallCond,YearBuilt ,RoofStyle ,RoofMatl, Exterior2nd, MasVnrArea,ExterQual,BsmtQual ,BsmtCond , BsmtExposure , BsmtFinType2, BsmtFinSF2 , BsmtUnfSF ,X1stFlrSF, X2ndFlrSF, BedroomAbvGr, KitchenAbvGr, KitchenQual,Functional, GarageQual,SalePrice)
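Backward elimination can also be automated; a minimal sketch using AIC-based step() from base R (this is not the manual p-value selection we actually used for Model 2 below):
# Sketch: drop the two aliased predictors (TotalBsmtSF, GrLivArea) whose
# coefficients were NA above, then run automated backward elimination.
lm1b    <- lm(SalePrice ~ . - TotalBsmtSF - GrLivArea, data = train.df5)
lm.step <- step(lm1b, direction = "backward", trace = 0)
summary(lm.step)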
Model 2
train.df6 <- dplyr::select (train.df5,MSSubClass , LotArea , OverallQual , OverallCond,YearBuilt, YearRemodAdd , MasVnrArea,BsmtFinSF1, BsmtUnfSF, X1stFlrSF, X2ndFlrSF,BsmtFullBath, BedroomAbvGr, KitchenAbvGr,TotRmsAbvGrd,Fireplaces, GarageCars, WoodDeckSF,ScreenPorch,SalePrice)
lm2 <- lm(SalePrice ~., data = train.df6)
summary(lm2)
##
## Call:
## lm(formula = SalePrice ~ ., data = train.df6)
##
## Residuals:
## Min 1Q Median 3Q Max
## -496010 -16475 -2014 13623 287203
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1072422.887 118493.033 -9.05 < 2e-16 ***
## MSSubClass -157.617 26.193 -6.02 0.0000000023 ***
## LotArea 0.424 0.100 4.23 0.0000254270 ***
## OverallQual 17923.757 1191.524 15.04 < 2e-16 ***
## OverallCond 4374.403 1028.318 4.25 0.0000224056 ***
## YearBuilt 305.931 52.920 5.78 0.0000000092 ***
## YearRemodAdd 206.456 66.438 3.11 0.00192 **
## MasVnrArea 30.334 5.957 5.09 0.0000004026 ***
## BsmtFinSF1 16.861 4.473 3.77 0.00017 ***
## BsmtUnfSF 8.674 4.434 1.96 0.05065 .
## X1stFlrSF 49.624 5.556 8.93 < 2e-16 ***
## X2ndFlrSF 46.129 4.170 11.06 < 2e-16 ***
## BsmtFullBath 9725.108 2445.664 3.98 0.0000735367 ***
## BedroomAbvGr -9163.348 1696.903 -5.40 0.0000000782 ***
## KitchenAbvGr -16145.471 5650.937 -2.86 0.00434 **
## TotRmsAbvGrd 5692.693 1232.527 4.62 0.0000042177 ***
## Fireplaces 3577.771 1770.169 2.02 0.04346 *
## GarageCars 10446.171 1731.250 6.03 0.0000000020 ***
## WoodDeckSF 23.510 7.976 2.95 0.00326 **
## ScreenPorch 52.485 16.945 3.10 0.00199 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 34900 on 1393 degrees of freedom
## Multiple R-squared: 0.809, Adjusted R-squared: 0.807
## F-statistic: 311 on 19 and 1393 DF, p-value: <2e-16
Prediction
#str(test.df)
#View(test.df)
# cleaning up
test.df1 <- dplyr::select (test.df, Id, MSSubClass , LotArea , OverallQual , OverallCond,YearBuilt, YearRemodAdd , MasVnrArea,BsmtFinSF1, X1stFlrSF, X2ndFlrSF,BsmtFullBath, BedroomAbvGr, KitchenAbvGr,TotRmsAbvGrd, GarageCars, WoodDeckSF,ScreenPorch)
missing.values(test.df1)%>% kable()
## `summarise()` has grouped output by 'variables'. You can override using the `.groups` argument.
|variables    | missing|
|:------------|-------:|
|MasVnrArea   |      15|
|BsmtFullBath |       2|
|BsmtFinSF1   |       1|
|GarageCars   |       1|
test.df2 <- test.df1
test.df2$MasVnrArea <- impute( test.df2$MasVnrArea , median)
test.df2$BsmtFullBath <- impute( test.df2$BsmtFullBath , median)
test.df2$BsmtFinSF1 <- impute( test.df2$BsmtFinSF1 , median)
test.df2$GarageCars <- impute( test.df2$GarageCars , median)
#length(test.df2)
#str(test.df2)
#length(train.df8)
#str(train.df8)
predict1 <- predict(lm4, test.df2)
#str(predict1)
SalePrice <- predict1
Id <- test.df2$Id
Predi_SalePrice <- data.frame(Id, SalePrice)
#View(Predi_SalePrice)
write.csv(Predi_SalePrice, file = "Predicting_House_Sale_Price.csv", quote = F, row.names = F)
Interpretation of the summary output of multiple linear regression model 4
F-statistic: how good is the relationship between the predictors and the response (SalePrice)? Here the F-statistic is 346, which indicates there is a relationship between the predictors and the response; its strength cannot be judged from the F-statistic alone.
R-squared: 0.806, meaning 80.6% of the variability in SalePrice is explained by YearBuilt, LotArea, MSSubClass, OverallCond, etc. Because we have several independent variables, adjusted R-squared < R-squared. R-squared is measured on a scale from 0 to 1, with R^2 = 1 being a perfect fit; 80.6% indicates a good fit to the training dataset.
Residual standard error: a measure of the quality of the regression fit; every regression model carries some error. Here the actual SalePrice deviates from the fitted regression line by about 34,900 dollars on average, given the independent variables (YearBuilt, LotArea, MSSubClass, OverallCond, etc.).
p-values: at the 95% confidence level, 2.2e-16 < 0.05, which indicates that changes in the predictors (YearBuilt, LotArea, MSSubClass, OverallCond, etc.) are associated with changes in the response variable (SalePrice).
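As a sanity check on the quoted statistics, a short sketch that recomputes the residual standard error and R-squared by hand (from lm2, the last model whose code appears in this section; lm4 is fitted earlier in the full report):
# Sketch: recompute headline fit statistics for lm2 and compare to summary(lm2).
res <- residuals(lm2)
rse <- sqrt(sum(res^2) / df.residual(lm2))                                        # residual standard error
r2  <- 1 - sum(res^2) / sum((train.df6$SalePrice - mean(train.df6$SalePrice))^2)  # multiple R-squared
c(RSE = rse, R.squared = r2)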