#required packages
library(MASS)
library(car)
library(corrplot)
library(PerformanceAnalytics)
library(GGally)
library(RColorBrewer)
library(VIM)
library(dplyr)
library(mice)
library(pROC)
library(caret)
library(pscl)
library(ResourceSelection)
library(stringr)
library(vcd)
library(rcompanion)
library(lmtest)
library(fitdistrplus)
library(gmodels)
library(vembedr)
Link to sumbissions.csv file(that contains SalePrice prediction for testing dataset) : https://raw.githubusercontent.com/olga0503/DATA-621/master/submission.csv
This project analyses the dataset that contains information on approximately 1,460 residential homes in Ames, Iowa. The objective of the project is to build a regression model on the training data set to predict final price of each home.
The data set contains 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa:
#read training data set
data <- read.csv(file=
"https://raw.githubusercontent.com/olga0503/DATA-605/master/train.csv",
stringsAsFactors=T, header=T)
#read testing data set
data_testing <- read.csv(file=
"https://raw.githubusercontent.com/olga0503/DATA-605/master/test.csv",
stringsAsFactors=T, header=T)
#display first six entries
#head(data)
#head(data_testing)
#find dimentions
dim(data)
## [1] 1460 81
‘SalePrice’ - the property’s sale price in dollars. This is the target variable that you’re trying to predict.
‘MSSubClass’ - The building class
‘MSZoning’ - The general zoning classification
‘LotFrontage’ - Linear feet of street connected to property
‘LotArea’ - Lot size in square feet
‘Street’ - Type of road access
‘Alley’ - Type of alley access
‘LotShape’ - General shape of property
‘LandContour’ - Flatness of the property
‘Utilities’ - Type of utilities available
‘LotConfig’ - Lot configuration
‘LandSlope’ - Slope of property
‘Neighborhood’ - Physical locations within Ames city limits
‘Condition1’ - Proximity to main road or railroad
‘Condition2’ - Proximity to main road or railroad (if a second is present)
‘BldgType’- Type of dwelling
‘HouseStyle’ - Style of dwelling
‘OverallQual’ - Overall material and finish quality
‘OverallCond’ - Overall condition rating
‘YearBuilt’ - Original construction date
‘YearRemodAdd’ - Remodel date
‘RoofStyle’ - Type of roof
‘RoofMatl’ - Roof material
‘Exterior1st’ - Exterior covering on house
‘Exterior2nd’ - Exterior covering on house (if more than one material)
‘MasVnrType’ - Masonry veneer type
‘MasVnrArea’ - Masonry veneer area in square feet
‘ExterQual’ - Exterior material quality
‘ExterCond’ - Present condition of the material on the exterior
‘Foundation’- Type of foundation
‘BsmtQual’ - Height of the basement
‘BsmtCond’ - General condition of the basement
‘BsmtExposure’ - Walkout or garden level basement walls
‘BsmtFinType1’ - Quality of basement finished area
‘BsmtFinSF1’ - Type 1 finished square feet
‘BsmtFinType2’ - Quality of second finished area (if present)
‘BsmtFinSF2’ - Type 2 finished square feet
‘BsmtUnfSF’ - Unfinished square feet of basement area
‘TotalBsmtSF’ - Total square feet of basement area
‘Heating’ - Type of heating
‘HeatingQC’ - Heating quality and condition
‘CentralAir’ - Central air conditioning
‘Electrical’ - Electrical system
‘1stFlrSF’ - First Floor square feet
‘2ndFlrSF’ - Second floor square feet
‘LowQualFinSF’ - Low quality finished square feet (all floors)
‘GrLivArea’ - Above grade (ground) living area square feet
‘BsmtFullBath’ - Basement full bathrooms
‘BsmtHalfBath’ - Basement half bathrooms
‘FullBath’ - Full bathrooms above grade
‘HalfBath’ - Half baths above grade
‘Bedroom’ - Number of bedrooms above basement level
‘Kitchen’ - Number of kitchens
‘KitchenQual’ - Kitchen quality
‘TotRmsAbvGrd’ - Total rooms above grade (does not include bathrooms)
‘Functional’ - Home functionality rating
‘Fireplaces’ - Number of fireplaces
‘FireplaceQu’ - Fireplace quality
‘GarageType’ - Garage location
‘GarageYrBlt’ - Year garage was built
‘GarageFinish’ - Interior finish of the garage
‘GarageCars’ - Size of garage in car capacity
‘GarageArea’ - Size of garage in square feet
‘GarageQual’ - Garage quality
‘GarageCond’ - Garage condition
‘PavedDrive’ - Paved driveway
‘WoodDeckSF’ - Wood deck area in square feet
‘OpenPorchSF’ - Open porch area in square feet
‘EnclosedPorch’ - Enclosed porch area in square feet
‘3SsnPorch’ - Three season porch area in square feet
‘ScreenPorch’ - Screen porch area in square feet
‘PoolArea’ - Pool area in square feet
‘PoolQC’ - Pool quality
‘Fence’ - Fence quality
‘MiscFeature’ - Miscellaneous feature not covered in other categories
‘MiscVal’ - Value of miscellaneous feature
‘MoSold’ - Month Sold
‘YrSold’ - Year Sold
‘SaleType’ - Type of sale
‘SaleCondition’ - Condition of sale
The data set contains the variables of different types. The response variable ‘SalePrices’ is continuous quantitative variable. For example, the variables ‘Neighborhood’ and ‘RoofStyle’ are nominal categorical variables.
levels(factor(data$RoofStyle))
## [1] "Flat" "Gable" "Gambrel" "Hip" "Mansard" "Shed"
Whereas the variable ‘SaleCondition’ and ‘GarageQual’ are ordinal categorical variables.
levels(factor(data$SaleCondition))
## [1] "Abnorml" "AdjLand" "Alloca" "Family" "Normal" "Partial"
While the variables ‘CentralAir’ is a binary categorical variables that can take only “Y” or “N”.
levels(factor(data$CentralAir))
## [1] "N" "Y"
The variable ‘YrSold’ takes only 5 different values. So that, it was converted to categorical variable
levels(factor(data$YrSold))
## [1] "2006" "2007" "2008" "2009" "2010"
#convert the variables 'YrSold' from integer to factor since the variable can take only five different values
data <- data %>% mutate(YrSold=as.factor(YrSold))
## Warning: package 'bindrcpp' was built under R version 3.4.4
data_testing <- data_testing %>% mutate(YrSold=as.factor(YrSold))
Before performing summary statistics let’s start out by getting a high level view of what data is missing. The graph below depicts missing values for each variable in the data set.
#build function that counts missing values
count_nas <- function(data){
variable_name_column <- c()
number_missing_column <- c()
for (i in 2:ncol(data)){
variable_name <- colnames(data[i])
number_missing <- sum(is.na(data[i]))
variable_name_column <- c(variable_name_column,variable_name)
number_missing_column <- c(number_missing_column,number_missing)
}
missing_table <- data.frame(variable_name_column,number_missing_column)
missing_table <- missing_table %>% mutate(percentage=round(number_missing_column*100/nrow(data),2)) %>% arrange(desc(percentage))
missing_table
}
#chart for missing values
aggr(data[-1], prop = T, numbers = T, cex.axis=.5, cex.numbers = 0.1,
ylab=c("Proportion of missingness","Missingness Pattern"),
labels=names(data[-1]))
#count missing values
count_nas(data[2:length(data)-1])
## variable_name_column number_missing_column percentage
## 1 PoolQC 1453 99.52
## 2 MiscFeature 1406 96.30
## 3 Alley 1369 93.77
## 4 Fence 1179 80.75
## 5 FireplaceQu 690 47.26
## 6 LotFrontage 259 17.74
## 7 GarageType 81 5.55
## 8 GarageYrBlt 81 5.55
## 9 GarageFinish 81 5.55
## 10 GarageQual 81 5.55
## 11 GarageCond 81 5.55
## 12 BsmtExposure 38 2.60
## 13 BsmtFinType2 38 2.60
## 14 BsmtQual 37 2.53
## 15 BsmtCond 37 2.53
## 16 BsmtFinType1 37 2.53
## 17 MasVnrType 8 0.55
## 18 MasVnrArea 8 0.55
## 19 Electrical 1 0.07
## 20 MSSubClass 0 0.00
## 21 MSZoning 0 0.00
## 22 LotArea 0 0.00
## 23 Street 0 0.00
## 24 LotShape 0 0.00
## 25 LandContour 0 0.00
## 26 Utilities 0 0.00
## 27 LotConfig 0 0.00
## 28 LandSlope 0 0.00
## 29 Neighborhood 0 0.00
## 30 Condition1 0 0.00
## 31 Condition2 0 0.00
## 32 BldgType 0 0.00
## 33 HouseStyle 0 0.00
## 34 OverallQual 0 0.00
## 35 OverallCond 0 0.00
## 36 YearBuilt 0 0.00
## 37 YearRemodAdd 0 0.00
## 38 RoofStyle 0 0.00
## 39 RoofMatl 0 0.00
## 40 Exterior1st 0 0.00
## 41 Exterior2nd 0 0.00
## 42 ExterQual 0 0.00
## 43 ExterCond 0 0.00
## 44 Foundation 0 0.00
## 45 BsmtFinSF1 0 0.00
## 46 BsmtFinSF2 0 0.00
## 47 BsmtUnfSF 0 0.00
## 48 TotalBsmtSF 0 0.00
## 49 Heating 0 0.00
## 50 HeatingQC 0 0.00
## 51 CentralAir 0 0.00
## 52 X1stFlrSF 0 0.00
## 53 X2ndFlrSF 0 0.00
## 54 LowQualFinSF 0 0.00
## 55 GrLivArea 0 0.00
## 56 BsmtFullBath 0 0.00
## 57 BsmtHalfBath 0 0.00
## 58 FullBath 0 0.00
## 59 HalfBath 0 0.00
## 60 BedroomAbvGr 0 0.00
## 61 KitchenAbvGr 0 0.00
## 62 KitchenQual 0 0.00
## 63 TotRmsAbvGrd 0 0.00
## 64 Functional 0 0.00
## 65 Fireplaces 0 0.00
## 66 GarageCars 0 0.00
## 67 GarageArea 0 0.00
## 68 PavedDrive 0 0.00
## 69 WoodDeckSF 0 0.00
## 70 OpenPorchSF 0 0.00
## 71 EnclosedPorch 0 0.00
## 72 X3SsnPorch 0 0.00
## 73 ScreenPorch 0 0.00
## 74 PoolArea 0 0.00
## 75 MiscVal 0 0.00
## 76 MoSold 0 0.00
## 77 YrSold 0 0.00
## 78 SaleType 0 0.00
## 79 SaleCondition 0 0.00
count_nas(data_testing[2:length(data)-1])
## variable_name_column number_missing_column percentage
## 1 PoolQC 1456 99.79
## 2 MiscFeature 1408 96.50
## 3 Alley 1352 92.67
## 4 Fence 1169 80.12
## 5 FireplaceQu 730 50.03
## 6 LotFrontage 227 15.56
## 7 GarageYrBlt 78 5.35
## 8 GarageFinish 78 5.35
## 9 GarageQual 78 5.35
## 10 GarageCond 78 5.35
## 11 GarageType 76 5.21
## 12 BsmtCond 45 3.08
## 13 BsmtQual 44 3.02
## 14 BsmtExposure 44 3.02
## 15 BsmtFinType1 42 2.88
## 16 BsmtFinType2 42 2.88
## 17 MasVnrType 16 1.10
## 18 MasVnrArea 15 1.03
## 19 MSZoning 4 0.27
## 20 Utilities 2 0.14
## 21 BsmtFullBath 2 0.14
## 22 BsmtHalfBath 2 0.14
## 23 Functional 2 0.14
## 24 Exterior1st 1 0.07
## 25 Exterior2nd 1 0.07
## 26 BsmtFinSF1 1 0.07
## 27 BsmtFinSF2 1 0.07
## 28 BsmtUnfSF 1 0.07
## 29 TotalBsmtSF 1 0.07
## 30 KitchenQual 1 0.07
## 31 GarageCars 1 0.07
## 32 GarageArea 1 0.07
## 33 SaleType 1 0.07
## 34 MSSubClass 0 0.00
## 35 LotArea 0 0.00
## 36 Street 0 0.00
## 37 LotShape 0 0.00
## 38 LandContour 0 0.00
## 39 LotConfig 0 0.00
## 40 LandSlope 0 0.00
## 41 Neighborhood 0 0.00
## 42 Condition1 0 0.00
## 43 Condition2 0 0.00
## 44 BldgType 0 0.00
## 45 HouseStyle 0 0.00
## 46 OverallQual 0 0.00
## 47 OverallCond 0 0.00
## 48 YearBuilt 0 0.00
## 49 YearRemodAdd 0 0.00
## 50 RoofStyle 0 0.00
## 51 RoofMatl 0 0.00
## 52 ExterQual 0 0.00
## 53 ExterCond 0 0.00
## 54 Foundation 0 0.00
## 55 Heating 0 0.00
## 56 HeatingQC 0 0.00
## 57 CentralAir 0 0.00
## 58 Electrical 0 0.00
## 59 X1stFlrSF 0 0.00
## 60 X2ndFlrSF 0 0.00
## 61 LowQualFinSF 0 0.00
## 62 GrLivArea 0 0.00
## 63 FullBath 0 0.00
## 64 HalfBath 0 0.00
## 65 BedroomAbvGr 0 0.00
## 66 KitchenAbvGr 0 0.00
## 67 TotRmsAbvGrd 0 0.00
## 68 Fireplaces 0 0.00
## 69 PavedDrive 0 0.00
## 70 WoodDeckSF 0 0.00
## 71 OpenPorchSF 0 0.00
## 72 EnclosedPorch 0 0.00
## 73 X3SsnPorch 0 0.00
## 74 ScreenPorch 0 0.00
## 75 PoolArea 0 0.00
## 76 MiscVal 0 0.00
## 77 MoSold 0 0.00
## 78 YrSold 0 0.00
## 79 SaleCondition 0 0.00
The chart shows that the variable ‘PoolQC’, ‘MiscFeature’ and ‘Alley’ are missing more than 90% of data while the variable ‘Fence’ is missing about 80% of data. The variable ‘FireplaceQu’ is missing a little bit less than 50% of its data whereas the variable ‘LotFrontage’ is missing about 18% of its data. All other variables are either missing less that 5% of the data or don’t contain missing values.
Since the variable ‘PoolQC’, ‘MiscFeature’ and ‘Alley’ are missing more than 90% of the data they probably lost their statistical power. That is why the variables were removed from the data set.
#remove the variables 'PoolQC', 'MiscFeature' and 'Alley' from the dataset
data <- data %>% dplyr::select(-PoolQC,-MiscFeature,-Alley)
data_testing <- data_testing %>% dplyr::select(-PoolQC,-MiscFeature,-Alley)
The missing values of the variable ‘Fence’, that is missing about 80% of its data, can be predictive of outcome. So that, it was decided to replace missing values of ‘Fence’ with “NONE”.
data <- data %>% mutate(Fence = as.factor(ifelse(is.na(Fence),"NONE",Fence)))
data_testing <- data_testing %>% mutate(Fence = as.factor(ifelse(is.na(Fence),"NONE",Fence)))
Instead of throwing out missing values of the remaining variables containing missing data they were reasonably imputed. The method cart’ (classification and regression trees) that is also known as decision trees was chosen for multiple imputation. So that, now we’ve got a lot more data to feed to a model.
#apply multiple imputation for training data set
#exclude variable'ID'
exclude <- c('Id')
index <- data[1]
include <- setdiff(names(data), exclude)
data_include <- data[include]
#imputation with mean
imp.data <- mice(data_include, m=5, method='cart', printFlag=FALSE)
#merge imputed values with data frame
data <- mice::complete(imp.data)
data <- data.frame(index,data)
#confirm no NAs
count_nas(data)
#write data
write.csv(data, file = "/Users/olga/downloads/data_imp.csv", row.names = FALSE)
#apply multiple imputation for testing data set
#exclude variable'ID'
exclude <- c('Id')
index <- data_testing[1]
include <- setdiff(names(data_testing), exclude)
data_testing_include <- data_testing[include]
#imputation with mean
imp.data_testing <- mice(data_testing_include, m=5, method='cart', printFlag=FALSE)
#merge imputed values with data frame
data_testing <- mice::complete(imp.data_testing)
data_testing <- data.frame(index,data_testing)
#confirm no NAs
count_nas(data_testing)
#write data
write.csv(data_testing, file = "/Users/olga/downloads/data_testing_imp.csv", row.names = FALSE)
Let’s define the independent variable ‘GrLivArea’ as X and the dependent variable ‘SalePrice’ as Y.
X <- data$GrLivArea
Y <- data$SalePrice
XY_df <- data.frame(X,Y)
#analyse distribution of X
par(mfrow=c(1,2))
plotNormalHistogram(X,main="GrLivArea")
qqnorm(X)
qqline(X)
#analyse distribution of Y
par(mfrow=c(1,2))
plotNormalHistogram(Y,main="SalePrice")
qqnorm(Y)
qqline(Y)
Based on histograms it can be concluded that the distributions of both variables are bell-shaped and unimodal. However, they are not symmetric and skewed to the right. It means that the means of both distributions are greater that their medians.
summary(XY_df)
## X Y
## Min. : 334 Min. : 34900
## 1st Qu.:1130 1st Qu.:129975
## Median :1464 Median :163000
## Mean :1515 Mean :180921
## 3rd Qu.:1777 3rd Qu.:214000
## Max. :5642 Max. :755000
Summary statistics confirmed that the mean is greater than the median for both distributions. According Q-Q plots, most of the points on the top and the bottom deviates from the line. The points form a curve instead of a straight line.
#calculate 3rd quartile of X
X_3d_quantile <- quantile(X, probs=0.75)
X_3d_quantile
## 75%
## 1776.75
#calculate 2nd quartile of Y
Y_2d_quantile <- quantile(Y, probs=0.50)
P1 <- nrow(subset(XY_df, X > X_3d_quantile && Y > Y_2d_quantile))/nrow(subset(XY_df, Y > Y_2d_quantile))
P1
## [1] 0
#calculate 3rd quartile of X
X_3d_quantile <- quantile(X, probs=0.75)
X_3d_quantile
## 75%
## 1776.75
#calculate 2nd quartile of Y
Y_2d_quantile <- quantile(Y, probs=0.50)
P2 <- nrow(subset(XY_df, X > X_3d_quantile,data=data && Y > Y_2d_quantile,data=data))/nrow(data)
P2
## [1] 0.25
#calculate 3rd quartile of X
X_3d_quantile <- quantile(X, probs=0.75)
X_3d_quantile
## 75%
## 1776.75
#calculate 2nd quartile of Y
Y_2d_quantile <- quantile(Y, probs=0.50)
P3 <- nrow(subset(XY_df, X < X_3d_quantile && Y > Y_2d_quantile))/nrow(subset(XY_df, Y > Y_2d_quantile))
P3
## [1] 2.005495
#find P(X<=x,Y<=y)
P4 <- nrow(subset(XY_df, X <= X_3d_quantile,data=data && Y <= Y_2d_quantile,data=data))/nrow(data)
#find P(X>x,Y<=y)
P5 <- nrow(subset(XY_df, X > X_3d_quantile,data=data && Y <= Y_2d_quantile,data=data))/nrow(data)
#find P(X<=x,Y>y)
P6 <- nrow(subset(XY_df, X <= X_3d_quantile,data=data && Y > Y_2d_quantile,data=data))/nrow(data)
first.column <- c("<=3rd quartile",">3rd quartile","Total")
second.column <- c(P4,P5, P4+P5)
third.column <- c(P6,P2,P6+P2)
forth.column <- c(P4+P6,P2+P5,P4+P6+P2+P5)
table <- data.frame(first.column,second.column,third.column,forth.column)
colnames(table) <- c("x/y","<=2d quartile",">2d quartile","Total")
table
## x/y <=2d quartile >2d quartile Total
## 1 <=3rd quartile 0.75 0.75 1.5
## 2 >3rd quartile 0.25 0.25 0.5
## 3 Total 1.00 1.00 2.0
Two discrete random variables A and B are called independent if: \(P(A,B) = P(A)P(B)\) \(P(A|B)=\frac{P(A,B)}{P(B)}=\frac{P(A)P(B)}{P(B)}=P(A)\)
#calculate 1st quartile of X
X_1st_quantile <- quantile(X, probs=0.25)
X_1st_quantile
## 25%
## 1129.5
#calculate 1st quartile of Y
Y_1st_quantile <- quantile(Y, probs=0.25)
Y_1st_quantile
## 25%
## 129975
#define variables A and B
A <- subset(XY_df, X > X_1st_quantile)
B <- subset(XY_df, Y > Y_1st_quantile)
P_A <- nrow(A)/nrow(data)
P_B <- nrow(B)/nrow(data)
P_A_and_P_B <-nrow(subset(XY_df, X > X_3d_quantile && Y > Y_2d_quantile))/nrow(data)
cat("P(A)*P(B) =",P_A*P_B,"\n")
## P(A)*P(B) = 0.5625
P_A_given_P_B = P1
cat("P(A)|P(B) =",P_A_given_P_B,"\n")
## P(A)|P(B) = 0
cat("Does P(A,B) = P(A)P(B)?",P_A*P_B==P_A_and_P_B,"\n")
## Does P(A,B) = P(A)P(B)? FALSE
cat("Does P(A)|P(B)=P(A)?",P_A_given_P_B==P_A)
## Does P(A)|P(B)=P(A)? FALSE
Two variables are NOT independent as the condition \(P(A,B) = P(A)P(B)\) is not met.
Let’s state the following hypothesis for a Chi Square test.
\(H_0\) : Variable A is independent of variable B \(H_A\) : Variable A is NOT independent of variable B.
chisq.test(A,B)
##
## Pearson's Chi-squared test
##
## data: A
## X-squared = 179460, df = 1094, p-value < 2.2e-16
Since p-value is less that the level of significance of 5% the null hypothesis is rejected. The variables A and B are NOT independent.
Let’s select three independent variables - ‘OpenPorchSF’,‘WoodDeckSF’ and ‘Garage Area’.
#create data set that contains one dependent variable and three independent variables
data_set <- data %>% dplyr::select(SalePrice,OpenPorchSF,WoodDeckSF,GarageArea)
#create scatterplots
par(mfrow=c(1,3))
for (i in 2:ncol(data_set)) {
if (is.numeric(data_set[,i]) == "TRUE"){
plot(data_set$SalePrice ~ data_set[,i],main=names(data_set)[i], xlab=names(data_set)[i],col="blue")
reg_line <- lm(data_set$SalePrice ~ data_set[,i])
abline(reg_line,col="red")
}
}
The explanatory variable ‘Garage Area’ has strong linear relationships with response variable ‘SalePrice’ while two other variables ‘OpenPorchSF’ and ‘WoodDeckSF’ are in week linear relationships with response variable.
#create histograms and Q-Q plots
for (i in 2:ncol(data_set)) {
par(mfrow=c(1,2))
plotNormalHistogram(data_set[,i],main=names(data_set)[i])
qqnorm(data_set[,i],main=names(data_set)[i])
qqline(data_set[,i])
}
Based on histograms it can be concluded that the distributions of both variables are bell-shaped and unimodal. However, they are not symmetric and skewed to the right. It means that the means of both distributions are greater that their medians.
summary(data_set)
## SalePrice OpenPorchSF WoodDeckSF GarageArea
## Min. : 34900 Min. : 0.00 Min. : 0.00 Min. : 0.0
## 1st Qu.:129975 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 334.5
## Median :163000 Median : 25.00 Median : 0.00 Median : 480.0
## Mean :180921 Mean : 46.66 Mean : 94.24 Mean : 473.0
## 3rd Qu.:214000 3rd Qu.: 68.00 3rd Qu.:168.00 3rd Qu.: 576.0
## Max. :755000 Max. :547.00 Max. :857.00 Max. :1418.0
Summary statistics confirmed that the mean is greater than the median for both distributions. According Q-Q plots, most of the points on the top and the bottom deviates from the line. The points form a curve instead of a straight line.
#correlation between variables
correlation <- corrplot(cor(data_set), type = "upper", method = "number", tl.cex = 0.7, tl.col="black",number.cex = 0.5)
correlation
## SalePrice OpenPorchSF WoodDeckSF GarageArea
## SalePrice 1.0000000 0.31585623 0.32441344 0.6234314
## OpenPorchSF 0.3158562 1.00000000 0.05866061 0.2414347
## WoodDeckSF 0.3244134 0.05866061 1.00000000 0.2246663
## GarageArea 0.6234314 0.24143467 0.22466631 1.0000000
The correlation plot shows that neither of the independent variables are strongly correlated with another independent variables. It suggests that potentially all three independent variables can be included to the regression model. The variables that are not strongly correlated with each other will not cost model overfitting.
In order to verify the correlation between independent variables let’s state the following hypothesis:
\(H_0\): There is no correlation \(\beta_1 = 0\) \(H_a\): There is a correlation \(\beta_1 \neq 0\)
t.test(data_set$OpenPorchSF,data_set$WoodDeckSF,conf.level = 0.92,data=data_set)
##
## Welch Two Sample t-test
##
## data: data_set$OpenPorchSF and data_set$WoodDeckSF
## t = -12.825, df = 2215.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 92 percent confidence interval:
## -54.08293 -41.08556
## sample estimates:
## mean of x mean of y
## 46.66027 94.24452
t.test(data_set$OpenPorchSF,data_set$GarageArea,conf.level = 0.92,data=data_set)
##
## Welch Two Sample t-test
##
## data: data_set$OpenPorchSF and data_set$GarageArea
## t = -72.775, df = 1736.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 92 percent confidence interval:
## -436.5815 -416.0583
## sample estimates:
## mean of x mean of y
## 46.66027 472.98014
t.test(data_set$WoodDeckSF,data_set$GarageArea,conf.level = 0.92,data=data_set)
##
## Welch Two Sample t-test
##
## data: data_set$WoodDeckSF and data_set$GarageArea
## t = -58.391, df = 2355.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 92 percent confidence interval:
## -390.0957 -367.3755
## sample estimates:
## mean of x mean of y
## 94.24452 472.98014
The T-test showed that the p-values of all three test is less that the level of significance of 5%. So that, the null hypothesis is rejected. There are correlation between all three pairs of independent variables.
cotrelation_matrix <- matrix(c(correlation[2,2], correlation[2,3], correlation[2,4],
correlation[3,2], correlation[3,3], correlation[3,4],
correlation[4,2], correlation[4,3], correlation[4,4]),
nrow=3, ncol=3, byrow = TRUE)
cotrelation_matrix
## [,1] [,2] [,3]
## [1,] 1.00000000 0.05866061 0.2414347
## [2,] 0.05866061 1.00000000 0.2246663
## [3,] 0.24143467 0.22466631 1.0000000
#inverse matrix
precision_matrix <- ginv(cotrelation_matrix)
#identity matrix
round(precision_matrix %*% cotrelation_matrix,2)
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
round(cotrelation_matrix %*% precision_matrix,2)
## [,1] [,2] [,3]
## [1,] 1 0 0
## [2,] 0 1 0
## [3,] 0 0 1
#create LU decomposition function
LU <- function(A,row_num,col_num) {
#create matrix L that consist of 0s
L = matrix(c(0:0),nrow=row_num, ncol=col_num, byrow = TRUE)
#assign 1s to the main diaginal
i=1
for (i in 1:3){
L[i,i]=1
}
#assign matrix A to initial matrix U
U <- A
print("matrix A")
print(A)
#start from the first column
col=1
#iterate until col_num-1
for (col in 1:(col_num-1)){
#start to iterate from the second row until final row
row=col+1
for (row in (col+1):row_num){
#U[row,col] = U[row,col]+x*U[col,col]
x=-U[row,col]/U[col,col]
U[row,col] = 0
#multiply the remaining elements of a current row by x
col2=col+1
for (col2 in (col+1):col_num){
U[row,col2] = U[row,col2]+x*U[col,col2]
print("matrix U")
print(U)
}
#assign x to a corresponding cell in matrix L
L[row,col] = -x
}
}
print("matrix L")
print(L)
#dot product of L and U should produce A
print("matrix LU=A")
print(L %*% U)
}
LU(cotrelation_matrix,3,3)
## [1] "matrix A"
## [,1] [,2] [,3]
## [1,] 1.00000000 0.05866061 0.2414347
## [2,] 0.05866061 1.00000000 0.2246663
## [3,] 0.24143467 0.22466631 1.0000000
## [1] "matrix U"
## [,1] [,2] [,3]
## [1,] 1.0000000 0.05866061 0.2414347
## [2,] 0.0000000 0.99655893 0.2246663
## [3,] 0.2414347 0.22466631 1.0000000
## [1] "matrix U"
## [,1] [,2] [,3]
## [1,] 1.0000000 0.05866061 0.2414347
## [2,] 0.0000000 0.99655893 0.2105036
## [3,] 0.2414347 0.22466631 1.0000000
## [1] "matrix U"
## [,1] [,2] [,3]
## [1,] 1 0.05866061 0.2414347
## [2,] 0 0.99655893 0.2105036
## [3,] 0 0.21050360 1.0000000
## [1] "matrix U"
## [,1] [,2] [,3]
## [1,] 1 0.05866061 0.2414347
## [2,] 0 0.99655893 0.2105036
## [3,] 0 0.21050360 0.9417093
## [1] "matrix U"
## [,1] [,2] [,3]
## [1,] 1 0.05866061 0.2414347
## [2,] 0 0.99655893 0.2105036
## [3,] 0 0.00000000 0.8972445
## [1] "matrix L"
## [,1] [,2] [,3]
## [1,] 1.00000000 0.0000000 0
## [2,] 0.05866061 1.0000000 0
## [3,] 0.24143467 0.2112305 1
## [1] "matrix LU=A"
## [,1] [,2] [,3]
## [1,] 1.00000000 0.05866061 0.2414347
## [2,] 0.05866061 1.00000000 0.2246663
## [3,] 0.24143467 0.22466631 1.0000000
fit_exp <- fitdistr(data$GrLivArea, "exponential")
#optimal value of λ for this distribution
lambda <- fit_exp$estimate
sample <- rexp(1000, lambda)
lambda
## rate
## 0.000659864
#histogram
sample<-data.frame(as.numeric(sample))
colnames(sample)[1] <- "sample"
str(sample)
## 'data.frame': 1000 obs. of 1 variable:
## $ sample: num 1532.8 4923.8 238.3 554.7 16.3 ...
hist(sample$sample, main="Histogram of Exponential Distribution", xlab = "GrLivArea", breaks=30)
hist(sample$sample, main="Histogram of GrLivArea", xlab = "GrLivArea", breaks=30)
The histograms of exponential distribution and original distribution of ‘GrLivArea’ are similar.
#5th percentile
qexp(.05,rate = lambda)
## [1] 77.73313
#95th percentile
qexp(.95, rate = lambda)
## [1] 4539.924
#95% confidence interval from the empirical data
ci(data$GrLivArea, 0.95)
## Estimate CI lower CI upper Std. Error
## 1515.46370 1488.48701 1542.44038 13.75245
‘GrLivArea’ falls into interval of (1488.48701,1542.44038) in 95% of the cases.
#5th percentile of GrLivArea
quantile(data$GrLivArea, .05)
## 5%
## 848
#95th percentile of GrLivArea
quantile(data$GrLivArea, .95)
## 95%
## 2466.1
Since the response variable is continuous it was decided to use GLM to build the model that predicts ‘SalePrice’.
The following assumptions must be verified for linear regression.
Independence of observations. It was assumed that all observations are independent.
Linearity assumption. Linear regression needs the relationship between the independent and dependent variables to be linear. Also, outliers need to be checked.
par(mfrow=c(2,2))
#colnames <- dimnames(data)[[2]]
for (i in 2:(ncol(data)-1)) {
if (is.numeric(data[,i]) == "TRUE"){
plot(data$SalePrice ~ data[,i],main=names(data)[i], xlab=names(data)[i],col="blue")
reg_line <- lm(data$SalePrice ~ data[,i])
abline(reg_line,col="red")
}
}
#ggplot(data, aes(SalePrice,LotFrontage)) + stat_smooth(method="loess") + stat_smooth(method="lm", color="red", fill="red", alpha=.25) + geom_point()
By looking at scatter plots, it can be concluded that the variables ‘Ms Subclass’, ‘OverallCond’, ‘BsmtFinSF1’,‘LowQualFinSF’, ‘BsmtHalfBath’ are in non-linear relationships with the response variable ‘SalePrice’. Thus, these variables need to be transformed.
# histograms, density lines and normal probability plots
for (i in 2:ncol(data)) {
if (is.numeric(data[,i]) == "TRUE"){
par(mfrow=c(1,2))
plotNormalHistogram(data[,i],main=names(data)[i])
qqnorm(data[,i],main=names(data)[i])
qqline(data[,i])
}}
Only variables ‘OverallQual’ and ‘GarageCars’ are close to normal distribution as their distributions are bell-shaped and symmetric. All other variables are either skewed to the right or skewed to the left. The variables that don’t meet normality assumption must be transformed.
#correlation between variables
corrplot(cor(data[2:ncol(data)] %>% select_if(is.numeric)), type = "upper", method = "number", tl.cex = 0.7, tl.col="black",number.cex = 0.5)
According to the correlation matrix, a few variables in the data set are strongly correlated with each other. For example, ‘GarageYrBlt’ and ‘YrBlt’ are strongly correlated (as usually garage is built in the same year house is built). Only one of these two variables should be included in the model. Otherwise, the model will be overfitted.
model = glm(SalePrice ~., data = data)
#Breush Pagan Test
bptest(model)
##
## studentized Breusch-Pagan test
##
## data: model
## BP = 867.76, df = 241, p-value < 2.2e-16
Breush Pagan Test p-value is less than a significance level of 5%, therefore we can reject the null hypothesis that the variance of the residuals is constant and infer that heteroscedasticity is present.
Only variables “OverallQual” and “GarageCars” meet assumptions requirments. All other variables that don’t meet either of the assumption requirements were transformed. The transformation were performed based Box-Cox method.
#create dataframe to store modified data
data_m <- data
#resolve problem with non-negative values
for (i in 2:(ncol(data_m)-1)){
for (j in 1:nrow(data_m)){
if (is.numeric(data_m[j,i])==TRUE && data_m[j,i] == 0 ) {
data_m[j,i]=data_m[j,i]+0.01}
}
}
head(data_m)
#data transformation
for (i in 2:ncol(data_m)) {
if (colnames(data_m[i])!="OverallQual" || colnames(data_m[i])!="GarageCars"){
if (is.numeric(data_m[,i])==TRUE){
print(colnames(data_m[i]))
Box = boxcox(data_m[,i] ~ 1, # Transform Turbidity as a single vector
lambda = seq(-10,10,1) # Try values -10 to 10 by 0.1
)
Cox = data.frame(Box$x, Box$y) # Create a data frame with the results
Cox2 = Cox[with(Cox, order(-Cox$Box.y)),] # Order the new data frame by decreasing y
#Cox2[1,] # Display the lambda with the greatest
# log likelihood
lambda = Cox2[1, "Box.x"] # Extract that lambda
print(lambda)
data_m[,i] = (data_m[,i] ^ lambda - 1)/lambda # Transform the original data
#plotNormalHistogram(data[,i],main=names(data)[i])
#qqnorm(data[,i],main=names(data)[i])
#qqline(data[,i])
}
}
}
In order to find the best regression model it was decided to implement the stepwise approach that analyses all combination of variables and selects the best regression model based on lowest AIC (Akaike’s criterion) value. Lower values of AIC indicate the preferred model, that is, the one with the fewest parameters that still provides an adequate fit to the data.
#build glm model using stepwise approach
linear_model.null = glm(SalePrice ~ 1, data = data_m)
#linear_model.null <- glm(SalePrice ~ 1, family=gaussian(link="log"), data = data)
linear_model.full = glm(SalePrice ~ ., data = data_m)
#linear_model.full <- glm(SalePrice ~ ., data = data, family=gaussian(link="log"))
step(linear_model.null,
scope = list(upper=linear_model.full),
direction = "both",
data = data_m)
Let’s review summary statistics for the optimal model (model with lowest AIC).
final_model <- glm(formula = SalePrice ~ OverallQual + GrLivArea + Neighborhood +
BsmtFinSF1 + OverallCond + GarageCars + TotalBsmtSF + RoofMatl +
YearBuilt + LotArea + MSZoning + Condition2 + KitchenQual +
SaleCondition + Functional + BsmtUnfSF + Condition1 + CentralAir +
Fireplaces + Heating + HeatingQC + BsmtExposure + GarageCond +
KitchenAbvGr + YearRemodAdd + Exterior1st + BsmtQual + PoolArea +
LotConfig + ScreenPorch + Foundation + SaleType + BsmtFullBath +
LandSlope + GarageArea + GarageType + GarageQual + LowQualFinSF +
WoodDeckSF + BsmtFinSF2 + ExterCond + EnclosedPorch + X1stFlrSF +
HalfBath, data = data)
summary(final_model)
##
## Call:
## glm(formula = SalePrice ~ OverallQual + GrLivArea + Neighborhood +
## BsmtFinSF1 + OverallCond + GarageCars + TotalBsmtSF + RoofMatl +
## YearBuilt + LotArea + MSZoning + Condition2 + KitchenQual +
## SaleCondition + Functional + BsmtUnfSF + Condition1 + CentralAir +
## Fireplaces + Heating + HeatingQC + BsmtExposure + GarageCond +
## KitchenAbvGr + YearRemodAdd + Exterior1st + BsmtQual + PoolArea +
## LotConfig + ScreenPorch + Foundation + SaleType + BsmtFullBath +
## LandSlope + GarageArea + GarageType + GarageQual + LowQualFinSF +
## WoodDeckSF + BsmtFinSF2 + ExterCond + EnclosedPorch + X1stFlrSF +
## HalfBath, data = data)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -187686 -10406 203 10194 187686
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.510e+06 1.750e+05 -8.627 < 2e-16 ***
## OverallQual 6.690e+03 9.857e+02 6.787 1.73e-11 ***
## GrLivArea 5.626e+01 2.687e+00 20.937 < 2e-16 ***
## NeighborhoodBlueste 5.692e+03 1.896e+04 0.300 0.764050
## NeighborhoodBrDale 5.597e+03 1.022e+04 0.548 0.583905
## NeighborhoodBrkSide 1.813e+04 8.318e+03 2.179 0.029487 *
## NeighborhoodClearCr 4.694e+03 8.416e+03 0.558 0.577099
## NeighborhoodCollgCr 8.634e+03 6.415e+03 1.346 0.178562
## NeighborhoodCrawfor 2.898e+04 7.785e+03 3.722 0.000206 ***
## NeighborhoodEdwards -1.962e+03 7.232e+03 -0.271 0.786234
## NeighborhoodGilbert 8.265e+03 6.807e+03 1.214 0.224912
## NeighborhoodIDOTRR 1.782e+04 9.605e+03 1.856 0.063732 .
## NeighborhoodMeadowV -6.387e+03 1.073e+04 -0.595 0.551632
## NeighborhoodMitchel -5.713e+03 7.385e+03 -0.774 0.439306
## NeighborhoodNAmes 1.524e+03 6.988e+03 0.218 0.827423
## NeighborhoodNoRidge 5.146e+04 7.558e+03 6.808 1.50e-11 ***
## NeighborhoodNPkVill 9.193e+03 1.081e+04 0.850 0.395208
## NeighborhoodNridgHt 3.147e+04 6.866e+03 4.583 5.02e-06 ***
## NeighborhoodNWAmes 1.689e+03 7.271e+03 0.232 0.816313
## NeighborhoodOldTown 1.075e+04 8.409e+03 1.278 0.201521
## NeighborhoodSawyer 6.341e+03 7.372e+03 0.860 0.389876
## NeighborhoodSawyerW 1.246e+04 7.141e+03 1.744 0.081339 .
## NeighborhoodSomerst 1.786e+04 8.241e+03 2.167 0.030448 *
## NeighborhoodStoneBr 4.429e+04 7.982e+03 5.549 3.47e-08 ***
## NeighborhoodSWISU 3.684e+03 9.004e+03 0.409 0.682514
## NeighborhoodTimber 3.457e+03 7.405e+03 0.467 0.640732
## NeighborhoodVeenker 1.153e+04 1.019e+04 1.132 0.258027
## BsmtFinSF1 1.332e+01 4.516e+00 2.950 0.003238 **
## OverallCond 5.883e+03 8.514e+02 6.909 7.58e-12 ***
## GarageCars 3.001e+03 2.206e+03 1.360 0.173980
## TotalBsmtSF 2.445e+01 5.593e+00 4.371 1.33e-05 ***
## RoofMatlCompShg 6.733e+05 3.049e+04 22.081 < 2e-16 ***
## RoofMatlMembran 7.387e+05 4.204e+04 17.570 < 2e-16 ***
## RoofMatlMetal 7.043e+05 4.119e+04 17.100 < 2e-16 ***
## RoofMatlRoll 6.717e+05 3.960e+04 16.964 < 2e-16 ***
## RoofMatlTar&Grv 6.666e+05 3.169e+04 21.039 < 2e-16 ***
## RoofMatlWdShake 6.948e+05 3.310e+04 20.994 < 2e-16 ***
## RoofMatlWdShngl 7.256e+05 3.188e+04 22.758 < 2e-16 ***
## YearBuilt 3.132e+02 7.124e+01 4.397 1.19e-05 ***
## LotArea 6.026e-01 9.483e-02 6.354 2.88e-10 ***
## MSZoningFV 3.792e+04 1.203e+04 3.152 0.001660 **
## MSZoningRH 3.662e+04 1.208e+04 3.032 0.002480 **
## MSZoningRL 3.761e+04 1.014e+04 3.710 0.000216 ***
## MSZoningRM 2.636e+04 9.540e+03 2.763 0.005803 **
## Condition2Feedr 2.765e+03 2.269e+04 0.122 0.903010
## Condition2Norm -7.113e+02 1.947e+04 -0.037 0.970866
## Condition2PosA 6.267e+04 3.581e+04 1.750 0.080341 .
## Condition2PosN -2.177e+05 2.740e+04 -7.944 4.19e-15 ***
## Condition2RRAe -2.555e+04 3.154e+04 -0.810 0.417955
## Condition2RRAn -2.020e+04 3.158e+04 -0.640 0.522453
## Condition2RRNn 1.037e+04 2.673e+04 0.388 0.698050
## KitchenQualFa -2.507e+04 6.034e+03 -4.155 3.46e-05 ***
## KitchenQualGd -3.207e+04 3.343e+03 -9.596 < 2e-16 ***
## KitchenQualTA -2.960e+04 3.810e+03 -7.770 1.58e-14 ***
## SaleConditionAdjLand 1.538e+04 1.330e+04 1.156 0.247828
## SaleConditionAlloca 5.194e+03 8.447e+03 0.615 0.538692
## SaleConditionFamily 5.630e+02 6.203e+03 0.091 0.927699
## SaleConditionNormal 7.095e+03 2.911e+03 2.437 0.014927 *
## SaleConditionPartial -6.102e+03 1.523e+04 -0.401 0.688778
## FunctionalMaj2 7.526e+03 1.350e+04 0.557 0.577400
## FunctionalMin1 8.934e+03 8.289e+03 1.078 0.281315
## FunctionalMin2 9.165e+03 8.238e+03 1.112 0.266150
## FunctionalMod 2.157e+03 1.002e+04 0.215 0.829639
## FunctionalSev -3.922e+04 2.841e+04 -1.380 0.167671
## FunctionalTyp 2.013e+04 7.101e+03 2.835 0.004657 **
## BsmtUnfSF -5.785e+00 4.689e+00 -1.234 0.217560
## Condition1Feedr 8.007e+03 4.990e+03 1.605 0.108836
## Condition1Norm 1.384e+04 4.100e+03 3.375 0.000761 ***
## Condition1PosA 7.346e+03 9.953e+03 0.738 0.460620
## Condition1PosN 7.136e+03 7.357e+03 0.970 0.332243
## Condition1RRAe -1.321e+04 9.273e+03 -1.424 0.154625
## Condition1RRAn 5.242e+03 6.813e+03 0.769 0.441782
## Condition1RRNe 3.304e+03 1.799e+04 0.184 0.854319
## Condition1RRNn 5.062e+03 1.254e+04 0.404 0.686440
## CentralAirY -2.202e+03 3.672e+03 -0.600 0.548896
## Fireplaces 2.686e+03 1.340e+03 2.004 0.045293 *
## HeatingGasA -4.495e+03 2.547e+04 -0.176 0.859961
## HeatingGasW -9.147e+03 2.622e+04 -0.349 0.727257
## HeatingGrav -2.373e+02 2.736e+04 -0.009 0.993081
## HeatingOthW -3.408e+04 3.133e+04 -1.088 0.276851
## HeatingWall 8.715e+03 2.902e+04 0.300 0.763962
## HeatingQCFa -3.420e+03 4.612e+03 -0.741 0.458530
## HeatingQCGd -3.845e+03 2.102e+03 -1.829 0.067591 .
## HeatingQCPo 7.007e+03 2.634e+04 0.266 0.790254
## HeatingQCTA -4.305e+03 2.078e+03 -2.071 0.038509 *
## BsmtExposureGd 1.704e+04 2.962e+03 5.754 1.09e-08 ***
## BsmtExposureMn -2.404e+03 2.933e+03 -0.820 0.412519
## BsmtExposureNo -4.876e+03 2.040e+03 -2.390 0.017009 *
## GarageCondFa 1.022e+05 3.448e+04 2.965 0.003080 **
## GarageCondGd 1.004e+05 3.547e+04 2.831 0.004716 **
## GarageCondPo 1.131e+05 3.576e+04 3.162 0.001605 **
## GarageCondTA 1.063e+05 3.412e+04 3.114 0.001885 **
## KitchenAbvGr -1.923e+04 3.764e+03 -5.110 3.71e-07 ***
## YearRemodAdd 9.116e+01 5.334e+01 1.709 0.087666 .
## Exterior1stAsphShn -9.359e+02 2.579e+04 -0.036 0.971055
## Exterior1stBrkComm -3.794e+03 2.010e+04 -0.189 0.850319
## Exterior1stBrkFace 7.675e+03 7.175e+03 1.070 0.284939
## Exterior1stCBlock -1.671e+04 2.556e+04 -0.654 0.513315
## Exterior1stCemntBd 1.991e+02 7.498e+03 0.027 0.978822
## Exterior1stHdBoard -9.290e+03 6.544e+03 -1.420 0.155957
## Exterior1stImStucc -2.709e+04 2.529e+04 -1.071 0.284345
## Exterior1stMetalSd -5.972e+03 6.352e+03 -0.940 0.347322
## Exterior1stPlywood -1.323e+04 6.869e+03 -1.927 0.054235 .
## Exterior1stStone -2.866e+04 2.005e+04 -1.429 0.153120
## Exterior1stStucco -2.523e+03 8.012e+03 -0.315 0.752860
## Exterior1stVinylSd -3.637e+03 6.405e+03 -0.568 0.570192
## Exterior1stWd Sdng -8.038e+03 6.344e+03 -1.267 0.205431
## Exterior1stWdShing -7.834e+03 7.923e+03 -0.989 0.322969
## BsmtQualFa -2.149e+04 6.141e+03 -3.499 0.000483 ***
## BsmtQualGd -2.901e+04 3.237e+03 -8.962 < 2e-16 ***
## BsmtQualTA -2.518e+04 4.039e+03 -6.234 6.11e-10 ***
## PoolArea 9.275e+01 1.804e+01 5.142 3.13e-07 ***
## LotConfigCulDSac 7.456e+03 3.152e+03 2.366 0.018150 *
## LotConfigFR2 -7.694e+03 4.035e+03 -1.907 0.056791 .
## LotConfigFR3 -1.909e+04 1.293e+04 -1.477 0.140046
## LotConfigInside -1.357e+03 1.768e+03 -0.768 0.442904
## ScreenPorch 2.100e+01 1.245e+01 1.687 0.091807 .
## FoundationCBlock 1.480e+03 3.167e+03 0.467 0.640461
## FoundationPConc 3.313e+03 3.462e+03 0.957 0.338749
## FoundationSlab 1.185e+04 7.626e+03 1.554 0.120503
## FoundationStone 1.037e+04 1.075e+04 0.965 0.334650
## FoundationWood -1.760e+04 1.474e+04 -1.194 0.232753
## SaleTypeCon 2.680e+04 1.819e+04 1.474 0.140855
## SaleTypeConLD 8.499e+03 9.723e+03 0.874 0.382212
## SaleTypeConLI 3.772e+03 1.180e+04 0.320 0.749289
## SaleTypeConLw 2.598e+03 1.208e+04 0.215 0.829730
## SaleTypeCWD 6.664e+03 1.319e+04 0.505 0.613552
## SaleTypeNew 3.049e+04 1.579e+04 1.931 0.053649 .
## SaleTypeOth 1.527e+04 1.486e+04 1.028 0.304280
## SaleTypeWD -7.195e+02 4.243e+03 -0.170 0.865378
## BsmtFullBath 9.024e+01 1.804e+03 0.050 0.960118
## LandSlopeMod 3.513e+03 3.536e+03 0.993 0.320684
## LandSlopeSev -3.149e+04 1.015e+04 -3.103 0.001956 **
## GarageArea 1.986e+01 7.569e+00 2.624 0.008786 **
## GarageTypeAttchd 2.655e+04 1.073e+04 2.475 0.013452 *
## GarageTypeBasment 2.080e+04 1.215e+04 1.712 0.087092 .
## GarageTypeBuiltIn 3.060e+04 1.113e+04 2.749 0.006061 **
## GarageTypeCarPort 2.631e+04 1.338e+04 1.966 0.049533 *
## GarageTypeDetchd 2.768e+04 1.065e+04 2.599 0.009459 **
## GarageQualFa -1.111e+05 2.950e+04 -3.767 0.000173 ***
## GarageQualGd -1.085e+05 3.029e+04 -3.582 0.000353 ***
## GarageQualPo -1.323e+05 3.414e+04 -3.875 0.000112 ***
## GarageQualTA -1.110e+05 2.924e+04 -3.795 0.000154 ***
## LowQualFinSF -5.826e+01 1.555e+01 -3.746 0.000188 ***
## WoodDeckSF 1.011e+01 5.882e+00 1.719 0.085876 .
## BsmtFinSF2 NA NA NA NA
## ExterCondFa -9.153e+03 1.823e+04 -0.502 0.615618
## ExterCondGd -1.403e+04 1.733e+04 -0.810 0.418301
## ExterCondPo -4.359e+03 3.201e+04 -0.136 0.891718
## ExterCondTA -1.148e+04 1.730e+04 -0.664 0.507060
## EnclosedPorch 1.979e+00 1.242e+01 0.159 0.873401
## X1stFlrSF 1.108e+00 4.596e+00 0.241 0.809492
## HalfBath -7.676e+00 1.863e+03 -0.004 0.996712
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 570313259)
##
## Null deviance: 9.2079e+12 on 1459 degrees of freedom
## Residual deviance: 7.4597e+11 on 1308 degrees of freedom
## AIC: 33725
##
## Number of Fisher Scoring iterations: 2
The variables which p-values are greater than the significance level of 5% are considered to be statistically insignificant. Since statistically insignificant variables don’t affect the outcome they weren’t included to the equation that calculates predicted ‘SalePrice’.
Let’s perform likelihood ratio test to confirm that the full model is in favor of the alternative reduced model. Removing predictor variables from a model will almost always make the model fit less well, but it is necessary to test whether the observed difference in model fit is statistically significant.
The following hypothesis were stated:
The Null Hypothesis: reduced model is true.
The Alternative Hypothesis: reduced model is false.
alternative_model <- glm(formula =SalePrice ~ OverallQual + GrLivArea + Neighborhood + BsmtFinSF1 +
OverallCond + GarageCars + TotalBsmtSF + RoofMatl + YearBuilt +
LotArea + MSZoning + Condition2 + KitchenQual + SaleCondition +
Functional + BsmtUnfSF + Condition1 + CentralAir + Fireplaces +
Heating + HeatingQC + BsmtExposure + GarageCond + KitchenAbvGr +
YearRemodAdd + Exterior1st + BsmtQual + PoolArea + LotConfig +
ScreenPorch + Foundation + SaleType + BsmtFullBath, data = data)
anova(final_model, alternative_model, test ="Chisq")
## Analysis of Deviance Table
##
## Model 1: SalePrice ~ OverallQual + GrLivArea + Neighborhood + BsmtFinSF1 +
## OverallCond + GarageCars + TotalBsmtSF + RoofMatl + YearBuilt +
## LotArea + MSZoning + Condition2 + KitchenQual + SaleCondition +
## Functional + BsmtUnfSF + Condition1 + CentralAir + Fireplaces +
## Heating + HeatingQC + BsmtExposure + GarageCond + KitchenAbvGr +
## YearRemodAdd + Exterior1st + BsmtQual + PoolArea + LotConfig +
## ScreenPorch + Foundation + SaleType + BsmtFullBath + LandSlope +
## GarageArea + GarageType + GarageQual + LowQualFinSF + WoodDeckSF +
## BsmtFinSF2 + ExterCond + EnclosedPorch + X1stFlrSF + HalfBath
## Model 2: SalePrice ~ OverallQual + GrLivArea + Neighborhood + BsmtFinSF1 +
## OverallCond + GarageCars + TotalBsmtSF + RoofMatl + YearBuilt +
## LotArea + MSZoning + Condition2 + KitchenQual + SaleCondition +
## Functional + BsmtUnfSF + Condition1 + CentralAir + Fireplaces +
## Heating + HeatingQC + BsmtExposure + GarageCond + KitchenAbvGr +
## YearRemodAdd + Exterior1st + BsmtQual + PoolArea + LotConfig +
## ScreenPorch + Foundation + SaleType + BsmtFullBath
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 1308 7.4597e+11
## 2 1329 7.7571e+11 -21 -2.9738e+10 0.0001823 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The results of Likelihood Ratio Test shows that null hypothesis is regected (as p-value is less than significance level of 5%). It would provide evidence that the full model is in favor of the reduced model.
The optimal model is described by the following equation:
data$SalePrice_pred <- -1.510e+06 + 6.690e+03*data$OverallQual + 5.626e+01*data$GrLivArea
+ 1.813e+04*data$Neighborhood_BrkSide + 2.898e+04*data$Neighborhood_Crawfor + 5.146e+04*data$Neighborhood_NoRidge + 3.147e+04*data$Neighborhood_NridgHt + 1.786e+04*data$Neighborhood_Somerst + 4.429e+04*data$Neighborhood_StoneBr + 1.332e+01*data$BsmtFinSF1 + 5.883e+03*data$OverallCond + 2.445e+01*data$TotalBsmtSF + 6.733e+05*data$RoofMatl_CompShg + 7.387e+05*data$RoofMatl_Membran + 7.043e+05*data$RoofMatl_Metal + 6.717e+05*data$RoofMatl_Roll + 6.666e+05*data$RoofMatl_Tar_Grv + 6.948e+05*data$RoofMatl_WdShake + 7.256e+05*data$RoofMatl_WdShngl + 3.132e+02*data$YearBuilt + 6.026e-01*data$LotArea + 3.792e+04*data$MSZoning_FV + 3.662e+04*data$MSZoning_RH + 3.761e+04*data$MSZoning_RL + 2.636e+04*data$MSZoning_RM - 2.177e+05*data$Condition2_PosN - 2.507e+04*data$KitchenQual_Fa - 3.207e+04*data$KitchenQual_Gd - 2.960e+04*data$KitchenQual_TA + 7.095e+03*data$SaleCondition_Normal + 2.013e+04*data$Functional_Typ + 1.384e+04*data$Condition1_Norm + 2.686e+03*data$Fireplaces - 4.305e+03*data$HeatingQC_TA + 1.704e+04*data$BsmtExposure_Gd - 4.876e+03*data$BsmtExposure_No + 1.022e+05*data$GarageCond_Fa +1.004e+05*data$GarageCond_Gd + 1.131e+05*data$GarageCond_Po+1.063e+05*data$GarageCond_TA -1.923e+04*data$KitchenAbvGr -2.149e+04*data$BsmtQual_Fa - 2.901e+04*data$BsmtQual_Gd-2.518e+04*data$BsmtQual_TA + 9.275e+01*data$PoolArea + 7.456e+03*data$LotConfigCulDSac + 1.986e+01*data$GarageArea +2.655e+04*data$GarageType_Attchd +3.060e+04*data$GarageType_BuiltIn + 2.631e+04*data$GarageType_CarPort + 2.768e+04*data$GarageType_Detchd - 1.111e+058*data$GarageQual_Fa - 1.085e+05*data$GarageQualGd - 1.323e+05*data$GarageQual_Po -1.110e+05*data$GarageQual_TA - 5.826e+01*data$LowQualFinSF
The intercept of -1.510e+06 specifies that the ‘SalePrice’ equals to -1.510e+06 when all other variables equal to 0 (which looks unrealistic since the price can be negative). Also, for example, the slope coefficient of 9.275e+01 specifies that each sq feet increase in ‘PoolArea’ increases the home price by 9.275e+01 while keeping all other variables constant. In addition, for instance, the price is increased by 7.043e+05 if house roof made from metal.
The equation shows that not only quantitative variables affects ‘SalePrice’ but also categorical variables. In order to include categorical variables to a model the dummy variables were created.
#Neighborhood
data$Neighborhood_Crawfor <- ifelse(data$Neighborhood == "Crawfor",1,0)
data$Neighborhood_Edwards <- ifelse(data$Neighborhood == "Edwards",1,0)
data$Neighborhood_MeadowV <- ifelse(data$Neighborhood == "MeadowV",1,0)
data$Neighborhood_Mitchel <- ifelse(data$Neighborhood == "Mitchel",1,0)
data$Neighborhood_NWAmes <- ifelse(data$Neighborhood == "NWAmes",1,0)
data$Neighborhood_StoneBr <- ifelse(data$Neighborhood == "StoneBr",1,0)
data$Neighborhood_NoRidge <- ifelse(data$Neighborhood == "NoRidge",1,0)
data$Neighborhood_NridgHt <- ifelse(data$Neighborhood == "NridgHt",1,0)
data$Neighborhood_BrkSide <- ifelse(data$Neighborhood == "BrkSide",1,0)
data$Neighborhood_Somerst <- ifelse(data$Neighborhood == "Somerst",1,0)
#Roofmatl
data$RoofMatl_CompShg <- ifelse(data$RoofMatl == "CompShg",1,0)
data$RoofMatl_Membran <- ifelse(data$RoofMatl == "Membran",1,0)
data$RoofMatl_Metal <- ifelse(data$RoofMatl == "Metal",1,0)
data$RoofMatl_Roll <- ifelse(data$RoofMatl == "Roll",1,0)
data$RoofMatl_Tar_Grv <- ifelse(data$RoofMatl == "Tar&Grv",1,0)
data$RoofMatl_WdShake <- ifelse(data$RoofMatl == "WdShake",1,0)
data$RoofMatl_WdShngl <- ifelse(data$RoofMatl == "WdShngl",1,0)
data$RoofMatl_WdShngl <- ifelse(data$RoofMatl == "WdShngl",1,0)
#MSZoning
data$MSZoning_FV <- ifelse(data$MSZoning == "FV",1,0)
data$MSZoning_RH <- ifelse(data$MSZoning == "RH",1,0)
data$MSZoning_RL <- ifelse(data$MSZoning == "RL",1,0)
data$MSZoning_RM <- ifelse(data$MSZoning == "RM",1,0)
#Condition2
data$Condition2_PosA <- ifelse(data$Condition2 == "PosA",1,0)
data$Condition2_PosN <- ifelse(data$Condition2 == "PosN",1,0)
#KitchenQual
data$KitchenQual_Fa <- ifelse(data$KitchenQual == "Fa",1,0)
data$KitchenQual_Gd <- ifelse(data$KitchenQual == "Gd",1,0)
data$KitchenQual_TA <- ifelse(data$KitchenQual == "Fa",1,0)
#SaleCondition
data$SaleCondition_Normal <- ifelse(data$SaleCondition == "Normal",1,0)
data$SaleCondition_Partial <- ifelse(data$SaleCondition == "Partial",1,0)
#Functional
data$Functional_Maj2 <- ifelse(data$Functional == "Maj2",1,0)
data$Functional_Typ <- ifelse(data$Functional == "Typ",1,0)
#KitchenQual
data$KitchenQual_TA <- ifelse(data$KitchenQual == "Fa",1,0)
#Functional
data$Functional_Sev <- ifelse(data$Functional == "Sev",1,0)
data$Functional_Typ <- ifelse(data$Functional == "Typ",1,0)
#GarageQual
data$GarageQual_Fa <- ifelse(data$GarageQual == "Fa",1,0)
data$GarageQual_Gd <- ifelse(data$GarageQual == "Gd",1,0)
data$GarageQual_Po <- ifelse(data$GarageQual == "Po",1,0)
data$GarageQual_TA <- ifelse(data$GarageQual == "TA",1,0)
#Condition1
data$Condition1_Feedr <- ifelse(data$Condition1 == "Feedr",1,0)
data$Condition1_Norm <- ifelse(data$Condition1 == "Norm",1,0)
data$Condition1_PosN <- ifelse(data$Condition1 == "PosN",1,0)
data$Condition1_RRAe <- ifelse(data$Condition1 == "RRAe",1,0)
#CentralAir
data$CentralAir_Y <- ifelse(data$CentralAir == "Y",1,0)
#KitchenQual
data$KitchenQual_TA <- ifelse(data$KitchenQual == "Fa",1,0)
#HeatingQC
data$HeatingQC_Gd <- ifelse(data$HeatingQC == "Gd",1,0)
data$HeatingQC_TA <- ifelse(data$HeatingQC == "TA",1,0)
#BsmtExposure
data$BsmtExposure_Gd <- ifelse(data$BsmtExposure == "Gd",1,0)
data$BsmtExposure_Mn <- ifelse(data$BsmtExposure == "Mn",1,0)
data$BsmtExposure_No <- ifelse(data$BsmtExposure == "No",1,0)
#BsmtQual
data$BsmtQual_Gd <- ifelse(data$BsmtQual == "Gd",1,0)
data$BsmtQual_TA <- ifelse(data$BsmtQual == "TA",1,0)
data$BsmtQual_Fa <- ifelse(data$BsmtQual == "Fa",1,0)
#LotConfig
data$LotConfig_FR2 <- ifelse(data$LotConfig == "FR2",1,0)
data$LotConfig_Inside <- ifelse(data$LotConfig == "Inside",1,0)
data$LotConfig_CulDSac<- ifelse(data$LotConfig == "CulDSac",1,0)
#Foundation
data$Foundation_Stone <- ifelse(data$Foundation == "Stone",1,0)
data$Foundation_Wood <- ifelse(data$Foundation == "Wood",1,0)
data$Foundation_PConc <- ifelse(data$Foundation == "PConc",1,0)
#Exterior1st
data$Exterior1st_BrkComm <- ifelse(data$Exterior1st == "BrkComm",1,0)
data$Exterior1st_BrkFace <- ifelse(data$Exterior1st == "BrkFace",1,0)
#SaleType
data$SaleType_ConLD <- ifelse(data$SaleType == "ConLD",1,0)
data$SaleType_New <- ifelse(data$SaleType == "New",1,0)
#GarageQual
data$GarageQual_Gd <- ifelse(data$GarageQual == "Gd",1,0)
data$GarageQual_Po <- ifelse(data$GarageQual == "Po",1,0)
data$GarageQual_TA <- ifelse(data$GarageQual == "TA",1,0)
data$GarageQual_Fa <- ifelse(data$GarageQual == "Fa",1,0)
data$GarageQual_TA <- ifelse(data$GarageQual == "TA",1,0)
#GarageCond
data$GarageCond_Gd <- ifelse(data$GarageCond == "Gd",1,0)
data$GarageCond_Po <- ifelse(data$GarageCond == "Po",1,0)
data$GarageCond_TA <- ifelse(data$GarageCond == "TA",1,0)
data$GarageCond_Fa <- ifelse(data$GarageCond == "Fa",1,0)
data$GarageCond_Fa <- ifelse(data$GarageCond == "Fa",1,0)
#LandSlope
data$LandSlope_Sev <- ifelse(data$LandSlope == "Sev",1,0)
#Street
data$Street_Pave <- ifelse(data$Street == "Pave",1,0)
#GarageType
data$GarageType_Attchd <- ifelse(data$GarageType == "Attchd",1,0)
data$GarageType_BuiltIn <- ifelse(data$GarageType == "BuiltIn",1,0)
data$GarageType_CarPort <- ifelse(data$GarageType == "CarPort",1,0)
data$GarageType_Detchd <- ifelse(data$GarageType == "Detchd",1,0)
#LotConfig
data$LotConfig_CulDSac <- ifelse(data$LotConfig == "CulDSac",1,0)
Now we can calculate ‘SalePrice’ for training data set in order to compare in with actual ‘SalePrice’.
data$SalePrice_pred <- c()
data$SalePrice_pred <- -1.510e+06 + 6.690e+03*data$OverallQual + 5.626e+01*data$GrLivArea
+ 1.813e+04*data$Neighborhood_BrkSide + 2.898e+04*data$Neighborhood_Crawfor + 5.146e+04*data$Neighborhood_NoRidge + 3.147e+04*data$Neighborhood_NridgHt + 1.786e+04*data$Neighborhood_Somerst + 4.429e+04*data$Neighborhood_StoneBr + 1.332e+01*data$BsmtFinSF1 + 5.883e+03*data$OverallCond + 2.445e+01*data$TotalBsmtSF + 6.733e+05*data$RoofMatl_CompShg + 7.387e+05*data$RoofMatl_Membran + 7.043e+05*data$RoofMatl_Metal + 6.717e+05*data$RoofMatl_Roll + 6.666e+05*data$RoofMatl_Tar_Grv + 6.948e+05*data$RoofMatl_WdShake + 7.256e+05*data$RoofMatl_WdShngl + 3.132e+02*data$YearBuilt + 6.026e-01*data$LotArea + 3.792e+04*data$MSZoning_FV + 3.662e+04*data$MSZoning_RH + 3.761e+04*data$MSZoning_RL + 2.636e+04*data$MSZoning_RM - 2.177e+05*data$Condition2_PosN - 2.507e+04*data$KitchenQual_Fa - 3.207e+04*data$KitchenQual_Gd - 2.960e+04*data$KitchenQual_TA + 7.095e+03*data$SaleCondition_Normal + 2.013e+04*data$Functional_Typ + 1.384e+04*data$Condition1_Norm + 2.686e+03*data$Fireplaces - 4.305e+03*data$HeatingQC_TA + 1.704e+04*data$BsmtExposure_Gd - 4.876e+03*data$BsmtExposure_No + 1.022e+05*data$GarageCond_Fa +1.004e+05*data$GarageCond_Gd + 1.131e+05*data$GarageCond_Po+1.063e+05*data$GarageCond_TA -1.923e+04*data$KitchenAbvGr -2.149e+04*data$BsmtQual_Fa - 2.901e+04*data$BsmtQual_Gd-2.518e+04*data$BsmtQual_TA + 9.275e+01*data$PoolArea + 7.456e+03*data$LotConfig_CulDSac + 1.986e+01*data$GarageArea -3.149e+04*data$LandSlope_Sev +2.655e+04*data$GarageType_Attchd +3.060e+04*data$GarageType_BuiltIn + 2.631e+04*data$GarageType_CarPort + 2.768e+04*data$GarageType_Detchd - 1.111e+058*data$GarageQual_Fa - 1.085e+05*data$GarageQual_Gd - 1.323e+05*data$GarageQual_Po -1.110e+05*data$GarageQual_TA - 5.826e+01*data$LowQualFinSF
Let’s run Kolmogorov-Smirnov test to check how close predicted ‘SalePrice’ to actual ‘SalePrice’.
#sagnificance of difference
ks.test(data$SalePrice_pred,data$SalePrice)
##
## Two-sample Kolmogorov-Smirnov test
##
## data: data$SalePrice_pred and data$SalePrice
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided
Kolmogorov-Smirnov test rejected the null hypothesis that states that the difference between actual and predicted sales is not significant (as the p-value is less that the level of significance of 5%).
Let’s take a look at residual deviance, that shows how well the response is predicted by the model when the predictors are included. In addition, residual deviance is used to test whether a regression model provides an adequate fit for the data or not.
The following hypothesis were stated:
The Null Hypothesis: regression model provides an adequate fit for the data.
The Alternative Hypothesis: regression model doesn’t provide an adequate fit for the data.
#residual deviance test
p_value = 1 - pchisq(final_model$deviance,final_model$df.residual)
p_value
## [1] 0
As p-value is less than the significance level of 5% the null hypothesis is rejected. It suggests that the fitted values are significantly different from the observed values.
Another Hosmer-Lemeshow Test can be performed to compare actual outcome with predicted outcome.
The following hypothesis were stated:
The Null Hypothesis: there is no significance difference between observed and predicted values.
The Alternative Hypothesis: there is a significance difference between observed and predicted values.
hoslem.test(data$SalePrice, fitted(final_model))
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: data$SalePrice, fitted(final_model)
## X-squared = -2.1237, df = 8, p-value = 1
Our model doesn’t appear to fit well because we have significant difference between the model and the observed data as the p-value is above the significance level of 5%.
Let’s calculate SalePrice for testing data set.
data_testing$SalePrice <- c()
data$SalePrice <- -1.510e+06 + 6.690e+03*data$OverallQual + 5.626e+01*data$GrLivArea
+ 5.883e+03*data$OverallCond + 2.445e+01*data$TotalBsmtSF + 6.733e+05*data$RoofMatl_CompShg + 7.387e+05*data$RoofMatl_Membran + 7.043e+05*data$RoofMatl_Metal + 6.717e+05*data$RoofMatl_Roll + 6.666e+05*data$RoofMatl_Tar_Grv + 6.948e+05*data$RoofMatl_WdShake + 7.256e+05*data$RoofMatl_WdShngl + 3.132e+02*data$YearBuilt + 6.026e-01*data$LotArea + 3.792e+04*data$MSZoning_FV + 3.662e+04*data$MSZoning_RH + 3.761e+04*data$MSZoning_RL + 2.636e+04*data$MSZoning_RM - 2.177e+05*data$Condition2_PosN - 2.507e+04*data$KitchenQual_Fa - 3.207e+04*data$KitchenQual_Gd - 2.960e+04*data$KitchenQual_TA + 7.095e+03*data$SaleCondition_Normal + 2.013e+04*data$Functional_Typ + 1.384e+04*data$Condition1_Norm + 2.686e+03*data$Fireplaces - 4.305e+03*data$HeatingQC_TA + 1.704e+04*data$BsmtExposure_Gd - 4.876e+03*data$BsmtExposure_No + 1.022e+05*data$GarageCond_Fa +1.004e+05*data$GarageCond_Gd + 1.131e+05*data$GarageCond_Po+1.063e+05*data$GarageCond_TA -1.923e+04*data$KitchenAbvGr -2.149e+04*data$BsmtQual_Fa - 2.901e+04*data$BsmtQual_Gd-2.518e+04*data$BsmtQual_TA + 9.275e+01*data$PoolArea + 7.456e+03*data$LotConfig_CulDSac + 1.986e+01*data$GarageArea -3.149e+04*data$LandSlope_Sev +2.655e+04*data$GarageType_Attchd +3.060e+04*data$GarageType_BuiltIn + 2.631e+04*data$GarageType_CarPort + 2.768e+04*data$GarageType_Detchd - 1.111e+058*data$GarageQual_Fa - 1.085e+05*data$GarageQual_Gd - 1.323e+05*data$GarageQual_Po -1.110e+05*data$GarageQual_TA - 5.826e+01*data$LowQualFinSF