House Prices Project

#required packages 
library(MASS)
library(car)
library(corrplot)
library(PerformanceAnalytics)
library(GGally)
library(RColorBrewer)
library(VIM)
library(dplyr)
library(mice)
library(pROC)
library(caret)
library(pscl)
library(ResourceSelection)
library(stringr)
library(vcd)
library(rcompanion)
library(lmtest)
library(fitdistrplus)
library(gmodels)
library(vembedr)

Link to sumbissions.csv file(that contains SalePrice prediction for testing dataset) : https://raw.githubusercontent.com/olga0503/DATA-621/master/submission.csv

INTRODUCTION

This project analyses the dataset that contains information on approximately 1,460 residential homes in Ames, Iowa. The objective of the project is to build a regression model on the training data set to predict final price of each home.

DATA EXPLORATION

The data set contains 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa:

#read training data set
data <- read.csv(file=
"https://raw.githubusercontent.com/olga0503/DATA-605/master/train.csv",
stringsAsFactors=T, header=T)

#read testing data set
data_testing <- read.csv(file=
"https://raw.githubusercontent.com/olga0503/DATA-605/master/test.csv",
stringsAsFactors=T, header=T)

#display first six entries
#head(data)
#head(data_testing)

#find dimentions
dim(data)

## [1] 1460   81

‘SalePrice’ - the property’s sale price in dollars. This is the target variable that you’re trying to predict.

‘MSSubClass’ - The building class

‘MSZoning’ - The general zoning classification

‘LotFrontage’ - Linear feet of street connected to property

‘LotArea’ - Lot size in square feet

‘Street’ - Type of road access

‘Alley’ - Type of alley access

‘LotShape’ - General shape of property

‘LandContour’ - Flatness of the property

‘Utilities’ - Type of utilities available

‘LotConfig’ - Lot configuration

‘LandSlope’ - Slope of property

‘Neighborhood’ - Physical locations within Ames city limits

‘Condition1’ - Proximity to main road or railroad

‘Condition2’ - Proximity to main road or railroad (if a second is present)

‘BldgType’- Type of dwelling

‘HouseStyle’ - Style of dwelling

‘OverallQual’ - Overall material and finish quality

‘OverallCond’ - Overall condition rating

‘YearBuilt’ - Original construction date

‘YearRemodAdd’ - Remodel date

‘RoofStyle’ - Type of roof

‘RoofMatl’ - Roof material

‘Exterior1st’ - Exterior covering on house

‘Exterior2nd’ - Exterior covering on house (if more than one material)

‘MasVnrType’ - Masonry veneer type

‘MasVnrArea’ - Masonry veneer area in square feet

‘ExterQual’ - Exterior material quality

‘ExterCond’ - Present condition of the material on the exterior

‘Foundation’- Type of foundation

‘BsmtQual’ - Height of the basement

‘BsmtCond’ - General condition of the basement

‘BsmtExposure’ - Walkout or garden level basement walls

‘BsmtFinType1’ - Quality of basement finished area

‘BsmtFinSF1’ - Type 1 finished square feet

‘BsmtFinType2’ - Quality of second finished area (if present)

‘BsmtFinSF2’ - Type 2 finished square feet

‘BsmtUnfSF’ - Unfinished square feet of basement area

‘TotalBsmtSF’ - Total square feet of basement area

‘Heating’ - Type of heating

‘HeatingQC’ - Heating quality and condition

‘CentralAir’ - Central air conditioning

‘Electrical’ - Electrical system

‘1stFlrSF’ - First Floor square feet

‘2ndFlrSF’ - Second floor square feet

‘LowQualFinSF’ - Low quality finished square feet (all floors)

‘GrLivArea’ - Above grade (ground) living area square feet

‘BsmtFullBath’ - Basement full bathrooms

‘BsmtHalfBath’ - Basement half bathrooms

‘FullBath’ - Full bathrooms above grade

‘HalfBath’ - Half baths above grade

‘Bedroom’ - Number of bedrooms above basement level

‘Kitchen’ - Number of kitchens

‘KitchenQual’ - Kitchen quality

‘TotRmsAbvGrd’ - Total rooms above grade (does not include bathrooms)

‘Functional’ - Home functionality rating

‘Fireplaces’ - Number of fireplaces

‘FireplaceQu’ - Fireplace quality

‘GarageType’ - Garage location

‘GarageYrBlt’ - Year garage was built

‘GarageFinish’ - Interior finish of the garage

‘GarageCars’ - Size of garage in car capacity

‘GarageArea’ - Size of garage in square feet

‘GarageQual’ - Garage quality

‘GarageCond’ - Garage condition

‘PavedDrive’ - Paved driveway

‘WoodDeckSF’ - Wood deck area in square feet

‘OpenPorchSF’ - Open porch area in square feet

‘EnclosedPorch’ - Enclosed porch area in square feet

‘3SsnPorch’ - Three season porch area in square feet

‘ScreenPorch’ - Screen porch area in square feet

‘PoolArea’ - Pool area in square feet

‘PoolQC’ - Pool quality

‘Fence’ - Fence quality

‘MiscFeature’ - Miscellaneous feature not covered in other categories

‘MiscVal’ - Value of miscellaneous feature

‘MoSold’ - Month Sold

‘YrSold’ - Year Sold

‘SaleType’ - Type of sale

‘SaleCondition’ - Condition of sale

The data set contains the variables of different types. The response variable ‘SalePrices’ is continuous quantitative variable. For example, the variables ‘Neighborhood’ and ‘RoofStyle’ are nominal categorical variables.

levels(factor(data$RoofStyle))

## [1] "Flat"    "Gable"   "Gambrel" "Hip"     "Mansard" "Shed"

Whereas the variable ‘SaleCondition’ and ‘GarageQual’ are ordinal categorical variables.

levels(factor(data$SaleCondition))

## [1] "Abnorml" "AdjLand" "Alloca"  "Family"  "Normal"  "Partial"

While the variables ‘CentralAir’ is a binary categorical variables that can take only “Y” or “N”.

levels(factor(data$CentralAir))

## [1] "N" "Y"

The variable ‘YrSold’ takes only 5 different values. So that, it was converted to categorical variable

levels(factor(data$YrSold))

## [1] "2006" "2007" "2008" "2009" "2010"

#convert the variables 'YrSold' from integer to factor since the variable can take only five different values
data <- data %>% mutate(YrSold=as.factor(YrSold))

## Warning: package 'bindrcpp' was built under R version 3.4.4

data_testing <- data_testing %>% mutate(YrSold=as.factor(YrSold))

Before performing summary statistics let’s start out by getting a high level view of what data is missing. The graph below depicts missing values for each variable in the data set.

#build function that counts missing values
count_nas <- function(data){
  
variable_name_column <- c()
number_missing_column <- c()

for (i in 2:ncol(data)){
  variable_name <- colnames(data[i])
  number_missing <- sum(is.na(data[i]))
  variable_name_column <- c(variable_name_column,variable_name)
  number_missing_column <- c(number_missing_column,number_missing)
}

missing_table <- data.frame(variable_name_column,number_missing_column)
missing_table <- missing_table %>% mutate(percentage=round(number_missing_column*100/nrow(data),2)) %>% arrange(desc(percentage))
missing_table
}


#chart for missing values
aggr(data[-1], prop = T, numbers = T, cex.axis=.5, cex.numbers = 0.1,
     ylab=c("Proportion of missingness","Missingness Pattern"),
     labels=names(data[-1]))

#count missing values
count_nas(data[2:length(data)-1])

##    variable_name_column number_missing_column percentage
## 1                PoolQC                  1453      99.52
## 2           MiscFeature                  1406      96.30
## 3                 Alley                  1369      93.77
## 4                 Fence                  1179      80.75
## 5           FireplaceQu                   690      47.26
## 6           LotFrontage                   259      17.74
## 7            GarageType                    81       5.55
## 8           GarageYrBlt                    81       5.55
## 9          GarageFinish                    81       5.55
## 10           GarageQual                    81       5.55
## 11           GarageCond                    81       5.55
## 12         BsmtExposure                    38       2.60
## 13         BsmtFinType2                    38       2.60
## 14             BsmtQual                    37       2.53
## 15             BsmtCond                    37       2.53
## 16         BsmtFinType1                    37       2.53
## 17           MasVnrType                     8       0.55
## 18           MasVnrArea                     8       0.55
## 19           Electrical                     1       0.07
## 20           MSSubClass                     0       0.00
## 21             MSZoning                     0       0.00
## 22              LotArea                     0       0.00
## 23               Street                     0       0.00
## 24             LotShape                     0       0.00
## 25          LandContour                     0       0.00
## 26            Utilities                     0       0.00
## 27            LotConfig                     0       0.00
## 28            LandSlope                     0       0.00
## 29         Neighborhood                     0       0.00
## 30           Condition1                     0       0.00
## 31           Condition2                     0       0.00
## 32             BldgType                     0       0.00
## 33           HouseStyle                     0       0.00
## 34          OverallQual                     0       0.00
## 35          OverallCond                     0       0.00
## 36            YearBuilt                     0       0.00
## 37         YearRemodAdd                     0       0.00
## 38            RoofStyle                     0       0.00
## 39             RoofMatl                     0       0.00
## 40          Exterior1st                     0       0.00
## 41          Exterior2nd                     0       0.00
## 42            ExterQual                     0       0.00
## 43            ExterCond                     0       0.00
## 44           Foundation                     0       0.00
## 45           BsmtFinSF1                     0       0.00
## 46           BsmtFinSF2                     0       0.00
## 47            BsmtUnfSF                     0       0.00
## 48          TotalBsmtSF                     0       0.00
## 49              Heating                     0       0.00
## 50            HeatingQC                     0       0.00
## 51           CentralAir                     0       0.00
## 52            X1stFlrSF                     0       0.00
## 53            X2ndFlrSF                     0       0.00
## 54         LowQualFinSF                     0       0.00
## 55            GrLivArea                     0       0.00
## 56         BsmtFullBath                     0       0.00
## 57         BsmtHalfBath                     0       0.00
## 58             FullBath                     0       0.00
## 59             HalfBath                     0       0.00
## 60         BedroomAbvGr                     0       0.00
## 61         KitchenAbvGr                     0       0.00
## 62          KitchenQual                     0       0.00
## 63         TotRmsAbvGrd                     0       0.00
## 64           Functional                     0       0.00
## 65           Fireplaces                     0       0.00
## 66           GarageCars                     0       0.00
## 67           GarageArea                     0       0.00
## 68           PavedDrive                     0       0.00
## 69           WoodDeckSF                     0       0.00
## 70          OpenPorchSF                     0       0.00
## 71        EnclosedPorch                     0       0.00
## 72           X3SsnPorch                     0       0.00
## 73          ScreenPorch                     0       0.00
## 74             PoolArea                     0       0.00
## 75              MiscVal                     0       0.00
## 76               MoSold                     0       0.00
## 77               YrSold                     0       0.00
## 78             SaleType                     0       0.00
## 79        SaleCondition                     0       0.00

count_nas(data_testing[2:length(data)-1])

##    variable_name_column number_missing_column percentage
## 1                PoolQC                  1456      99.79
## 2           MiscFeature                  1408      96.50
## 3                 Alley                  1352      92.67
## 4                 Fence                  1169      80.12
## 5           FireplaceQu                   730      50.03
## 6           LotFrontage                   227      15.56
## 7           GarageYrBlt                    78       5.35
## 8          GarageFinish                    78       5.35
## 9            GarageQual                    78       5.35
## 10           GarageCond                    78       5.35
## 11           GarageType                    76       5.21
## 12             BsmtCond                    45       3.08
## 13             BsmtQual                    44       3.02
## 14         BsmtExposure                    44       3.02
## 15         BsmtFinType1                    42       2.88
## 16         BsmtFinType2                    42       2.88
## 17           MasVnrType                    16       1.10
## 18           MasVnrArea                    15       1.03
## 19             MSZoning                     4       0.27
## 20            Utilities                     2       0.14
## 21         BsmtFullBath                     2       0.14
## 22         BsmtHalfBath                     2       0.14
## 23           Functional                     2       0.14
## 24          Exterior1st                     1       0.07
## 25          Exterior2nd                     1       0.07
## 26           BsmtFinSF1                     1       0.07
## 27           BsmtFinSF2                     1       0.07
## 28            BsmtUnfSF                     1       0.07
## 29          TotalBsmtSF                     1       0.07
## 30          KitchenQual                     1       0.07
## 31           GarageCars                     1       0.07
## 32           GarageArea                     1       0.07
## 33             SaleType                     1       0.07
## 34           MSSubClass                     0       0.00
## 35              LotArea                     0       0.00
## 36               Street                     0       0.00
## 37             LotShape                     0       0.00
## 38          LandContour                     0       0.00
## 39            LotConfig                     0       0.00
## 40            LandSlope                     0       0.00
## 41         Neighborhood                     0       0.00
## 42           Condition1                     0       0.00
## 43           Condition2                     0       0.00
## 44             BldgType                     0       0.00
## 45           HouseStyle                     0       0.00
## 46          OverallQual                     0       0.00
## 47          OverallCond                     0       0.00
## 48            YearBuilt                     0       0.00
## 49         YearRemodAdd                     0       0.00
## 50            RoofStyle                     0       0.00
## 51             RoofMatl                     0       0.00
## 52            ExterQual                     0       0.00
## 53            ExterCond                     0       0.00
## 54           Foundation                     0       0.00
## 55              Heating                     0       0.00
## 56            HeatingQC                     0       0.00
## 57           CentralAir                     0       0.00
## 58           Electrical                     0       0.00
## 59            X1stFlrSF                     0       0.00
## 60            X2ndFlrSF                     0       0.00
## 61         LowQualFinSF                     0       0.00
## 62            GrLivArea                     0       0.00
## 63             FullBath                     0       0.00
## 64             HalfBath                     0       0.00
## 65         BedroomAbvGr                     0       0.00
## 66         KitchenAbvGr                     0       0.00
## 67         TotRmsAbvGrd                     0       0.00
## 68           Fireplaces                     0       0.00
## 69           PavedDrive                     0       0.00
## 70           WoodDeckSF                     0       0.00
## 71          OpenPorchSF                     0       0.00
## 72        EnclosedPorch                     0       0.00
## 73           X3SsnPorch                     0       0.00
## 74          ScreenPorch                     0       0.00
## 75             PoolArea                     0       0.00
## 76              MiscVal                     0       0.00
## 77               MoSold                     0       0.00
## 78               YrSold                     0       0.00
## 79        SaleCondition                     0       0.00

The chart shows that the variable ‘PoolQC’, ‘MiscFeature’ and ‘Alley’ are missing more than 90% of data while the variable ‘Fence’ is missing about 80% of data. The variable ‘FireplaceQu’ is missing a little bit less than 50% of its data whereas the variable ‘LotFrontage’ is missing about 18% of its data. All other variables are either missing less that 5% of the data or don’t contain missing values.

DATA PREPARATION

Since the variable ‘PoolQC’, ‘MiscFeature’ and ‘Alley’ are missing more than 90% of the data they probably lost their statistical power. That is why the variables were removed from the data set.

#remove the variables 'PoolQC', 'MiscFeature' and 'Alley' from the dataset
data <- data %>% dplyr::select(-PoolQC,-MiscFeature,-Alley)
data_testing <- data_testing %>% dplyr::select(-PoolQC,-MiscFeature,-Alley)

The missing values of the variable ‘Fence’, that is missing about 80% of its data, can be predictive of outcome. So that, it was decided to replace missing values of ‘Fence’ with “NONE”.

data <- data %>% mutate(Fence = as.factor(ifelse(is.na(Fence),"NONE",Fence)))
data_testing <- data_testing %>% mutate(Fence = as.factor(ifelse(is.na(Fence),"NONE",Fence)))

Instead of throwing out missing values of the remaining variables containing missing data they were reasonably imputed. The method cart’ (classification and regression trees) that is also known as decision trees was chosen for multiple imputation. So that, now we’ve got a lot more data to feed to a model.

#apply multiple imputation for training data set
#exclude variable'ID'
exclude <- c('Id')
index <- data[1]
include <- setdiff(names(data), exclude)
data_include <- data[include]

#imputation with mean
imp.data <- mice(data_include, m=5, method='cart', printFlag=FALSE)

#merge imputed values with data frame
data <- mice::complete(imp.data)
data <- data.frame(index,data)

#confirm no NAs
count_nas(data)

#write data
write.csv(data, file = "/Users/olga/downloads/data_imp.csv", row.names = FALSE)

#apply multiple imputation for testing data set
#exclude variable'ID'
exclude <- c('Id')
index <- data_testing[1]
include <- setdiff(names(data_testing), exclude)
data_testing_include <- data_testing[include]

#imputation with mean
imp.data_testing <- mice(data_testing_include, m=5, method='cart', printFlag=FALSE)

#merge imputed values with data frame
data_testing <- mice::complete(imp.data_testing)
data_testing <- data.frame(index,data_testing)

#confirm no NAs
count_nas(data_testing)

#write data
write.csv(data_testing, file = "/Users/olga/downloads/data_testing_imp.csv", row.names = FALSE)

Pick one of the quantitative independent variables from the training data set (train.csv) , and define that variable as X. Make sure this variable is skewed to the right!

Pick the dependent variable and define it as Y.

Let’s define the independent variable ‘GrLivArea’ as X and the dependent variable ‘SalePrice’ as Y.

X <- data$GrLivArea
Y <- data$SalePrice

XY_df <- data.frame(X,Y)

#analyse distribution of X
par(mfrow=c(1,2))
plotNormalHistogram(X,main="GrLivArea")
qqnorm(X)
qqline(X)

#analyse distribution of Y
par(mfrow=c(1,2))
plotNormalHistogram(Y,main="SalePrice")
qqnorm(Y)
qqline(Y)

Based on histograms it can be concluded that the distributions of both variables are bell-shaped and unimodal. However, they are not symmetric and skewed to the right. It means that the means of both distributions are greater that their medians.

summary(XY_df)

##        X              Y         
##  Min.   : 334   Min.   : 34900  
##  1st Qu.:1130   1st Qu.:129975  
##  Median :1464   Median :163000  
##  Mean   :1515   Mean   :180921  
##  3rd Qu.:1777   3rd Qu.:214000  
##  Max.   :5642   Max.   :755000

Summary statistics confirmed that the mean is greater than the median for both distributions. According Q-Q plots, most of the points on the top and the bottom deviates from the line. The points form a curve instead of a straight line.

Probability

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 1st quartile of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

\(P(X>x | Y>y)\)

#calculate 3rd quartile of X 
X_3d_quantile <- quantile(X, probs=0.75) 
X_3d_quantile

##     75% 
## 1776.75

#calculate 2nd quartile of Y
Y_2d_quantile <- quantile(Y, probs=0.50) 


P1 <- nrow(subset(XY_df, X > X_3d_quantile && Y > Y_2d_quantile))/nrow(subset(XY_df, Y > Y_2d_quantile))
P1

## [1] 0

\(P(X>x, Y>y)\)

#calculate 3rd quartile of X 
X_3d_quantile <- quantile(X, probs=0.75) 
X_3d_quantile

##     75% 
## 1776.75

#calculate 2nd quartile of Y
Y_2d_quantile <- quantile(Y, probs=0.50) 


P2 <- nrow(subset(XY_df, X > X_3d_quantile,data=data && Y > Y_2d_quantile,data=data))/nrow(data)
P2

## [1] 0.25

\(P(X<x | Y>y)\)

#calculate 3rd quartile of X 
X_3d_quantile <- quantile(X, probs=0.75) 
X_3d_quantile

##     75% 
## 1776.75

#calculate 2nd quartile of Y
Y_2d_quantile <- quantile(Y, probs=0.50) 


P3 <- nrow(subset(XY_df, X < X_3d_quantile && Y > Y_2d_quantile))/nrow(subset(XY_df, Y > Y_2d_quantile))
P3

## [1] 2.005495

#find P(X<=x,Y<=y)
P4 <- nrow(subset(XY_df, X <= X_3d_quantile,data=data && Y <= Y_2d_quantile,data=data))/nrow(data)

#find P(X>x,Y<=y)
P5 <- nrow(subset(XY_df, X > X_3d_quantile,data=data && Y <= Y_2d_quantile,data=data))/nrow(data)

#find P(X<=x,Y>y)
P6 <- nrow(subset(XY_df, X <= X_3d_quantile,data=data && Y > Y_2d_quantile,data=data))/nrow(data)

first.column <- c("<=3rd quartile",">3rd quartile","Total")
second.column <- c(P4,P5, P4+P5)
third.column <- c(P6,P2,P6+P2)
forth.column <- c(P4+P6,P2+P5,P4+P6+P2+P5)

table <- data.frame(first.column,second.column,third.column,forth.column)
colnames(table) <- c("x/y","<=2d quartile",">2d quartile","Total")
table

##              x/y <=2d quartile >2d quartile Total
## 1 <=3rd quartile          0.75         0.75   1.5
## 2  >3rd quartile          0.25         0.25   0.5
## 3          Total          1.00         1.00   2.0

Does splitting the training data in this fashion make them independent? Let A be the new variable counting those observations above the 1st quartile for X, and let B be the new variable counting those observations above the 1st quartile for Y. Does P(A|B)=P(A)P(B)? Check mathematically, and then evaluate by running a Chi Square test for association.

Two discrete random variables A and B are called independent if: \(P(A,B) = P(A)P(B)\) \(P(A|B)=\frac{P(A,B)}{P(B)}=\frac{P(A)P(B)}{P(B)}=P(A)\)

#calculate 1st quartile of X 
X_1st_quantile <- quantile(X, probs=0.25) 
X_1st_quantile

##    25% 
## 1129.5

#calculate 1st quartile of Y
Y_1st_quantile <- quantile(Y, probs=0.25) 
Y_1st_quantile

##    25% 
## 129975

#define variables A and B
A <- subset(XY_df, X > X_1st_quantile)
B <- subset(XY_df, Y > Y_1st_quantile)

P_A <- nrow(A)/nrow(data)
P_B <- nrow(B)/nrow(data)
P_A_and_P_B <-nrow(subset(XY_df, X > X_3d_quantile && Y > Y_2d_quantile))/nrow(data)

cat("P(A)*P(B) =",P_A*P_B,"\n")

## P(A)*P(B) = 0.5625

P_A_given_P_B = P1

cat("P(A)|P(B) =",P_A_given_P_B,"\n")

## P(A)|P(B) = 0

cat("Does P(A,B) = P(A)P(B)?",P_A*P_B==P_A_and_P_B,"\n")

## Does P(A,B) = P(A)P(B)? FALSE

cat("Does P(A)|P(B)=P(A)?",P_A_given_P_B==P_A)

## Does P(A)|P(B)=P(A)? FALSE

Two variables are NOT independent as the condition \(P(A,B) = P(A)P(B)\) is not met.

Let’s state the following hypothesis for a Chi Square test.

\(H_0\) : Variable A is independent of variable B \(H_A\) : Variable A is NOT independent of variable B.

chisq.test(A,B)

## 
##  Pearson's Chi-squared test
## 
## data:  A
## X-squared = 179460, df = 1094, p-value < 2.2e-16

Since p-value is less that the level of significance of 5% the null hypothesis is rejected. The variables A and B are NOT independent.

Descriptive and Inferential Statistics

Provide univariate descriptive statistics and appropriate plots for the training data set. Provide a scatterplot of X and Y. Derive a correlation matrix for any THREE quantitative variables in the dataset. Test the hypotheses that the correlations between each pairwise set of variables is 0 and provide a 92% confidence interval. Discuss the meaning of your analysis. Would you be worried about familywise error? Why or why not?

Let’s select three independent variables - ‘OpenPorchSF’,‘WoodDeckSF’ and ‘Garage Area’.

#create data set that contains one dependent variable and three independent variables
data_set <- data %>% dplyr::select(SalePrice,OpenPorchSF,WoodDeckSF,GarageArea)

#create scatterplots
par(mfrow=c(1,3))
for (i in 2:ncol(data_set)) {
  if (is.numeric(data_set[,i]) == "TRUE"){
    plot(data_set$SalePrice ~ data_set[,i],main=names(data_set)[i], xlab=names(data_set)[i],col="blue")
    reg_line <- lm(data_set$SalePrice ~ data_set[,i])
    abline(reg_line,col="red")
  
}
}

The explanatory variable ‘Garage Area’ has strong linear relationships with response variable ‘SalePrice’ while two other variables ‘OpenPorchSF’ and ‘WoodDeckSF’ are in week linear relationships with response variable.

#create histograms and Q-Q plots
for (i in 2:ncol(data_set)) {
par(mfrow=c(1,2))
plotNormalHistogram(data_set[,i],main=names(data_set)[i])
qqnorm(data_set[,i],main=names(data_set)[i])
qqline(data_set[,i])
}

summary(data_set)

##    SalePrice       OpenPorchSF       WoodDeckSF       GarageArea    
##  Min.   : 34900   Min.   :  0.00   Min.   :  0.00   Min.   :   0.0  
##  1st Qu.:129975   1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.: 334.5  
##  Median :163000   Median : 25.00   Median :  0.00   Median : 480.0  
##  Mean   :180921   Mean   : 46.66   Mean   : 94.24   Mean   : 473.0  
##  3rd Qu.:214000   3rd Qu.: 68.00   3rd Qu.:168.00   3rd Qu.: 576.0  
##  Max.   :755000   Max.   :547.00   Max.   :857.00   Max.   :1418.0

#correlation between variables
correlation <- corrplot(cor(data_set), type = "upper", method = "number", tl.cex = 0.7, tl.col="black",number.cex = 0.5)

correlation

##             SalePrice OpenPorchSF WoodDeckSF GarageArea
## SalePrice   1.0000000  0.31585623 0.32441344  0.6234314
## OpenPorchSF 0.3158562  1.00000000 0.05866061  0.2414347
## WoodDeckSF  0.3244134  0.05866061 1.00000000  0.2246663
## GarageArea  0.6234314  0.24143467 0.22466631  1.0000000

The correlation plot shows that neither of the independent variables are strongly correlated with another independent variables. It suggests that potentially all three independent variables can be included to the regression model. The variables that are not strongly correlated with each other will not cost model overfitting.

In order to verify the correlation between independent variables let’s state the following hypothesis:

\(H_0\): There is no correlation \(\beta_1 = 0\) \(H_a\): There is a correlation \(\beta_1 \neq 0\)

t.test(data_set$OpenPorchSF,data_set$WoodDeckSF,conf.level = 0.92,data=data_set)

## 
##  Welch Two Sample t-test
## 
## data:  data_set$OpenPorchSF and data_set$WoodDeckSF
## t = -12.825, df = 2215.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 92 percent confidence interval:
##  -54.08293 -41.08556
## sample estimates:
## mean of x mean of y 
##  46.66027  94.24452

t.test(data_set$OpenPorchSF,data_set$GarageArea,conf.level = 0.92,data=data_set)

## 
##  Welch Two Sample t-test
## 
## data:  data_set$OpenPorchSF and data_set$GarageArea
## t = -72.775, df = 1736.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 92 percent confidence interval:
##  -436.5815 -416.0583
## sample estimates:
## mean of x mean of y 
##  46.66027 472.98014

t.test(data_set$WoodDeckSF,data_set$GarageArea,conf.level = 0.92,data=data_set)

## 
##  Welch Two Sample t-test
## 
## data:  data_set$WoodDeckSF and data_set$GarageArea
## t = -58.391, df = 2355.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 92 percent confidence interval:
##  -390.0957 -367.3755
## sample estimates:
## mean of x mean of y 
##  94.24452 472.98014

The T-test showed that the p-values of all three test is less that the level of significance of 5%. So that, the null hypothesis is rejected. There are correlation between all three pairs of independent variables.

Linear Algebra and Correlation

Invert your 3 x 3 correlation matrix from above. (This is known as the precision matrix and contains variance inflation factors on the diagonal.) Multiply the correlation matrix by the precision matrix, and then multiply the precision matrix by the correlation matrix. Conduct LU decomposition on the matrix.

cotrelation_matrix <- matrix(c(correlation[2,2], correlation[2,3], correlation[2,4], 
                            correlation[3,2], correlation[3,3], correlation[3,4],
                            correlation[4,2], correlation[4,3], correlation[4,4]),
                            nrow=3,  ncol=3, byrow = TRUE)

cotrelation_matrix

##            [,1]       [,2]      [,3]
## [1,] 1.00000000 0.05866061 0.2414347
## [2,] 0.05866061 1.00000000 0.2246663
## [3,] 0.24143467 0.22466631 1.0000000

#inverse matrix
precision_matrix <- ginv(cotrelation_matrix)

#identity matrix
round(precision_matrix %*% cotrelation_matrix,2)

##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

round(cotrelation_matrix %*% precision_matrix,2)

##      [,1] [,2] [,3]
## [1,]    1    0    0
## [2,]    0    1    0
## [3,]    0    0    1

#create LU decomposition function
LU <- function(A,row_num,col_num) {

#create matrix L that consist of 0s    
L = matrix(c(0:0),nrow=row_num,  ncol=col_num, byrow = TRUE)

#assign 1s to the main diaginal
i=1

for (i in  1:3){
L[i,i]=1
} 

#assign matrix A to initial matrix U
U <- A
print("matrix A")
print(A)

#start from the first column
col=1

#iterate until col_num-1
for (col in  1:(col_num-1)){

#start to iterate from the second row  until final row 
row=col+1

  for (row in  (col+1):row_num){
    #U[row,col] = U[row,col]+x*U[col,col]    
    x=-U[row,col]/U[col,col]
    U[row,col] = 0
    
    #multiply the remaining elements of a current row by x 
    col2=col+1
    for (col2 in  (col+1):col_num){

    U[row,col2] = U[row,col2]+x*U[col,col2] 
    print("matrix U")
    print(U)   
    
  }
    
    #assign x to a corresponding cell in matrix L
    L[row,col] = -x
    
  } 
   
  
}

print("matrix L")
print(L)
#dot product of L and U should produce A
print("matrix LU=A")
print(L %*% U)
  
}

LU(cotrelation_matrix,3,3)

## [1] "matrix A"
##            [,1]       [,2]      [,3]
## [1,] 1.00000000 0.05866061 0.2414347
## [2,] 0.05866061 1.00000000 0.2246663
## [3,] 0.24143467 0.22466631 1.0000000
## [1] "matrix U"
##           [,1]       [,2]      [,3]
## [1,] 1.0000000 0.05866061 0.2414347
## [2,] 0.0000000 0.99655893 0.2246663
## [3,] 0.2414347 0.22466631 1.0000000
## [1] "matrix U"
##           [,1]       [,2]      [,3]
## [1,] 1.0000000 0.05866061 0.2414347
## [2,] 0.0000000 0.99655893 0.2105036
## [3,] 0.2414347 0.22466631 1.0000000
## [1] "matrix U"
##      [,1]       [,2]      [,3]
## [1,]    1 0.05866061 0.2414347
## [2,]    0 0.99655893 0.2105036
## [3,]    0 0.21050360 1.0000000
## [1] "matrix U"
##      [,1]       [,2]      [,3]
## [1,]    1 0.05866061 0.2414347
## [2,]    0 0.99655893 0.2105036
## [3,]    0 0.21050360 0.9417093
## [1] "matrix U"
##      [,1]       [,2]      [,3]
## [1,]    1 0.05866061 0.2414347
## [2,]    0 0.99655893 0.2105036
## [3,]    0 0.00000000 0.8972445
## [1] "matrix L"
##            [,1]      [,2] [,3]
## [1,] 1.00000000 0.0000000    0
## [2,] 0.05866061 1.0000000    0
## [3,] 0.24143467 0.2112305    1
## [1] "matrix LU=A"
##            [,1]       [,2]      [,3]
## [1,] 1.00000000 0.05866061 0.2414347
## [2,] 0.05866061 1.00000000 0.2246663
## [3,] 0.24143467 0.22466631 1.0000000

Calculus-Based Probability & Statistics

Many times, it makes sense to fit a closed form distribution to data. For the first variable that you selected which is skewed to the right, shift it so that the minimum value is above zero as necessary. Then load the MASS package and run fitdistr to fit an exponential probability density function. (See https://stat.ethz.ch/R-manual/R- devel/library/MASS/html/fitdistr.html ). Find the optimal value of λ for this distribution, and then take 1000 samples from this exponential distribution using this value (e.g., rexp(1000, λ)). Plot a histogram and compare it with a histogram of your original variable. Using the exponential pdf, find the 5th and 95th percentiles using the cumulative distribution function (CDF). Also generate a 95% confidence interval from the empirical data, assuming normality. Finally, provide the empirical 5th percentile and 95th percentile of the data. Discuss.

fit_exp <- fitdistr(data$GrLivArea, "exponential")

#optimal value of λ for this distribution
lambda <- fit_exp$estimate
sample <- rexp(1000, lambda)
lambda

##        rate 
## 0.000659864

#histogram
sample<-data.frame(as.numeric(sample))
colnames(sample)[1] <- "sample"
str(sample)

## 'data.frame':    1000 obs. of  1 variable:
##  $ sample: num  1532.8 4923.8 238.3 554.7 16.3 ...

hist(sample$sample, main="Histogram of Exponential Distribution", xlab = "GrLivArea", breaks=30)

hist(sample$sample, main="Histogram of GrLivArea", xlab = "GrLivArea", breaks=30)

The histograms of exponential distribution and original distribution of ‘GrLivArea’ are similar.

#5th percentile
qexp(.05,rate = lambda)

## [1] 77.73313

#95th percentile
qexp(.95, rate = lambda)

## [1] 4539.924

#95% confidence interval from the empirical data
ci(data$GrLivArea, 0.95)

##   Estimate   CI lower   CI upper Std. Error 
## 1515.46370 1488.48701 1542.44038   13.75245

‘GrLivArea’ falls into interval of (1488.48701,1542.44038) in 95% of the cases.

#5th percentile of GrLivArea
quantile(data$GrLivArea, .05)

##  5% 
## 848

#95th percentile of GrLivArea
quantile(data$GrLivArea, .95)

##    95% 
## 2466.1

Modeling

Since the response variable is continuous it was decided to use GLM to build the model that predicts ‘SalePrice’.

The following assumptions must be verified for linear regression.

Independence of observations. It was assumed that all observations are independent.
Linearity assumption. Linear regression needs the relationship between the independent and dependent variables to be linear. Also, outliers need to be checked.

par(mfrow=c(2,2))
#colnames <- dimnames(data)[[2]]
for (i in 2:(ncol(data)-1)) {
  if (is.numeric(data[,i]) == "TRUE"){
    plot(data$SalePrice ~ data[,i],main=names(data)[i], xlab=names(data)[i],col="blue")
    reg_line <- lm(data$SalePrice ~ data[,i])
    abline(reg_line,col="red")
    
    
}
}

#ggplot(data, aes(SalePrice,LotFrontage)) + stat_smooth(method="loess") + stat_smooth(method="lm", color="red", fill="red", alpha=.25) + geom_point()

By looking at scatter plots, it can be concluded that the variables ‘Ms Subclass’, ‘OverallCond’, ‘BsmtFinSF1’,‘LowQualFinSF’, ‘BsmtHalfBath’ are in non-linear relationships with the response variable ‘SalePrice’. Thus, these variables need to be transformed.

Normal distribution of explanatory variables. The linear regression analysis requires all predictor variables to be multivariate normal.

# histograms, density lines and normal probability plots
for (i in 2:ncol(data)) {
  if (is.numeric(data[,i]) == "TRUE"){
par(mfrow=c(1,2))
plotNormalHistogram(data[,i],main=names(data)[i])
qqnorm(data[,i],main=names(data)[i])
qqline(data[,i])
}}

Only variables ‘OverallQual’ and ‘GarageCars’ are close to normal distribution as their distributions are bell-shaped and symmetric. All other variables are either skewed to the right or skewed to the left. The variables that don’t meet normality assumption must be transformed.

Multicollinearity assumption. Generilized linear regression model assumes that there is little or no multicollinearity in the data. Multicollinearity occurs when the independent variables are too highly correlated with each other.

#correlation between variables
corrplot(cor(data[2:ncol(data)] %>% select_if(is.numeric)), type = "upper", method = "number", tl.cex = 0.7, tl.col="black",number.cex = 0.5)

According to the correlation matrix, a few variables in the data set are strongly correlated with each other. For example, ‘GarageYrBlt’ and ‘YrBlt’ are strongly correlated (as usually garage is built in the same year house is built). Only one of these two variables should be included in the model. Otherwise, the model will be overfitted.

Homoscedasticity assumption. The residuals should be equal across the regression line.

model = glm(SalePrice ~., data = data)
#Breush Pagan Test
bptest(model)

## 
##  studentized Breusch-Pagan test
## 
## data:  model
## BP = 867.76, df = 241, p-value < 2.2e-16

Breush Pagan Test p-value is less than a significance level of 5%, therefore we can reject the null hypothesis that the variance of the residuals is constant and infer that heteroscedasticity is present.

Only variables “OverallQual” and “GarageCars” meet assumptions requirments. All other variables that don’t meet either of the assumption requirements were transformed. The transformation were performed based Box-Cox method.

#create dataframe to store modified data
data_m <- data

#resolve problem with non-negative values
for (i in 2:(ncol(data_m)-1)){
  
  for (j in 1:nrow(data_m)){
if (is.numeric(data_m[j,i])==TRUE && data_m[j,i] == 0 ) {
  data_m[j,i]=data_m[j,i]+0.01}
  }
}

head(data_m)

#data transformation
for (i in 2:ncol(data_m)) {
  

  
  if (colnames(data_m[i])!="OverallQual" || colnames(data_m[i])!="GarageCars"){
  
  if (is.numeric(data_m[,i])==TRUE){
    
print(colnames(data_m[i]))   
  
Box = boxcox(data_m[,i] ~ 1,               # Transform Turbidity as a single vector
             lambda = seq(-10,10,1)      # Try values -10 to 10 by 0.1
             )
Cox = data.frame(Box$x, Box$y)            # Create a data frame with the results
Cox2 = Cox[with(Cox, order(-Cox$Box.y)),] # Order the new data frame by decreasing y
#Cox2[1,]                                  # Display the lambda with the greatest
                                          #    log likelihood
lambda = Cox2[1, "Box.x"]                 # Extract that lambda
print(lambda)
data_m[,i] = (data_m[,i] ^ lambda - 1)/lambda   # Transform the original data

#plotNormalHistogram(data[,i],main=names(data)[i])
#qqnorm(data[,i],main=names(data)[i])
#qqline(data[,i])
  }
  }

}

In order to find the best regression model it was decided to implement the stepwise approach that analyses all combination of variables and selects the best regression model based on lowest AIC (Akaike’s criterion) value. Lower values of AIC indicate the preferred model, that is, the one with the fewest parameters that still provides an adequate fit to the data.

#build glm model using stepwise approach
linear_model.null = glm(SalePrice ~ 1, data = data_m)
#linear_model.null <- glm(SalePrice ~ 1, family=gaussian(link="log"), data = data)

linear_model.full = glm(SalePrice ~ ., data = data_m)
#linear_model.full <- glm(SalePrice ~ ., data = data, family=gaussian(link="log"))
     
step(linear_model.null,
     scope = list(upper=linear_model.full),
             direction = "both",
             data = data_m)

Let’s review summary statistics for the optimal model (model with lowest AIC).

final_model <- glm(formula = SalePrice ~ OverallQual + GrLivArea + Neighborhood + 
    BsmtFinSF1 + OverallCond + GarageCars + TotalBsmtSF + RoofMatl + 
    YearBuilt + LotArea + MSZoning + Condition2 + KitchenQual + 
    SaleCondition + Functional + BsmtUnfSF + Condition1 + CentralAir + 
    Fireplaces + Heating + HeatingQC + BsmtExposure + GarageCond + 
    KitchenAbvGr + YearRemodAdd + Exterior1st + BsmtQual + PoolArea + 
    LotConfig + ScreenPorch + Foundation + SaleType + BsmtFullBath + 
    LandSlope + GarageArea + GarageType + GarageQual + LowQualFinSF + 
    WoodDeckSF + BsmtFinSF2 + ExterCond + EnclosedPorch + X1stFlrSF + 
    HalfBath, data = data)

summary(final_model)

## 
## Call:
## glm(formula = SalePrice ~ OverallQual + GrLivArea + Neighborhood + 
##     BsmtFinSF1 + OverallCond + GarageCars + TotalBsmtSF + RoofMatl + 
##     YearBuilt + LotArea + MSZoning + Condition2 + KitchenQual + 
##     SaleCondition + Functional + BsmtUnfSF + Condition1 + CentralAir + 
##     Fireplaces + Heating + HeatingQC + BsmtExposure + GarageCond + 
##     KitchenAbvGr + YearRemodAdd + Exterior1st + BsmtQual + PoolArea + 
##     LotConfig + ScreenPorch + Foundation + SaleType + BsmtFullBath + 
##     LandSlope + GarageArea + GarageType + GarageQual + LowQualFinSF + 
##     WoodDeckSF + BsmtFinSF2 + ExterCond + EnclosedPorch + X1stFlrSF + 
##     HalfBath, data = data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -187686   -10406      203    10194   187686  
## 
## Coefficients: (1 not defined because of singularities)
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.510e+06  1.750e+05  -8.627  < 2e-16 ***
## OverallQual           6.690e+03  9.857e+02   6.787 1.73e-11 ***
## GrLivArea             5.626e+01  2.687e+00  20.937  < 2e-16 ***
## NeighborhoodBlueste   5.692e+03  1.896e+04   0.300 0.764050    
## NeighborhoodBrDale    5.597e+03  1.022e+04   0.548 0.583905    
## NeighborhoodBrkSide   1.813e+04  8.318e+03   2.179 0.029487 *  
## NeighborhoodClearCr   4.694e+03  8.416e+03   0.558 0.577099    
## NeighborhoodCollgCr   8.634e+03  6.415e+03   1.346 0.178562    
## NeighborhoodCrawfor   2.898e+04  7.785e+03   3.722 0.000206 ***
## NeighborhoodEdwards  -1.962e+03  7.232e+03  -0.271 0.786234    
## NeighborhoodGilbert   8.265e+03  6.807e+03   1.214 0.224912    
## NeighborhoodIDOTRR    1.782e+04  9.605e+03   1.856 0.063732 .  
## NeighborhoodMeadowV  -6.387e+03  1.073e+04  -0.595 0.551632    
## NeighborhoodMitchel  -5.713e+03  7.385e+03  -0.774 0.439306    
## NeighborhoodNAmes     1.524e+03  6.988e+03   0.218 0.827423    
## NeighborhoodNoRidge   5.146e+04  7.558e+03   6.808 1.50e-11 ***
## NeighborhoodNPkVill   9.193e+03  1.081e+04   0.850 0.395208    
## NeighborhoodNridgHt   3.147e+04  6.866e+03   4.583 5.02e-06 ***
## NeighborhoodNWAmes    1.689e+03  7.271e+03   0.232 0.816313    
## NeighborhoodOldTown   1.075e+04  8.409e+03   1.278 0.201521    
## NeighborhoodSawyer    6.341e+03  7.372e+03   0.860 0.389876    
## NeighborhoodSawyerW   1.246e+04  7.141e+03   1.744 0.081339 .  
## NeighborhoodSomerst   1.786e+04  8.241e+03   2.167 0.030448 *  
## NeighborhoodStoneBr   4.429e+04  7.982e+03   5.549 3.47e-08 ***
## NeighborhoodSWISU     3.684e+03  9.004e+03   0.409 0.682514    
## NeighborhoodTimber    3.457e+03  7.405e+03   0.467 0.640732    
## NeighborhoodVeenker   1.153e+04  1.019e+04   1.132 0.258027    
## BsmtFinSF1            1.332e+01  4.516e+00   2.950 0.003238 ** 
## OverallCond           5.883e+03  8.514e+02   6.909 7.58e-12 ***
## GarageCars            3.001e+03  2.206e+03   1.360 0.173980    
## TotalBsmtSF           2.445e+01  5.593e+00   4.371 1.33e-05 ***
## RoofMatlCompShg       6.733e+05  3.049e+04  22.081  < 2e-16 ***
## RoofMatlMembran       7.387e+05  4.204e+04  17.570  < 2e-16 ***
## RoofMatlMetal         7.043e+05  4.119e+04  17.100  < 2e-16 ***
## RoofMatlRoll          6.717e+05  3.960e+04  16.964  < 2e-16 ***
## RoofMatlTar&Grv       6.666e+05  3.169e+04  21.039  < 2e-16 ***
## RoofMatlWdShake       6.948e+05  3.310e+04  20.994  < 2e-16 ***
## RoofMatlWdShngl       7.256e+05  3.188e+04  22.758  < 2e-16 ***
## YearBuilt             3.132e+02  7.124e+01   4.397 1.19e-05 ***
## LotArea               6.026e-01  9.483e-02   6.354 2.88e-10 ***
## MSZoningFV            3.792e+04  1.203e+04   3.152 0.001660 ** 
## MSZoningRH            3.662e+04  1.208e+04   3.032 0.002480 ** 
## MSZoningRL            3.761e+04  1.014e+04   3.710 0.000216 ***
## MSZoningRM            2.636e+04  9.540e+03   2.763 0.005803 ** 
## Condition2Feedr       2.765e+03  2.269e+04   0.122 0.903010    
## Condition2Norm       -7.113e+02  1.947e+04  -0.037 0.970866    
## Condition2PosA        6.267e+04  3.581e+04   1.750 0.080341 .  
## Condition2PosN       -2.177e+05  2.740e+04  -7.944 4.19e-15 ***
## Condition2RRAe       -2.555e+04  3.154e+04  -0.810 0.417955    
## Condition2RRAn       -2.020e+04  3.158e+04  -0.640 0.522453    
## Condition2RRNn        1.037e+04  2.673e+04   0.388 0.698050    
## KitchenQualFa        -2.507e+04  6.034e+03  -4.155 3.46e-05 ***
## KitchenQualGd        -3.207e+04  3.343e+03  -9.596  < 2e-16 ***
## KitchenQualTA        -2.960e+04  3.810e+03  -7.770 1.58e-14 ***
## SaleConditionAdjLand  1.538e+04  1.330e+04   1.156 0.247828    
## SaleConditionAlloca   5.194e+03  8.447e+03   0.615 0.538692    
## SaleConditionFamily   5.630e+02  6.203e+03   0.091 0.927699    
## SaleConditionNormal   7.095e+03  2.911e+03   2.437 0.014927 *  
## SaleConditionPartial -6.102e+03  1.523e+04  -0.401 0.688778    
## FunctionalMaj2        7.526e+03  1.350e+04   0.557 0.577400    
## FunctionalMin1        8.934e+03  8.289e+03   1.078 0.281315    
## FunctionalMin2        9.165e+03  8.238e+03   1.112 0.266150    
## FunctionalMod         2.157e+03  1.002e+04   0.215 0.829639    
## FunctionalSev        -3.922e+04  2.841e+04  -1.380 0.167671    
## FunctionalTyp         2.013e+04  7.101e+03   2.835 0.004657 ** 
## BsmtUnfSF            -5.785e+00  4.689e+00  -1.234 0.217560    
## Condition1Feedr       8.007e+03  4.990e+03   1.605 0.108836    
## Condition1Norm        1.384e+04  4.100e+03   3.375 0.000761 ***
## Condition1PosA        7.346e+03  9.953e+03   0.738 0.460620    
## Condition1PosN        7.136e+03  7.357e+03   0.970 0.332243    
## Condition1RRAe       -1.321e+04  9.273e+03  -1.424 0.154625    
## Condition1RRAn        5.242e+03  6.813e+03   0.769 0.441782    
## Condition1RRNe        3.304e+03  1.799e+04   0.184 0.854319    
## Condition1RRNn        5.062e+03  1.254e+04   0.404 0.686440    
## CentralAirY          -2.202e+03  3.672e+03  -0.600 0.548896    
## Fireplaces            2.686e+03  1.340e+03   2.004 0.045293 *  
## HeatingGasA          -4.495e+03  2.547e+04  -0.176 0.859961    
## HeatingGasW          -9.147e+03  2.622e+04  -0.349 0.727257    
## HeatingGrav          -2.373e+02  2.736e+04  -0.009 0.993081    
## HeatingOthW          -3.408e+04  3.133e+04  -1.088 0.276851    
## HeatingWall           8.715e+03  2.902e+04   0.300 0.763962    
## HeatingQCFa          -3.420e+03  4.612e+03  -0.741 0.458530    
## HeatingQCGd          -3.845e+03  2.102e+03  -1.829 0.067591 .  
## HeatingQCPo           7.007e+03  2.634e+04   0.266 0.790254    
## HeatingQCTA          -4.305e+03  2.078e+03  -2.071 0.038509 *  
## BsmtExposureGd        1.704e+04  2.962e+03   5.754 1.09e-08 ***
## BsmtExposureMn       -2.404e+03  2.933e+03  -0.820 0.412519    
## BsmtExposureNo       -4.876e+03  2.040e+03  -2.390 0.017009 *  
## GarageCondFa          1.022e+05  3.448e+04   2.965 0.003080 ** 
## GarageCondGd          1.004e+05  3.547e+04   2.831 0.004716 ** 
## GarageCondPo          1.131e+05  3.576e+04   3.162 0.001605 ** 
## GarageCondTA          1.063e+05  3.412e+04   3.114 0.001885 ** 
## KitchenAbvGr         -1.923e+04  3.764e+03  -5.110 3.71e-07 ***
## YearRemodAdd          9.116e+01  5.334e+01   1.709 0.087666 .  
## Exterior1stAsphShn   -9.359e+02  2.579e+04  -0.036 0.971055    
## Exterior1stBrkComm   -3.794e+03  2.010e+04  -0.189 0.850319    
## Exterior1stBrkFace    7.675e+03  7.175e+03   1.070 0.284939    
## Exterior1stCBlock    -1.671e+04  2.556e+04  -0.654 0.513315    
## Exterior1stCemntBd    1.991e+02  7.498e+03   0.027 0.978822    
## Exterior1stHdBoard   -9.290e+03  6.544e+03  -1.420 0.155957    
## Exterior1stImStucc   -2.709e+04  2.529e+04  -1.071 0.284345    
## Exterior1stMetalSd   -5.972e+03  6.352e+03  -0.940 0.347322    
## Exterior1stPlywood   -1.323e+04  6.869e+03  -1.927 0.054235 .  
## Exterior1stStone     -2.866e+04  2.005e+04  -1.429 0.153120    
## Exterior1stStucco    -2.523e+03  8.012e+03  -0.315 0.752860    
## Exterior1stVinylSd   -3.637e+03  6.405e+03  -0.568 0.570192    
## Exterior1stWd Sdng   -8.038e+03  6.344e+03  -1.267 0.205431    
## Exterior1stWdShing   -7.834e+03  7.923e+03  -0.989 0.322969    
## BsmtQualFa           -2.149e+04  6.141e+03  -3.499 0.000483 ***
## BsmtQualGd           -2.901e+04  3.237e+03  -8.962  < 2e-16 ***
## BsmtQualTA           -2.518e+04  4.039e+03  -6.234 6.11e-10 ***
## PoolArea              9.275e+01  1.804e+01   5.142 3.13e-07 ***
## LotConfigCulDSac      7.456e+03  3.152e+03   2.366 0.018150 *  
## LotConfigFR2         -7.694e+03  4.035e+03  -1.907 0.056791 .  
## LotConfigFR3         -1.909e+04  1.293e+04  -1.477 0.140046    
## LotConfigInside      -1.357e+03  1.768e+03  -0.768 0.442904    
## ScreenPorch           2.100e+01  1.245e+01   1.687 0.091807 .  
## FoundationCBlock      1.480e+03  3.167e+03   0.467 0.640461    
## FoundationPConc       3.313e+03  3.462e+03   0.957 0.338749    
## FoundationSlab        1.185e+04  7.626e+03   1.554 0.120503    
## FoundationStone       1.037e+04  1.075e+04   0.965 0.334650    
## FoundationWood       -1.760e+04  1.474e+04  -1.194 0.232753    
## SaleTypeCon           2.680e+04  1.819e+04   1.474 0.140855    
## SaleTypeConLD         8.499e+03  9.723e+03   0.874 0.382212    
## SaleTypeConLI         3.772e+03  1.180e+04   0.320 0.749289    
## SaleTypeConLw         2.598e+03  1.208e+04   0.215 0.829730    
## SaleTypeCWD           6.664e+03  1.319e+04   0.505 0.613552    
## SaleTypeNew           3.049e+04  1.579e+04   1.931 0.053649 .  
## SaleTypeOth           1.527e+04  1.486e+04   1.028 0.304280    
## SaleTypeWD           -7.195e+02  4.243e+03  -0.170 0.865378    
## BsmtFullBath          9.024e+01  1.804e+03   0.050 0.960118    
## LandSlopeMod          3.513e+03  3.536e+03   0.993 0.320684    
## LandSlopeSev         -3.149e+04  1.015e+04  -3.103 0.001956 ** 
## GarageArea            1.986e+01  7.569e+00   2.624 0.008786 ** 
## GarageTypeAttchd      2.655e+04  1.073e+04   2.475 0.013452 *  
## GarageTypeBasment     2.080e+04  1.215e+04   1.712 0.087092 .  
## GarageTypeBuiltIn     3.060e+04  1.113e+04   2.749 0.006061 ** 
## GarageTypeCarPort     2.631e+04  1.338e+04   1.966 0.049533 *  
## GarageTypeDetchd      2.768e+04  1.065e+04   2.599 0.009459 ** 
## GarageQualFa         -1.111e+05  2.950e+04  -3.767 0.000173 ***
## GarageQualGd         -1.085e+05  3.029e+04  -3.582 0.000353 ***
## GarageQualPo         -1.323e+05  3.414e+04  -3.875 0.000112 ***
## GarageQualTA         -1.110e+05  2.924e+04  -3.795 0.000154 ***
## LowQualFinSF         -5.826e+01  1.555e+01  -3.746 0.000188 ***
## WoodDeckSF            1.011e+01  5.882e+00   1.719 0.085876 .  
## BsmtFinSF2                   NA         NA      NA       NA    
## ExterCondFa          -9.153e+03  1.823e+04  -0.502 0.615618    
## ExterCondGd          -1.403e+04  1.733e+04  -0.810 0.418301    
## ExterCondPo          -4.359e+03  3.201e+04  -0.136 0.891718    
## ExterCondTA          -1.148e+04  1.730e+04  -0.664 0.507060    
## EnclosedPorch         1.979e+00  1.242e+01   0.159 0.873401    
## X1stFlrSF             1.108e+00  4.596e+00   0.241 0.809492    
## HalfBath             -7.676e+00  1.863e+03  -0.004 0.996712    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 570313259)
## 
##     Null deviance: 9.2079e+12  on 1459  degrees of freedom
## Residual deviance: 7.4597e+11  on 1308  degrees of freedom
## AIC: 33725
## 
## Number of Fisher Scoring iterations: 2

The variables which p-values are greater than the significance level of 5% are considered to be statistically insignificant. Since statistically insignificant variables don’t affect the outcome they weren’t included to the equation that calculates predicted ‘SalePrice’.

Let’s perform likelihood ratio test to confirm that the full model is in favor of the alternative reduced model. Removing predictor variables from a model will almost always make the model fit less well, but it is necessary to test whether the observed difference in model fit is statistically significant.

The following hypothesis were stated:

The Null Hypothesis: reduced model is true.

The Alternative Hypothesis: reduced model is false.

alternative_model <- glm(formula =SalePrice ~ OverallQual + GrLivArea + Neighborhood + BsmtFinSF1 + 
    OverallCond + GarageCars + TotalBsmtSF + RoofMatl + YearBuilt + 
    LotArea + MSZoning + Condition2 + KitchenQual + SaleCondition + 
    Functional + BsmtUnfSF + Condition1 + CentralAir + Fireplaces + 
    Heating + HeatingQC + BsmtExposure + GarageCond + KitchenAbvGr + 
    YearRemodAdd + Exterior1st + BsmtQual + PoolArea + LotConfig + 
    ScreenPorch + Foundation + SaleType + BsmtFullBath, data = data)

anova(final_model, alternative_model, test ="Chisq")

## Analysis of Deviance Table
## 
## Model 1: SalePrice ~ OverallQual + GrLivArea + Neighborhood + BsmtFinSF1 + 
##     OverallCond + GarageCars + TotalBsmtSF + RoofMatl + YearBuilt + 
##     LotArea + MSZoning + Condition2 + KitchenQual + SaleCondition + 
##     Functional + BsmtUnfSF + Condition1 + CentralAir + Fireplaces + 
##     Heating + HeatingQC + BsmtExposure + GarageCond + KitchenAbvGr + 
##     YearRemodAdd + Exterior1st + BsmtQual + PoolArea + LotConfig + 
##     ScreenPorch + Foundation + SaleType + BsmtFullBath + LandSlope + 
##     GarageArea + GarageType + GarageQual + LowQualFinSF + WoodDeckSF + 
##     BsmtFinSF2 + ExterCond + EnclosedPorch + X1stFlrSF + HalfBath
## Model 2: SalePrice ~ OverallQual + GrLivArea + Neighborhood + BsmtFinSF1 + 
##     OverallCond + GarageCars + TotalBsmtSF + RoofMatl + YearBuilt + 
##     LotArea + MSZoning + Condition2 + KitchenQual + SaleCondition + 
##     Functional + BsmtUnfSF + Condition1 + CentralAir + Fireplaces + 
##     Heating + HeatingQC + BsmtExposure + GarageCond + KitchenAbvGr + 
##     YearRemodAdd + Exterior1st + BsmtQual + PoolArea + LotConfig + 
##     ScreenPorch + Foundation + SaleType + BsmtFullBath
##   Resid. Df Resid. Dev  Df    Deviance  Pr(>Chi)    
## 1      1308 7.4597e+11                              
## 2      1329 7.7571e+11 -21 -2.9738e+10 0.0001823 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The results of Likelihood Ratio Test shows that null hypothesis is regected (as p-value is less than significance level of 5%). It would provide evidence that the full model is in favor of the reduced model.

The optimal model is described by the following equation:

data$SalePrice_pred <- -1.510e+06 + 6.690e+03*data$OverallQual + 5.626e+01*data$GrLivArea 
+ 1.813e+04*data$Neighborhood_BrkSide + 2.898e+04*data$Neighborhood_Crawfor + 5.146e+04*data$Neighborhood_NoRidge + 3.147e+04*data$Neighborhood_NridgHt + 1.786e+04*data$Neighborhood_Somerst + 4.429e+04*data$Neighborhood_StoneBr + 1.332e+01*data$BsmtFinSF1 + 5.883e+03*data$OverallCond + 2.445e+01*data$TotalBsmtSF + 6.733e+05*data$RoofMatl_CompShg + 7.387e+05*data$RoofMatl_Membran + 7.043e+05*data$RoofMatl_Metal + 6.717e+05*data$RoofMatl_Roll + 6.666e+05*data$RoofMatl_Tar_Grv + 6.948e+05*data$RoofMatl_WdShake + 7.256e+05*data$RoofMatl_WdShngl + 3.132e+02*data$YearBuilt + 6.026e-01*data$LotArea + 3.792e+04*data$MSZoning_FV + 3.662e+04*data$MSZoning_RH + 3.761e+04*data$MSZoning_RL + 2.636e+04*data$MSZoning_RM - 2.177e+05*data$Condition2_PosN - 2.507e+04*data$KitchenQual_Fa - 3.207e+04*data$KitchenQual_Gd - 2.960e+04*data$KitchenQual_TA + 7.095e+03*data$SaleCondition_Normal + 2.013e+04*data$Functional_Typ + 1.384e+04*data$Condition1_Norm + 2.686e+03*data$Fireplaces - 4.305e+03*data$HeatingQC_TA  + 1.704e+04*data$BsmtExposure_Gd - 4.876e+03*data$BsmtExposure_No + 1.022e+05*data$GarageCond_Fa +1.004e+05*data$GarageCond_Gd + 1.131e+05*data$GarageCond_Po+1.063e+05*data$GarageCond_TA -1.923e+04*data$KitchenAbvGr -2.149e+04*data$BsmtQual_Fa - 2.901e+04*data$BsmtQual_Gd-2.518e+04*data$BsmtQual_TA + 9.275e+01*data$PoolArea + 7.456e+03*data$LotConfigCulDSac + 1.986e+01*data$GarageArea +2.655e+04*data$GarageType_Attchd +3.060e+04*data$GarageType_BuiltIn + 2.631e+04*data$GarageType_CarPort + 2.768e+04*data$GarageType_Detchd - 1.111e+058*data$GarageQual_Fa - 1.085e+05*data$GarageQualGd - 1.323e+05*data$GarageQual_Po -1.110e+05*data$GarageQual_TA - 5.826e+01*data$LowQualFinSF

The intercept of -1.510e+06 specifies that the ‘SalePrice’ equals to -1.510e+06 when all other variables equal to 0 (which looks unrealistic since the price can be negative). Also, for example, the slope coefficient of 9.275e+01 specifies that each sq feet increase in ‘PoolArea’ increases the home price by 9.275e+01 while keeping all other variables constant. In addition, for instance, the price is increased by 7.043e+05 if house roof made from metal.

The equation shows that not only quantitative variables affects ‘SalePrice’ but also categorical variables. In order to include categorical variables to a model the dummy variables were created.

#Neighborhood
data$Neighborhood_Crawfor <- ifelse(data$Neighborhood == "Crawfor",1,0)
data$Neighborhood_Edwards <- ifelse(data$Neighborhood == "Edwards",1,0)
data$Neighborhood_MeadowV <- ifelse(data$Neighborhood == "MeadowV",1,0)
data$Neighborhood_Mitchel <- ifelse(data$Neighborhood == "Mitchel",1,0)
data$Neighborhood_NWAmes <- ifelse(data$Neighborhood == "NWAmes",1,0)
data$Neighborhood_StoneBr <- ifelse(data$Neighborhood == "StoneBr",1,0)
data$Neighborhood_NoRidge <- ifelse(data$Neighborhood == "NoRidge",1,0)
data$Neighborhood_NridgHt <- ifelse(data$Neighborhood == "NridgHt",1,0)
data$Neighborhood_BrkSide <- ifelse(data$Neighborhood == "BrkSide",1,0)
data$Neighborhood_Somerst <- ifelse(data$Neighborhood == "Somerst",1,0)
 
#Roofmatl
data$RoofMatl_CompShg <- ifelse(data$RoofMatl == "CompShg",1,0)
data$RoofMatl_Membran <- ifelse(data$RoofMatl == "Membran",1,0)
data$RoofMatl_Metal <- ifelse(data$RoofMatl == "Metal",1,0)
data$RoofMatl_Roll  <- ifelse(data$RoofMatl == "Roll",1,0)
data$RoofMatl_Tar_Grv <- ifelse(data$RoofMatl == "Tar&Grv",1,0)
data$RoofMatl_WdShake <- ifelse(data$RoofMatl == "WdShake",1,0)
data$RoofMatl_WdShngl <- ifelse(data$RoofMatl == "WdShngl",1,0)
data$RoofMatl_WdShngl <- ifelse(data$RoofMatl == "WdShngl",1,0)

#MSZoning
data$MSZoning_FV <- ifelse(data$MSZoning == "FV",1,0)
data$MSZoning_RH <- ifelse(data$MSZoning == "RH",1,0)
data$MSZoning_RL <- ifelse(data$MSZoning == "RL",1,0)
data$MSZoning_RM <- ifelse(data$MSZoning == "RM",1,0)

#Condition2
data$Condition2_PosA <- ifelse(data$Condition2 == "PosA",1,0)
data$Condition2_PosN <- ifelse(data$Condition2 == "PosN",1,0)

#KitchenQual
data$KitchenQual_Fa <- ifelse(data$KitchenQual == "Fa",1,0)
data$KitchenQual_Gd <- ifelse(data$KitchenQual == "Gd",1,0)
data$KitchenQual_TA <- ifelse(data$KitchenQual == "Fa",1,0)

#SaleCondition
data$SaleCondition_Normal <- ifelse(data$SaleCondition == "Normal",1,0)
data$SaleCondition_Partial <- ifelse(data$SaleCondition == "Partial",1,0)

#Functional
data$Functional_Maj2 <- ifelse(data$Functional == "Maj2",1,0)
data$Functional_Typ <- ifelse(data$Functional == "Typ",1,0)

#KitchenQual
data$KitchenQual_TA <- ifelse(data$KitchenQual == "Fa",1,0)

#Functional
data$Functional_Sev <- ifelse(data$Functional == "Sev",1,0)
data$Functional_Typ <- ifelse(data$Functional == "Typ",1,0)

#GarageQual
data$GarageQual_Fa <- ifelse(data$GarageQual == "Fa",1,0)
data$GarageQual_Gd <- ifelse(data$GarageQual == "Gd",1,0)
data$GarageQual_Po <- ifelse(data$GarageQual == "Po",1,0)
data$GarageQual_TA <- ifelse(data$GarageQual == "TA",1,0)

#Condition1
data$Condition1_Feedr <- ifelse(data$Condition1 == "Feedr",1,0)
data$Condition1_Norm <- ifelse(data$Condition1 == "Norm",1,0)
data$Condition1_PosN <- ifelse(data$Condition1 == "PosN",1,0)
data$Condition1_RRAe <- ifelse(data$Condition1 == "RRAe",1,0)

#CentralAir
data$CentralAir_Y <- ifelse(data$CentralAir == "Y",1,0)

#KitchenQual
data$KitchenQual_TA <- ifelse(data$KitchenQual == "Fa",1,0)

#HeatingQC
data$HeatingQC_Gd <- ifelse(data$HeatingQC == "Gd",1,0)
data$HeatingQC_TA <- ifelse(data$HeatingQC == "TA",1,0)   

#BsmtExposure
data$BsmtExposure_Gd <- ifelse(data$BsmtExposure == "Gd",1,0)
data$BsmtExposure_Mn <- ifelse(data$BsmtExposure == "Mn",1,0)
data$BsmtExposure_No <- ifelse(data$BsmtExposure == "No",1,0)

#BsmtQual
data$BsmtQual_Gd <- ifelse(data$BsmtQual == "Gd",1,0)
data$BsmtQual_TA <- ifelse(data$BsmtQual == "TA",1,0)
data$BsmtQual_Fa <- ifelse(data$BsmtQual == "Fa",1,0)

#LotConfig
data$LotConfig_FR2 <- ifelse(data$LotConfig == "FR2",1,0)
data$LotConfig_Inside <- ifelse(data$LotConfig == "Inside",1,0)
data$LotConfig_CulDSac<- ifelse(data$LotConfig == "CulDSac",1,0)

#Foundation
data$Foundation_Stone <- ifelse(data$Foundation == "Stone",1,0)
data$Foundation_Wood <- ifelse(data$Foundation == "Wood",1,0)
data$Foundation_PConc <- ifelse(data$Foundation == "PConc",1,0)

#Exterior1st
data$Exterior1st_BrkComm <- ifelse(data$Exterior1st == "BrkComm",1,0)
data$Exterior1st_BrkFace <- ifelse(data$Exterior1st == "BrkFace",1,0)

#SaleType
data$SaleType_ConLD <- ifelse(data$SaleType == "ConLD",1,0)
data$SaleType_New <- ifelse(data$SaleType == "New",1,0)

#GarageQual
data$GarageQual_Gd <- ifelse(data$GarageQual  == "Gd",1,0)  
data$GarageQual_Po <- ifelse(data$GarageQual  == "Po",1,0) 
data$GarageQual_TA <- ifelse(data$GarageQual  == "TA",1,0) 
data$GarageQual_Fa <- ifelse(data$GarageQual  == "Fa",1,0) 
data$GarageQual_TA <- ifelse(data$GarageQual  == "TA",1,0) 

#GarageCond
data$GarageCond_Gd <- ifelse(data$GarageCond  == "Gd",1,0) 
data$GarageCond_Po <- ifelse(data$GarageCond  == "Po",1,0) 
data$GarageCond_TA <- ifelse(data$GarageCond  == "TA",1,0) 
data$GarageCond_Fa <- ifelse(data$GarageCond  == "Fa",1,0) 
data$GarageCond_Fa <- ifelse(data$GarageCond  == "Fa",1,0) 

#LandSlope
data$LandSlope_Sev <- ifelse(data$LandSlope  == "Sev",1,0) 

#Street
data$Street_Pave <- ifelse(data$Street  == "Pave",1,0) 

#GarageType
data$GarageType_Attchd <- ifelse(data$GarageType  == "Attchd",1,0) 
data$GarageType_BuiltIn <- ifelse(data$GarageType  == "BuiltIn",1,0) 
data$GarageType_CarPort <- ifelse(data$GarageType  == "CarPort",1,0) 
data$GarageType_Detchd <- ifelse(data$GarageType  == "Detchd",1,0) 

#LotConfig
data$LotConfig_CulDSac <- ifelse(data$LotConfig  == "CulDSac",1,0)

Now we can calculate ‘SalePrice’ for training data set in order to compare in with actual ‘SalePrice’.

data$SalePrice_pred  <- c()

data$SalePrice_pred <- -1.510e+06 + 6.690e+03*data$OverallQual + 5.626e+01*data$GrLivArea 
+ 1.813e+04*data$Neighborhood_BrkSide + 2.898e+04*data$Neighborhood_Crawfor + 5.146e+04*data$Neighborhood_NoRidge + 3.147e+04*data$Neighborhood_NridgHt + 1.786e+04*data$Neighborhood_Somerst + 4.429e+04*data$Neighborhood_StoneBr + 1.332e+01*data$BsmtFinSF1 + 5.883e+03*data$OverallCond + 2.445e+01*data$TotalBsmtSF + 6.733e+05*data$RoofMatl_CompShg + 7.387e+05*data$RoofMatl_Membran + 7.043e+05*data$RoofMatl_Metal + 6.717e+05*data$RoofMatl_Roll + 6.666e+05*data$RoofMatl_Tar_Grv + 6.948e+05*data$RoofMatl_WdShake + 7.256e+05*data$RoofMatl_WdShngl + 3.132e+02*data$YearBuilt + 6.026e-01*data$LotArea + 3.792e+04*data$MSZoning_FV + 3.662e+04*data$MSZoning_RH + 3.761e+04*data$MSZoning_RL + 2.636e+04*data$MSZoning_RM - 2.177e+05*data$Condition2_PosN - 2.507e+04*data$KitchenQual_Fa - 3.207e+04*data$KitchenQual_Gd - 2.960e+04*data$KitchenQual_TA + 7.095e+03*data$SaleCondition_Normal + 2.013e+04*data$Functional_Typ + 1.384e+04*data$Condition1_Norm + 2.686e+03*data$Fireplaces - 4.305e+03*data$HeatingQC_TA  + 1.704e+04*data$BsmtExposure_Gd - 4.876e+03*data$BsmtExposure_No + 1.022e+05*data$GarageCond_Fa +1.004e+05*data$GarageCond_Gd + 1.131e+05*data$GarageCond_Po+1.063e+05*data$GarageCond_TA -1.923e+04*data$KitchenAbvGr -2.149e+04*data$BsmtQual_Fa - 2.901e+04*data$BsmtQual_Gd-2.518e+04*data$BsmtQual_TA + 9.275e+01*data$PoolArea + 7.456e+03*data$LotConfig_CulDSac + 1.986e+01*data$GarageArea -3.149e+04*data$LandSlope_Sev +2.655e+04*data$GarageType_Attchd +3.060e+04*data$GarageType_BuiltIn + 2.631e+04*data$GarageType_CarPort + 2.768e+04*data$GarageType_Detchd - 1.111e+058*data$GarageQual_Fa - 1.085e+05*data$GarageQual_Gd - 1.323e+05*data$GarageQual_Po -1.110e+05*data$GarageQual_TA - 5.826e+01*data$LowQualFinSF

Let’s run Kolmogorov-Smirnov test to check how close predicted ‘SalePrice’ to actual ‘SalePrice’.

#sagnificance of difference
ks.test(data$SalePrice_pred,data$SalePrice)

## 
##  Two-sample Kolmogorov-Smirnov test
## 
## data:  data$SalePrice_pred and data$SalePrice
## D = 1, p-value < 2.2e-16
## alternative hypothesis: two-sided

Kolmogorov-Smirnov test rejected the null hypothesis that states that the difference between actual and predicted sales is not significant (as the p-value is less that the level of significance of 5%).

Let’s take a look at residual deviance, that shows how well the response is predicted by the model when the predictors are included. In addition, residual deviance is used to test whether a regression model provides an adequate fit for the data or not.

The following hypothesis were stated:

The Null Hypothesis: regression model provides an adequate fit for the data.

The Alternative Hypothesis: regression model doesn’t provide an adequate fit for the data.

#residual deviance test
p_value = 1 - pchisq(final_model$deviance,final_model$df.residual)
p_value

## [1] 0

As p-value is less than the significance level of 5% the null hypothesis is rejected. It suggests that the fitted values are significantly different from the observed values.

Another Hosmer-Lemeshow Test can be performed to compare actual outcome with predicted outcome.

The following hypothesis were stated:

The Null Hypothesis: there is no significance difference between observed and predicted values.

The Alternative Hypothesis: there is a significance difference between observed and predicted values.

hoslem.test(data$SalePrice, fitted(final_model))

## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  data$SalePrice, fitted(final_model)
## X-squared = -2.1237, df = 8, p-value = 1

Our model doesn’t appear to fit well because we have significant difference between the model and the observed data as the p-value is above the significance level of 5%.

Let’s calculate SalePrice for testing data set.

data_testing$SalePrice  <- c()

data$SalePrice <- -1.510e+06 + 6.690e+03*data$OverallQual + 5.626e+01*data$GrLivArea 
+ 5.883e+03*data$OverallCond + 2.445e+01*data$TotalBsmtSF + 6.733e+05*data$RoofMatl_CompShg + 7.387e+05*data$RoofMatl_Membran + 7.043e+05*data$RoofMatl_Metal + 6.717e+05*data$RoofMatl_Roll + 6.666e+05*data$RoofMatl_Tar_Grv + 6.948e+05*data$RoofMatl_WdShake + 7.256e+05*data$RoofMatl_WdShngl + 3.132e+02*data$YearBuilt + 6.026e-01*data$LotArea + 3.792e+04*data$MSZoning_FV + 3.662e+04*data$MSZoning_RH + 3.761e+04*data$MSZoning_RL + 2.636e+04*data$MSZoning_RM - 2.177e+05*data$Condition2_PosN - 2.507e+04*data$KitchenQual_Fa - 3.207e+04*data$KitchenQual_Gd - 2.960e+04*data$KitchenQual_TA + 7.095e+03*data$SaleCondition_Normal + 2.013e+04*data$Functional_Typ + 1.384e+04*data$Condition1_Norm + 2.686e+03*data$Fireplaces - 4.305e+03*data$HeatingQC_TA  + 1.704e+04*data$BsmtExposure_Gd - 4.876e+03*data$BsmtExposure_No + 1.022e+05*data$GarageCond_Fa +1.004e+05*data$GarageCond_Gd + 1.131e+05*data$GarageCond_Po+1.063e+05*data$GarageCond_TA -1.923e+04*data$KitchenAbvGr -2.149e+04*data$BsmtQual_Fa - 2.901e+04*data$BsmtQual_Gd-2.518e+04*data$BsmtQual_TA + 9.275e+01*data$PoolArea + 7.456e+03*data$LotConfig_CulDSac + 1.986e+01*data$GarageArea -3.149e+04*data$LandSlope_Sev +2.655e+04*data$GarageType_Attchd +3.060e+04*data$GarageType_BuiltIn + 2.631e+04*data$GarageType_CarPort + 2.768e+04*data$GarageType_Detchd - 1.111e+058*data$GarageQual_Fa - 1.085e+05*data$GarageQual_Gd - 1.323e+05*data$GarageQual_Po -1.110e+05*data$GarageQual_TA - 5.826e+01*data$LowQualFinSF

House Prices Project

Olya Fomicheva

5/20/2018

INTRODUCTION

DATA EXPLORATION

DATA PREPARATION

Pick one of the quantitative independent variables from the training data set (train.csv) , and define that variable as X. Make sure this variable is skewed to the right!

Pick the dependent variable and define it as Y.

Probability

Calculate as a minimum the below probabilities a through c. Assume the small letter “x” is estimated as the 1st quartile of the X variable, and the small letter “y” is estimated as the 1st quartile of the Y variable. Interpret the meaning of all probabilities.

Descriptive and Inferential Statistics

Linear Algebra and Correlation

Calculus-Based Probability & Statistics

Modeling