library(MASS)
setwd('C:/Users/Sanjiv/Desktop/Math547/Week4/')

d <- read.csv("house_selling_prices_OR.csv")
str(d)
## 'data.frame':    200 obs. of  12 variables:
##  $ House.Price    : num  232500 470000 150000 167500 210000 ...
##  $ HP.in.thousands: num  232 470 150 168 210 ...
##  $ House.Size     : int  1679 4494 2542 1094 1838 4156 3222 1556 2846 1596 ...
##  $ Acres          : num  0.23 0.52 0.11 0.18 0.19 0.22 0.16 0.16 1.08 0.04 ...
##  $ Lot.Size       : num  10019 22651 4792 7841 8276 ...
##  $ Bedrooms       : int  3 5 4 2 4 3 4 2 3 3 ...
##  $ T.Bath         : num  1.5 4 0 1 2 3.5 2.5 2 2 2.5 ...
##  $ Age            : int  35 38 5 65 33 3 6 34 43 9 ...
##  $ Garage         : int  1 1 1 0 1 1 1 1 1 1 ...
##  $ Condition      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age.Category   : Factor w/ 4 levels "M","N","O","VO": 1 1 2 3 1 2 2 1 1 2 ...
##  $ House.Age      : int  3 3 4 2 3 4 4 3 3 4 ...
# strip "$" and "," characters (if present) and make sure prices are numeric
d$House.Price <- as.numeric(gsub("[$,]", "", d$House.Price))

Creating a Binary Variable

Creating a binary variable price01 that equals 1 if House.Price is above its median and 0 otherwise.

d$price01 <- ifelse(d$House.Price > median(d$House.Price), 1, 0) 
table(d$price01)
## 
##   0   1 
## 102  98
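
The split is not exactly 100/100 because houses priced exactly at the median receive price01 = 0 under the strict inequality. A quick check of how many prices tie the median:

sum(d$House.Price == median(d$House.Price))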

Exploring Relationships

Exploring relationships between the other variables and price01, and trying to identify variables that are useful for predicting price01.

boxplot(House.Price ~ price01, data = d, xlab = "price01", ylab = "House.Price")

boxplot(HP.in.thousands ~ price01, data = d, xlab = "price01", ylab = "HP.in.thousands")

The categories are perfectly separated, as expected, since the data is split at the median price.

boxplot(House.Size ~ price01, data = d, xlab = "price01", ylab = "House.Size")

Larger houses are typically pricier.

boxplot(Acres ~ price01, data = d, xlab = "price01", ylab = "Acres")

boxplot(Lot.Size ~ price01, data = d, xlab = "price01", ylab = "Lot.Size")

boxplot(log(Acres) ~ price01, data = d, xlab = "price01", ylab = "log(Acres)")

More expensive houses tend to have slightly larger lots. However, both distributions are severely right-skewed, with many outliers. Taking the log partly mitigates the skew, but these are probably not good predictors, since their distributions remain far from normal.
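
As a rough numeric check of the skewness, here is a small base-R sketch (skew is a helper defined here, not a package function):

skew <- function(x) mean((x - mean(x))^3) / sd(x)^3  # crude sample skewness
skew(d$Acres)       # expected to be large and positive
skew(log(d$Acres))  # expected to be much closer to zero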

table(d$Bedrooms, d$price01)
##    
##      0  1
##   0  4  3
##   1  3  2
##   2 19  4
##   3 67 49
##   4  7 28
##   5  2  9
##   6  0  1
##   8  0  2
boxplot(Bedrooms ~ price01, data = d, xlab = "price01", ylab = "Bedrooms")

More expensive houses tend to have more bedrooms.

table(d$T.Bath, d$price01)
##      
##        0  1
##   0    4  2
##   1   29  2
##   1.5 20  4
##   2   31 30
##   2.5 18 31
##   3    0 18
##   3.5  0  9
##   4    0  2
boxplot(T.Bath ~ price01, data = d, xlab = "price01", ylab = "T.Bath")

More expensive houses tend to have more bathrooms.

boxplot(Age ~ price01, data = d, xlab = "price01", ylab = "Age")

The median age of the more expensive houses is slightly lower.

table(d$Garage, d$price01)
##    
##      0  1
##   0 25 22
##   1 77 76

The distribution is about the same in both categories.

table(d$Condition, d$price01)
##    
##      0  1
##   0 90 84
##   1 12 14

The distribution is about the same in both categories.

table(d$Age.Category, d$price01)
##     
##       0  1
##   M  42 30
##   N  31 47
##   O  25 12
##   VO  4  9
boxplot(House.Age ~ price01, data = d, xlab = "price01", ylab = "House.Age")

Instead of the categorical age variables Age.Category and House.Age, we will use the more informative numeric predictor, Age.

So the important predictors appear to be House.Size, Bedrooms, T.Bath, and Age.
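
As a quick numeric complement to the plots, one can check the correlation of each candidate predictor with price01 (a point-biserial correlation; sketch only):

sapply(d[, c("House.Size", "Bedrooms", "T.Bath", "Age")],
       function(x) cor(x, d$price01))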

Note: House.Price and HP.in.thousands are not used as predictors, since the price is assumed to be unknown and is exactly what we are trying to predict.

Creating Training and Test Sets

Splitting the data into training and test sets (150 observations for training, 50 for testing).

set.seed(28)
# draw 150 of the 200 row indices without replacement for the training set
ids <- sample(x = 200, size = 150, replace = FALSE)
train <- d[ids, ]
test <- d[-ids, ]
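
A quick sanity check that both classes remain well represented in the training set:

table(train$price01)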

LDA

Performing LDA.

lda.fit <- lda(price01 ~ House.Size + Bedrooms + T.Bath + Age, data = train)
lda.fit
## Call:
## lda(price01 ~ House.Size + Bedrooms + T.Bath + Age, data = train)
## 
## Prior probabilities of groups:
##        0        1 
## 0.553333 0.446667 
## 
## Group means:
##   House.Size Bedrooms  T.Bath     Age
## 0    1793.86  2.75904 1.62048 35.8795
## 1    3419.82  3.32836 2.49254 32.8955
## 
## Coefficients of linear discriminants:
##                     LD1
## House.Size  0.000634223
## Bedrooms   -0.202447573
## T.Bath      1.206030974
## Age         0.010986818
lda.pred <- predict(lda.fit, test)
lda.class <- lda.pred$class

# confusion matrix
table(lda.class, test$price01)
##          
## lda.class  0  1
##         0 19 10
##         1  0 21
# test error rate
mean(lda.class != test$price01)
## [1] 0.2

The test error rate is 0.2.
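
predict() for an lda fit also returns posterior probabilities, and the class labels above correspond to a 0.5 posterior cutoff. As a sketch, a stricter (arbitrarily chosen) cutoff of 0.7 for class 1 would give:

lda.thresh <- ifelse(lda.pred$posterior[, "1"] > 0.7, 1, 0)
mean(lda.thresh != test$price01)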

QDA

Performing QDA.

qda.fit <- qda(price01 ~ House.Size + Bedrooms + T.Bath + Age, data = train)
qda.fit
## Call:
## qda(price01 ~ House.Size + Bedrooms + T.Bath + Age, data = train)
## 
## Prior probabilities of groups:
##        0        1 
## 0.553333 0.446667 
## 
## Group means:
##   House.Size Bedrooms  T.Bath     Age
## 0    1793.86  2.75904 1.62048 35.8795
## 1    3419.82  3.32836 2.49254 32.8955
qda.pred <- predict(qda.fit, test)
qda.class <- qda.pred$class

# confusion matrix
table(qda.class, test$price01)
##          
## qda.class  0  1
##         0 18  9
##         1  1 22
# test error rate
mean(qda.class != test$price01)
## [1] 0.2

The test error rate is 0.2, the same as with LDA, although the confusion matrix is slightly different.
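
To see exactly where the two classifiers disagree on the test set, we can cross-tabulate their predictions:

table(lda.class, qda.class)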

Logistic Regression

Performing logistic regression.

log.fit <- glm(price01 ~ House.Size + Bedrooms + T.Bath + Age,
               data = train, family = "binomial")
summary(log.fit)
## 
## Call:
## glm(formula = price01 ~ House.Size + Bedrooms + T.Bath + Age, 
##     family = "binomial", data = train)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -2.687  -0.523  -0.144   0.288   2.040  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.597165   1.680354   -5.12  3.1e-07 ***
## House.Size   0.002360   0.000516    4.57  4.8e-06 ***
## Bedrooms    -0.898709   0.446739   -2.01   0.0443 *  
## T.Bath       2.343888   0.821734    2.85   0.0043 ** 
## Age          0.022850   0.013815    1.65   0.0981 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 206.234  on 149  degrees of freedom
## Residual deviance:  96.851  on 145  degrees of freedom
## AIC: 106.9
## 
## Number of Fisher Scoring iterations: 6
log.pred <- predict(log.fit, test, type = "response") # predicted probabilities (0-1 scale)
log.class <- ifelse(log.pred > 0.5, 1, 0)

# confusion matrix
table(log.class, test$price01)
##          
## log.class  0  1
##         0 15  8
##         1  4 23
# test error rate
mean(log.class != test$price01)
## [1] 0.24

The test error rate is 0.24, slightly higher than with the previous methods.
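
The 0.5 cutoff used above is just the conventional default; as a quick sketch, the test error rate can be recomputed over a few alternative thresholds:

sapply(c(0.3, 0.5, 0.7),
       function(t) mean(ifelse(log.pred > t, 1, 0) != test$price01))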

Note that if we include the House.Price variable, we run into problems.

log.fit <- glm(price01 ~ House.Price + House.Size + Bedrooms + T.Bath + Age,
               data = train, family = "binomial")
summary(log.fit)
## 
## Call:
## glm(formula = price01 ~ House.Price + House.Size + Bedrooms + 
##     T.Bath + Age, family = "binomial", data = train)
## 
## Deviance Residuals: 
##       Min         1Q     Median         3Q        Max  
## -0.000266   0.000000   0.000000   0.000000   0.000330  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.50e+03   3.08e+05    0.00     1.00
## House.Price  6.26e-03   6.72e-01    0.01     0.99
## House.Size  -1.06e-02   4.28e+00    0.00     1.00
## Bedrooms     1.86e+00   9.01e+04    0.00     1.00
## T.Bath      -1.21e+01   1.64e+04    0.00     1.00
## Age          1.09e+00   2.72e+02    0.00     1.00
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2.0623e+02  on 149  degrees of freedom
## Residual deviance: 2.4250e-07  on 144  degrees of freedom
## AIC: 12
## 
## Number of Fisher Scoring iterations: 25

Warning messages about non-convergence were produced, because the House.Price variable provides perfect separation of the classes. Indeed, the maximum value in category 0 is less than the minimum value in category 1:

max(train[train$price01 == 0,]$House.Price)
## [1] 242500
min(train[train$price01 == 1,]$House.Price)
## [1] 244000
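
Consistent with the separation, the fitted probabilities on the training set should be numerically 0 or 1 (a quick check, output not shown):

range(fitted(log.fit))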

Let's check the model's predictions anyway.

log.pred <- predict(log.fit, test, type = "response") # predicted probabilities (0-1 scale)
log.class <- ifelse(log.pred > 0.5, 1, 0)

# confusion matrix
table(log.class, test$price01)
##          
## log.class  0  1
##         0 19  0
##         1  0 31
# test error rate
mean(log.class != test$price01)
## [1] 0

The predictions are perfect in this case, which is expected: price01 is defined directly from House.Price, so a model that sees the price can recover the label exactly.

Conclusion

LDA and QDA achieved the same test error rate (0.2), while logistic regression showed a slightly higher test error rate (0.24). So the LDA and QDA models are more appropriate in this case.