library(MASS)
setwd('C:/Users/Sanjiv/Desktop/Math547/Week4/')
d <- read.csv("house_selling_prices_OR.csv")
str(d)
## 'data.frame': 200 obs. of 12 variables:
## $ House.Price : num 232500 470000 150000 167500 210000 ...
## $ HP.in.thousands: num 232 470 150 168 210 ...
## $ House.Size : int 1679 4494 2542 1094 1838 4156 3222 1556 2846 1596 ...
## $ Acres : num 0.23 0.52 0.11 0.18 0.19 0.22 0.16 0.16 1.08 0.04 ...
## $ Lot.Size : num 10019 22651 4792 7841 8276 ...
## $ Bedrooms : int 3 5 4 2 4 3 4 2 3 3 ...
## $ T.Bath : num 1.5 4 0 1 2 3.5 2.5 2 2 2.5 ...
## $ Age : int 35 38 5 65 33 3 6 34 43 9 ...
## $ Garage : int 1 1 1 0 1 1 1 1 1 1 ...
## $ Condition : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Age.Category : Factor w/ 4 levels "M","N","O","VO": 1 1 2 3 1 2 2 1 1 2 ...
## $ House.Age : int 3 3 4 2 3 4 4 3 3 4 ...
# strip dollar signs and commas, then convert prices to numeric
d$House.Price <- as.numeric(gsub("[$,]","", d$House.Price))
Create Binary Variable
Creating a binary variable price01 that contains 1 if House.Price is above its median, and 0 otherwise.
d$price01 <- ifelse(d$House.Price > median(d$House.Price), 1, 0)
table(d$price01)
##
## 0 1
## 102 98
The split is not exactly 100/100 because houses priced exactly at the median value are assigned to the 0 group.
Exploring Relationship
Exploring relationships with the price01 variable, and trying to identify important variables for predicting price01.
boxplot(House.Price ~ price01, data = d, xlab = "price01", ylab = "House.Price")
boxplot(HP.in.thousands ~ price01, data = d, xlab = "price01", ylab = "HP.in.thousands")
The categories are clearly separated, which is expected since the data were split at the median price.
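As a quick numeric sanity check, a small sketch comparing the price ranges of the two groups directly:
# price range within each group; ties at the median fall in group 0
tapply(d$House.Price, d$price01, range)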
boxplot(House.Size ~ price01, data = d, xlab = "price01", ylab = "House.Size")
Larger houses are typically pricier.
boxplot(Acres ~ price01, data = d, xlab = "price01", ylab = "Acres")
boxplot(Lot.Size ~ price01, data = d, xlab = "price01", ylab = "Lot.Size")
boxplot(log(Acres) ~ price01, data = d, xlab = "price01", ylab = "log(Acres)")
More expensive houses tend to have slightly larger lots. However, the distributions are severely skewed, with many outliers. Taking the log partly mitigates this, but these are probably not good predictors, since their distributions are far from normal.
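To back up the skew claim numerically, a minimal base-R sketch compares group means against medians and applies the Shapiro-Wilk normality test to the raw and logged values:
# a mean far above the median indicates right skew
tapply(d$Acres, d$price01, function(x) c(mean = mean(x), median = median(x)))
shapiro.test(d$Acres)      # tiny p-value: strongly non-normal
shapiro.test(log(d$Acres)) # the log transform should look closer to normal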
table(d$Bedrooms, d$price01)
##
## 0 1
## 0 4 3
## 1 3 2
## 2 19 4
## 3 67 49
## 4 7 28
## 5 2 9
## 6 0 1
## 8 0 2
boxplot(Bedrooms ~ price01, data = d, xlab = "price01", ylab = "Bedrooms")
More expensive houses tend to have a higher number of bedrooms.
table(d$T.Bath, d$price01)
##
## 0 1
## 0 4 2
## 1 29 2
## 1.5 20 4
## 2 31 30
## 2.5 18 31
## 3 0 18
## 3.5 0 9
## 4 0 2
boxplot(T.Bath ~ price01, data = d, xlab = "price01", ylab = "T.Bath")
More expensive houses tend to have a higher number of bathrooms.
boxplot(Age ~ price01, data = d, xlab = "price01", ylab = "Age")
The median age of the more expensive houses is slightly lower.
table(d$Garage, d$price01)
##
## 0 1
## 0 25 22
## 1 77 76
The distribution is about the same in both categories.
table(d$Condition, d$price01)
##
## 0 1
## 0 90 84
## 1 12 14
The distribution is about the same in both categories.
table(d$Age.Category, d$price01)
##
## 0 1
## M 42 30
## N 31 47
## O 25 12
## VO 4 9
boxplot(House.Age ~ price01, data = d, xlab = "price01", ylab = "House.Age")
Instead of the categorical age predictors we will use the more complete predictor, Age.
So the important predictors are House.Size, Bedrooms, T.Bath, and Age.
Note: House.Price and HP.in.thousands were not used, since the price is assumed to be unknown and is exactly what we want to predict.
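Before fitting, a quick visual overview of the four chosen predictors is a useful sketch (black points: below-median houses; red: above):
# scatterplot matrix of the selected predictors, colored by price01
pairs(d[, c("House.Size", "Bedrooms", "T.Bath", "Age")],
      col = d$price01 + 1, pch = 20)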
Creating Training and Test Sets
Splitting the data into training and test sets (150 observations for training, 50 for testing).
set.seed(28)
ids <- sample(x = 200, size = 150, replace = FALSE)  # row indices for the training set
train <- d[ids,]
test <- d[-ids,]
LDA
Performing LDA.
lda.fit <- lda(price01 ~ House.Size + Bedrooms + T.Bath + Age, data = train)
lda.fit
## Call:
## lda(price01 ~ House.Size + Bedrooms + T.Bath + Age, data = train)
##
## Prior probabilities of groups:
## 0 1
## 0.553333 0.446667
##
## Group means:
## House.Size Bedrooms T.Bath Age
## 0 1793.86 2.75904 1.62048 35.8795
## 1 3419.82 3.32836 2.49254 32.8955
##
## Coefficients of linear discriminants:
## LD1
## House.Size 0.000634223
## Bedrooms -0.202447573
## T.Bath 1.206030974
## Age 0.010986818
lda.pred <- predict(lda.fit, test)
lda.class <- lda.pred$class
# confusion matrix
table(lda.class, test$price01)
##
## lda.class 0 1
## 0 19 10
## 1 0 21
# test error rate
mean(lda.class != test$price01)
## [1] 0.2
The test error rate is 0.2.
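The class labels come from thresholding the LDA posterior probabilities at 0.5; as a sketch, the threshold can be moved to trade one error type for the other (0.4 below is an arbitrary illustration, not a tuned value):
# reclassify using a lower posterior threshold for class 1
lda.class.04 <- ifelse(lda.pred$posterior[, "1"] > 0.4, 1, 0)
table(lda.class.04, test$price01)
mean(lda.class.04 != test$price01)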
QDA
Performing QDA.
qda.fit <- qda(price01 ~ House.Size + Bedrooms + T.Bath + Age, data = train)
qda.fit
## Call:
## qda(price01 ~ House.Size + Bedrooms + T.Bath + Age, data = train)
##
## Prior probabilities of groups:
## 0 1
## 0.553333 0.446667
##
## Group means:
## House.Size Bedrooms T.Bath Age
## 0 1793.86 2.75904 1.62048 35.8795
## 1 3419.82 3.32836 2.49254 32.8955
qda.pred <- predict(qda.fit, test)
qda.class <- qda.pred$class
# confusion matrix
table(qda.class, test$price01)
##
## qda.class 0 1
## 0 18 9
## 1 1 22
# test error rate
mean(qda.class != test$price01)
## [1] 0.2
The test error rate is 0.2, the same as with LDA, but the confusion matrix is slightly different.
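As a quick diagnostic sketch, cross-tabulating the two sets of predictions shows exactly where LDA and QDA disagree on the test houses:
# how often do LDA and QDA assign different labels to the same test house?
table(lda.class, qda.class)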
Logistic Regression
Performing logistic regression.
log.fit <- glm(price01 ~ House.Size + Bedrooms + T.Bath + Age,
data = train, family = "binomial")
summary(log.fit)
##
## Call:
## glm(formula = price01 ~ House.Size + Bedrooms + T.Bath + Age,
## family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.687 -0.523 -0.144 0.288 2.040
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.597165 1.680354 -5.12 3.1e-07 ***
## House.Size 0.002360 0.000516 4.57 4.8e-06 ***
## Bedrooms -0.898709 0.446739 -2.01 0.0443 *
## T.Bath 2.343888 0.821734 2.85 0.0043 **
## Age 0.022850 0.013815 1.65 0.0981 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 206.234 on 149 degrees of freedom
## Residual deviance: 96.851 on 145 degrees of freedom
## AIC: 106.9
##
## Number of Fisher Scoring iterations: 6
log.pred <- predict(log.fit, test, type = "response") # predict on 0..1 scale
log.class <- ifelse(log.pred > 0.5, 1, 0)
# confusion matrix
table(log.class, test$price01)
##
## log.class 0 1
## 0 15 8
## 1 4 23
# test error rate
mean(log.class != test$price01)
## [1] 0.24
The test error rate is 0.24, higher than with the previous methods.
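Since logistic regression and LDA both estimate P(price01 = 1), a natural diagnostic sketch is to plot the two sets of predicted test-set probabilities against each other:
# points near the dashed identity line mean the two models largely agree
plot(lda.pred$posterior[, "1"], log.pred,
     xlab = "LDA posterior", ylab = "Logistic regression probability")
abline(0, 1, lty = 2)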
Note that if we use the House.Price variable, we will run into problems.
log.fit <- glm(price01 ~ House.Price + House.Size + Bedrooms + T.Bath + Age,
data = train, family = "binomial")
summary(log.fit)
##
## Call:
## glm(formula = price01 ~ House.Price + House.Size + Bedrooms +
## T.Bath + Age, family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.000266 0.000000 0.000000 0.000000 0.000330
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.50e+03 3.08e+05 0.00 1.00
## House.Price 6.26e-03 6.72e-01 0.01 0.99
## House.Size -1.06e-02 4.28e+00 0.00 1.00
## Bedrooms 1.86e+00 9.01e+04 0.00 1.00
## T.Bath -1.21e+01 1.64e+04 0.00 1.00
## Age 1.09e+00 2.72e+02 0.00 1.00
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2.0623e+02 on 149 degrees of freedom
## Residual deviance: 2.4250e-07 on 144 degrees of freedom
## AIC: 12
##
## Number of Fisher Scoring iterations: 25
Warning messages about non-convergence are produced, because the House.Price variable perfectly separates the two classes. We can see that the maximum value in the 0 category is less than the minimum value in the 1 category:
max(train[train$price01 == 0,]$House.Price)
## [1] 242500
min(train[train$price01 == 1,]$House.Price)
## [1] 244000
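As a sketch, the same check can be run programmatically over all candidate predictors; only House.Price should come back TRUE:
# flag predictors whose class-0 and class-1 ranges do not overlap at all
vars <- c("House.Price", "House.Size", "Bedrooms", "T.Bath", "Age")
sapply(vars, function(v)
  max(train[train$price01 == 0, v]) < min(train[train$price01 == 1, v]))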
Let's check the model predictions.
log.pred <- predict(log.fit, test, type = "response") # predict on 0..1 scale
log.class <- ifelse(log.pred > 0.5, 1, 0)
# confusion matrix
table(log.class, test$price01)
##
## log.class 0 1
## 0 19 0
## 1 0 31
# test error rate
mean(log.class != test$price01)
## [1] 0
The prediction is perfect in this case, which is no surprise: price01 was derived directly from House.Price, so the model is effectively handed the answer.
Conclusion
LDA and QDA showed the same test error rate (0.2). Logistic regression showed a slightly higher test error rate (0.24), so the LDA and QDA models are more appropriate in this case.
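For reference, a small summary sketch collecting the test error rates reported above:
# side-by-side summary of the three classifiers
data.frame(method = c("LDA", "QDA", "Logistic regression"),
           test.error = c(0.20, 0.20, 0.24))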