Predicting Expensive Houses

\(\underline{Data \space Set}\)

We will be using a data set containing Boston's housing prices.

library(MASS) 
data(Boston)

The code below will generate test and training sets for the Boston housing data. We will use these sets to estimate logistic models predicting whether a neighborhood has pricey or non-pricey homes.

\(\underline{Creating \space the \space Test \space and \space Training \space Sets}\)

# Create a binary outcome for pricey homes
# medv column: median value of owner-occupied homes in $1000s. If the value is greater than 35, the home is considered pricey; otherwise it is not.
Boston$PriceyHome <- ifelse(Boston$medv > 35, 1, 0)
# convert the Charles River dummy (chas) into a factor
Boston$chas <- factor(Boston$chas)
set.seed(2020)
# Training set is 75% of the data 
trainSize <- 0.75
train_idx <- sample(1:nrow(Boston), size = nrow(Boston) * trainSize, replace=FALSE) 
housing_train <- Boston[train_idx,]
housing_test <- Boston[-train_idx,]
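
As a quick sanity check (not part of the original code), we can confirm the split sizes and the share of pricey homes in each set:

# sanity check: split sizes and the share of pricey homes in each set
nrow(housing_train)            # 379 rows (75% of 506)
nrow(housing_test)             # 127 rows
mean(housing_train$PriceyHome) # share of pricey homes in the training set
mean(housing_test$PriceyHome)  # share of pricey homes in the test set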

\(\underline{Data \space Exploration}\)

Using the summaryBy command (from the doBy package), we will compute the mean of each variable separately for pricey and non-pricey neighborhoods, grouping on the PriceyHome column. Where do pricey homes differ the most from non-pricey homes?

library(doBy)
summaryBy(. ~ PriceyHome, data = housing_train, FUN = mean)
##   PriceyHome crim.mean  zn.mean indus.mean  nox.mean  rm.mean age.mean dis.mean
## 1          0 3.9723238 10.18421  11.526316 0.5553070 6.157643 68.32076 3.842112
## 2          1 0.9861332 23.77027   6.517838 0.5178108 7.541730 61.67568 3.687368
##   rad.mean tax.mean ptratio.mean black.mean lstat.mean medv.mean
## 1 9.672515 416.0322     18.63947   356.1615  13.300058  20.35702
## 2 6.135135 306.3514     16.15676   387.2238   4.701892  43.58108

\(Interpretation\)

We have set 1 as the label for a pricey home and 0 for a non-pricey home. On average, pricey and non-pricey neighborhoods differ the most in the following areas:

  • tax: full-value property-tax rate per $10,000 (non-pricey 416 vs. pricey 306).
  • medv: median value of owner-occupied homes in $1000s (20.4 vs. 43.6).
  • black: proportion of blacks by town (356 vs. 387).
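
"Differ the most" depends on each variable's units (tax is per $10,000 while nox is a fraction), so one hedged way to rank the gaps, using only objects already defined above, is to scale each group-mean difference by the variable's standard deviation:

# scale each group-mean gap by the variable's SD so the gaps are comparable
num_vars <- setdiff(names(housing_train), c("PriceyHome", "chas"))
m <- summaryBy(. ~ PriceyHome, data = housing_train, FUN = mean)
gap <- unlist(m[2, paste0(num_vars, ".mean")]) - unlist(m[1, paste0(num_vars, ".mean")])
sds <- sapply(housing_train[num_vars], sd)
sort(abs(gap / sds), decreasing = TRUE) # largest standardized gaps first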

\(\underline{Investigating \space with \space plots}\)

Now we will investigate with data visualizations (using ggplot2). What conclusions do you draw from these plots?

library(ggplot2)
help(Boston) # variable definitions for the Boston data set
ggplot(housing_train, aes(x = rm, y = crim)) + geom_point(aes(color = factor(PriceyHome))) + labs(title = "Average number of rooms per dwelling (rm)\nvs. per capita crime rate by town (crim)")

ggplot(housing_train, aes(x = age, y = rm)) + geom_point(aes(color = factor(PriceyHome))) + labs(title = "Proportion of owner-occupied units built prior to 1940 (age)\nvs. average number of rooms per dwelling (rm)")

ggplot(housing_train, aes(x = medv, y = ptratio)) + geom_point(aes(color = factor(PriceyHome))) + labs(title = "Median value of owner-occupied homes in $1000s (medv)\nvs. pupil-teacher ratio by town (ptratio)")

\(Interpretation\)

  • Rooms per dwelling (rm) vs. crime rate (crim): among non-pricey homes, the crime rate is positively correlated with rooms (the more rooms in the dwelling, the higher the crime rate), whereas among pricey homes the correlation is negative (more rooms, lower crime rate).
  • Proportion of owner-occupied units built prior to 1940 (age) vs. average number of rooms per dwelling (rm): the proportion of older units does not appear to affect the number of rooms. Most non-pricey homes have between 5 and 7 rooms, while most pricey homes have between 7 and 9 rooms.
  • Median value of owner-occupied homes in $1000s (medv) vs. pupil-teacher ratio by town (ptratio): most non-pricey homes have a pupil-teacher ratio between 15 and 23, while most pricey homes have a ratio of 12-17.5.

\(\underline{Model \space 1: \space Estimating \space a \space Logistic \space Model}\)

Now we will estimate a logistic model with PriceyHome (1 = pricey home, 0 = non-pricey) on the left hand side (dependent variable or outcome) and a Charles River dummy (chas) on the right hand side (independent variable or predictor). How do we interpret the coefficient on chas? What impact does being adjacent to the Charles River have on a home's value?

logit_fit1 <- glm(PriceyHome ~ chas, 
                     family = binomial,
                     data = housing_train)
summary(logit_fit1)
## 
## Call:
## glm(formula = PriceyHome ~ chas, family = binomial, data = housing_train)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.3671     0.1909 -12.397  < 2e-16 ***
## chas1         1.2220     0.4741   2.578  0.00995 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 242.43  on 378  degrees of freedom
## Residual deviance: 236.81  on 377  degrees of freedom
## AIC: 240.81
## 
## Number of Fisher Scoring iterations: 5
# interpretation: exponentiate the coefficients to get odds ratios
exp(logit_fit1$coefficients)
## (Intercept)       chas1 
##    0.093750    3.393939
# Homes adjacent to the Charles River have about 3.39 times the odds of being pricey.

\(Interpretation\)

Model 1 (logit_fit1) is a logistic regression of PriceyHome on the Charles River dummy. The coefficient on chas1 is 1.2220; exponentiating it (e^1.2220 ≈ 3.39) gives the odds ratio, so homes adjacent to the Charles River have roughly 3.39 times the odds of being pricey, i.e. about a 239% increase in the odds.
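
To make the odds ratio concrete, a small illustrative addition (using only the fitted coefficients above) converts the log-odds into predicted probabilities for the two groups:

# plogis() is the logistic function exp(x) / (1 + exp(x))
b <- coef(logit_fit1)
plogis(b["(Intercept)"])              # P(pricey | not on the river), about 0.09
plogis(b["(Intercept)"] + b["chas1"]) # P(pricey | on the river), about 0.24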

\(\underline{Model \space 2: \space Estimating \space a \space Logistic \space Model}\)

Estimate another model predicting whether a home is pricey as a function of chas, crim, lstat, ptratio, indus, zn, rm, tax, rad and nox. How do we interpret the magnitude of the coefficient on chas? What do you conclude now about the amenity impact of living close to the Charles River?

# Model 2: PriceyHome as a function of chas, crim, lstat, ptratio, indus, zn, rm, tax, rad, nox
logit_fit2 <- glm(PriceyHome ~ chas + crim + lstat + ptratio + indus + zn + rm + tax + rad + nox, 
                  family = binomial,
                  data = housing_train)
summary(logit_fit2)
## 
## Call:
## glm(formula = PriceyHome ~ chas + crim + lstat + ptratio + indus + 
##     zn + rm + tax + rad + nox, family = binomial, data = housing_train)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -7.812670   7.571636  -1.032 0.302150    
## chas1        0.182512   0.901390   0.202 0.839542    
## crim         0.041776   0.044905   0.930 0.352210    
## lstat       -0.538425   0.163219  -3.299 0.000971 ***
## ptratio     -0.379119   0.205944  -1.841 0.065639 .  
## indus       -0.027751   0.107689  -0.258 0.796641    
## zn          -0.008656   0.012261  -0.706 0.480214    
## rm           2.108356   0.628140   3.357 0.000789 ***
## tax         -0.012023   0.006158  -1.952 0.050904 .  
## rad          0.314537   0.128437   2.449 0.014327 *  
## nox          7.454055   5.797804   1.286 0.198559    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 242.434  on 378  degrees of freedom
## Residual deviance:  84.571  on 368  degrees of freedom
## AIC: 106.57
## 
## Number of Fisher Scoring iterations: 9
# interpretation: exponentiate the coefficients to get odds ratios
exp(logit_fit2$coefficients)
##  (Intercept)        chas1         crim        lstat      ptratio        indus 
## 4.045763e-04 1.200229e+00 1.042661e+00 5.836666e-01 6.844645e-01 9.726304e-01 
##           zn           rm          tax          rad          nox 
## 9.913817e-01 8.234691e+00 9.880489e-01 1.369625e+00 1.726852e+03

\(Interpretation\)

In Model 2 the value printed by exp() for chas1 is already the odds ratio: exp(0.1825) ≈ 1.20. Once we control for the other neighborhood characteristics, living next to the Charles River raises the odds of a home being pricey by only about 20%, and the coefficient is no longer statistically significant (p ≈ 0.84). Much of the apparent river amenity in Model 1 was evidently proxying for other neighborhood features.
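
Since the chas coefficient is far from significant, it helps to see the uncertainty around its odds ratio. A minimal sketch using Wald intervals via confint.default():

# odds ratio for chas1 with a 95% Wald confidence interval
exp(cbind(OR = coef(logit_fit2), confint.default(logit_fit2)))["chas1", ]
# the interval is wide and spans 1, so no river effect can be ruled in or out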

\(\underline{Generating \space Probability \space Scores}\)

Using the predict function, generate probability scores and class predictions (using a cutoff of 0.5) in the test and training sets.

# Here the same specification is refit on the test set and scored in-sample;
# see the note below for the more standard out-of-sample approach.
logit_fit3 <- glm(PriceyHome ~ chas + crim + lstat + ptratio + indus + zn + rm + tax + rad + nox, 
                  family = binomial, 
                  data = housing_test)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
preds_df_train <- data.frame(scores_model2 = predict(logit_fit2, type = "response"), 
                             data = housing_train)

preds_df_test <- data.frame(scores_model3test = predict(logit_fit3, type = "response"), 
                            data = housing_test)
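
Note that refitting the model on the test set and scoring it in-sample is unusual; the standard out-of-sample check scores the held-out test set with the model estimated on the training data. A minimal sketch (test_scores and test_class are new names introduced here; the rest of the section follows the original approach):

# score the held-out test set with the training-set model
test_scores <- predict(logit_fit2, newdata = housing_test, type = "response")
test_class <- ifelse(test_scores > 0.5, 1, 0)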

\(\underline{Calculating \space Confusion \space Matrices}\)

Calculate confusion matrices for the test and training sets. Calculate the accuracy, true positives, true negatives, sensitivity, specificity and false positive rate of the model in the test and training set.

# class predictions at a 0.5 cutoff
preds_DFM2 <- data.frame(classM2_preds05 = ifelse(preds_df_train$scores_model2 > 0.5, 1, 0),
                         preds_df_train)
preds_DFM2_test <- data.frame(classM2t_preds05 = ifelse(preds_df_test$scores_model3test > 0.5, 1, 0),
                              preds_df_test)

# confusion matrices via table (rows = predicted class, columns = actual class)
table(preds_DFM2$classM2_preds05,preds_DFM2$data.PriceyHome)
##    
##       0   1
##   0 339  11
##   1   3  26
table(preds_DFM2_test$classM2t_preds05, preds_DFM2_test$data.PriceyHome)
##    
##       0   1
##   0 114   4
##   1   2   7
# calculate sensitivity and specificity for Model 2, training set
# (rows of the tables above are predictions, columns are actual classes,
# so for the training set TP = 26, TN = 339, FP = 3, FN = 11)
# sensitivity: TP / (TP + FN)
(sen <- 26/(26 + 11))
## [1] 0.7027027
# specificity: TN / (TN + FP)
(spe <- 339/(339 + 3))
## [1] 0.9912281
# calculate sensitivity and specificity for the test set
# (TP = 7, TN = 114, FP = 2, FN = 4)
# sensitivity: TP / (TP + FN)
(sent <- 7/(7 + 4))
## [1] 0.6363636
# specificity: TN / (TN + FP)
(spet <- 114/(114 + 2))
## [1] 0.9827586
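
The exercise also asks for accuracy and the false positive rate; both follow directly from the confusion matrices above (expected values shown as comments rather than run output):

# accuracy: (TP + TN) / total
(acc_train <- (26 + 339) / (26 + 339 + 3 + 11)) # about 0.963
(acc_test <- (7 + 114) / (7 + 114 + 2 + 4))     # about 0.953
# false positive rate: FP / (FP + TN) = 1 - specificity
(fpr_train <- 3 / (3 + 339))                    # about 0.009
(fpr_test <- 2 / (2 + 114))                     # about 0.017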

\(\underline{Question}\)

Based on the accuracy scores above, how should we adjust the probability cutoff? Are there any other factors we should consider when adjusting this cutoff?

\(Interpretation\)

We should lower the cutoff: at 0.5 the model misses 11 of the 37 pricey homes in the training set (sensitivity ≈ 0.70) even though the false positive rate is very low, so a lower cutoff would trade a little specificity for higher sensitivity. We should also consider the class imbalance (pricey homes are rare) and the relative costs of false positives versus false negatives. The ROC curve in the next section makes this trade-off explicit.

\(\underline{Plot \space ROC \space Curve}\)

Plot the ROC curve for the test and training sets.
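
The plotting code was not echoed in the rendered output. Given the calc_auc() calls in the next section and the m and d aesthetics named in the warnings, the curves were presumably drawn with the plotROC package; a minimal sketch of how the mod2_roc and mod3test_roc objects used below could be built is:

# ROC curves with plotROC: m = predicted score, d = true class
library(plotROC)
mod2_roc <- ggplot(preds_DFM2, aes(m = scores_model2, d = data.PriceyHome)) +
  geom_roc() + labs(title = "ROC curve: Model 2, training set")
mod2_roc
mod3test_roc <- ggplot(preds_DFM2_test, aes(m = scores_model3test, d = data.PriceyHome)) +
  geom_roc() + labs(title = "ROC curve: test-set model")
mod3test_roc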

(The two ROC curves render here. ggplot2 warns that the m and d aesthetics are dropped during the statistical transformation; this warning is expected output from geom_roc.)

\(\underline{Calculating \space the \space AUC}\)

Calculate the AUC for the models. What do you conclude regarding whether the model is over or underfit? How would you recommend adjusting the model if it is over or underfit?

# calculate AUC (area under the ROC curve) for both models
# AUC close to 1 is the best case; 0.5 (the 45-degree line) means no discrimination
calc_auc(mod2_roc)
##   PANEL group       AUC
## 1     1    -1 0.9747116
calc_auc(mod3test_roc)
##   PANEL group       AUC
## 1     1    -1 0.9835423
# higher AUC -> better

\(Interpretation\)

Both AUCs are close to 1, which is the best case. The AUC on the test data (0.984) is as high as the training AUC (0.975), so there is no evidence that the model is overfit, and no adjustment seems necessary. One caveat: mod3test_roc comes from a model refit on the test set, so this comparison is generous; a stricter check scores the test set with the training-set model, as sketched below.
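
Because mod3test_roc comes from a model refit on the test data itself, a stricter overfitting check computes the AUC of the training-set model on the held-out test set (a sketch reusing the test_scores object introduced in the prediction section):

# out-of-sample ROC/AUC: training-set model scored on the test set
oos_df <- data.frame(score = test_scores, truth = housing_test$PriceyHome)
oos_roc <- ggplot(oos_df, aes(m = score, d = truth)) + geom_roc()
calc_auc(oos_roc)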