We will be using a data set containing Boston housing prices.
library(MASS)
data(Boston)
The code below will generate training and test sets for the Boston housing data. We will use these sets to estimate logistic models predicting whether a neighborhood has pricey or non-pricey homes.
# Create a binary outcome for pricey home
# medv column: median value of owner-occupied homes in $1000s. If medv is
# greater than 35, the home is considered pricey; otherwise it is not.
Boston$PriceyHome <- ifelse(Boston$medv > 35, 1, 0)
# Convert chas (the Charles River dummy) into a factor
Boston$chas <- factor(Boston$chas)
set.seed(2020)
# Training set is 75% of the data
trainSize <- 0.75
train_idx <- sample(1:nrow(Boston), size = nrow(Boston) * trainSize, replace=FALSE)
housing_train <- Boston[train_idx,]
housing_test <- Boston[-train_idx,]
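Note that nrow(Boston) * trainSize = 506 * 0.75 = 379.5 and sample() truncates its size argument, so the split is 379 training rows and 127 test rows (the degrees of freedom in the model summaries below are consistent with this). A quick sanity check:
# Confirm the 75/25 split sizes (506 rows in total).
nrow(housing_train)  # 379
nrow(housing_test)   # 127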
Using the summaryBy command (from the doBy package), we will generate the average neighborhood differences between pricey and non-pricey homes, grouping on the PriceyHome column. Where do pricey homes differ the most from non-pricey homes?
library(doBy)
summaryBy(. ~ PriceyHome, data = housing_train, FUN = mean)
## PriceyHome crim.mean zn.mean indus.mean nox.mean rm.mean age.mean dis.mean
## 1 0 3.9723238 10.18421 11.526316 0.5553070 6.157643 68.32076 3.842112
## 2 1 0.9861332 23.77027 6.517838 0.5178108 7.541730 61.67568 3.687368
## rad.mean tax.mean ptratio.mean black.mean lstat.mean medv.mean
## 1 9.672515 416.0322 18.63947 356.1615 13.300058 20.35702
## 2 6.135135 306.3514 16.15676 387.2238 4.701892 43.58108
We have set 1 as the label for a pricey home and 0 as the label for a non-pricey home. In the training set, pricey and non-pricey homes differ the most in the following areas:

- tax: full-value property-tax rate per $10,000 (306 vs. 416).
- medv: median value of owner-occupied homes in $1000s (43.6 vs. 20.4); this gap holds by construction, since PriceyHome is defined from medv.
- black: proportion of blacks by town (387 vs. 356).
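Because these columns are on very different scales (tax is in the hundreds while nox is below 1), raw mean gaps are not directly comparable. A minimal, illustrative sketch, not part of the original analysis, ranks the gaps in standard-deviation units instead:
# Absolute mean gap between pricey and non-pricey homes, measured in
# standard deviations of each variable (PriceyHome and medv separate
# by construction, so they dominate this ranking).
num_vars <- names(housing_train)[sapply(housing_train, is.numeric)]
std_gap <- sapply(housing_train[num_vars], function(x) {
  abs(mean(x[housing_train$PriceyHome == 1]) -
      mean(x[housing_train$PriceyHome == 0])) / sd(x)
})
sort(std_gap, decreasing = TRUE)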
Now we will investigate with data visualizations (using ggplot2). What conclusions do we draw from these plots?
library(ggplot2)
help(Boston)
ggplot(housing_train, aes(x = rm, y = crim)) +
  geom_point(aes(color = factor(PriceyHome))) +  # factor() gives a discrete two-class legend
  labs(title = "Average number of rooms per dwelling (rm)\nvs. per capita crime rate by town (crim)")
ggplot(housing_train, aes(x = age, y = rm)) +
  geom_point(aes(color = factor(PriceyHome))) +
  labs(title = "Proportion of owner-occupied units built prior to 1940 (age)\nvs. average number of rooms per dwelling (rm)")
ggplot(housing_train, aes(x = medv, y = ptratio)) +
  geom_point(aes(color = factor(PriceyHome))) +
  labs(title = "Median value of owner-occupied homes in $1000s (medv)\nvs. pupil-teacher ratio by town (ptratio)")
The plots suggest that pricey homes cluster in neighborhoods with more rooms per dwelling and low crime rates, appear across the full range of building ages, and tend to lie in towns with lower pupil-teacher ratios.
Now we will estimate a logistic model with PriceyHome (1 = pricey home, 0 = non-pricey) on the left-hand side (dependent variable or outcome) and a Charles River dummy (chas) on the right-hand side (independent variable or predictor). How do we interpret the coefficient on chas? What impact does being adjacent to the Charles River have on a home's value?
logit_fit1 <- glm(PriceyHome ~ chas,
family = binomial,
data = housing_train)
summary(logit_fit1)
##
## Call:
## glm(formula = PriceyHome ~ chas, family = binomial, data = housing_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3671 0.1909 -12.397 < 2e-16 ***
## chas1 1.2220 0.4741 2.578 0.00995 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 242.43 on 378 degrees of freedom
## Residual deviance: 236.81 on 377 degrees of freedom
## AIC: 240.81
##
## Number of Fisher Scoring iterations: 5
# interpretation
exp(logit_fit1$coefficients)
## (Intercept) chas1
## 0.093750 3.393939
# Homes adjacent to the Charles River have about 3.39 times the odds of being pricey.
Model 1 (logit_fit1) is a logistic model of pricey homes on the Charles River dummy, and it gives a coefficient of 1.2220 for chas1. We interpret this coefficient by exponentiating it: e^1.2220 ≈ 3.39, an odds ratio. Homes adjacent to the Charles River have about 3.39 times the odds of being pricey as homes that are not, i.e., the odds are roughly 239% higher.
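To put these coefficients on the probability scale, a minimal sketch using the intercept and slope from the summary above converts the fitted log-odds into predicted probabilities for each level of chas:
# plogis() is the inverse logit, exp(x) / (1 + exp(x)).
plogis(coef(logit_fit1)[1])                        # P(pricey | chas = 0), about 0.086
plogis(coef(logit_fit1)[1] + coef(logit_fit1)[2])  # P(pricey | chas = 1), about 0.241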
Estimate another model predicting whether a home is pricey as a function of chas, crim, lstat, ptratio, indus, zn, rm, tax, rad and nox. How do we interpret the magnitude of the coefficient for chas? What do you conclude now about the amenity impact of living close to the Charles River?
# Model 2: PriceyHome as a function of chas, crim, lstat, ptratio, indus, zn, rm, tax, rad, nox
logit_fit2 <- glm(PriceyHome ~ chas + crim + lstat + ptratio + indus + zn + rm + tax + rad + nox,
family = binomial,
data = housing_train)
summary(logit_fit2)
##
## Call:
## glm(formula = PriceyHome ~ chas + crim + lstat + ptratio + indus +
## zn + rm + tax + rad + nox, family = binomial, data = housing_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.812670 7.571636 -1.032 0.302150
## chas1 0.182512 0.901390 0.202 0.839542
## crim 0.041776 0.044905 0.930 0.352210
## lstat -0.538425 0.163219 -3.299 0.000971 ***
## ptratio -0.379119 0.205944 -1.841 0.065639 .
## indus -0.027751 0.107689 -0.258 0.796641
## zn -0.008656 0.012261 -0.706 0.480214
## rm 2.108356 0.628140 3.357 0.000789 ***
## tax -0.012023 0.006158 -1.952 0.050904 .
## rad 0.314537 0.128437 2.449 0.014327 *
## nox 7.454055 5.797804 1.286 0.198559
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 242.434 on 378 degrees of freedom
## Residual deviance: 84.571 on 368 degrees of freedom
## AIC: 106.57
##
## Number of Fisher Scoring iterations: 9
# interpretation
exp(logit_fit2$coefficients)
## (Intercept) chas1 crim lstat ptratio indus
## 4.045763e-04 1.200229e+00 1.042661e+00 5.836666e-01 6.844645e-01 9.726304e-01
## zn rm tax rad nox
## 9.913817e-01 8.234691e+00 9.880489e-01 1.369625e+00 1.726852e+03
In model 2 the exponentiated coefficient for chas1 is 1.200229, shown directly in the exp() output above (this value is already an odds ratio, so it should not be exponentiated a second time). Controlling for the other neighborhood characteristics shrinks the odds ratio for river adjacency from about 3.39 to about 1.20, and the coefficient is no longer statistically significant (p ≈ 0.84). This suggests that most of the apparent Charles River amenity effect in model 1 reflected other features of riverside neighborhoods rather than the river itself.
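As a hedged check on this conclusion, Wald confidence intervals for the chas1 odds ratio can be computed from both fitted models (confint.default gives Wald intervals; the profile-likelihood confint would also work):
# 95% Wald CIs for the odds ratio on chas1.
exp(confint.default(logit_fit1)["chas1", ])  # excludes 1, consistent with p = 0.00995
exp(confint.default(logit_fit2)["chas1", ])  # straddles 1, consistent with p = 0.84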
Using the predict function, generate probability scores and class predictions (using a cutoff of 0.5) in the training and test sets.
# Here a model with the same specification is refit on the test set to
# generate test-set scores; the more standard approach reuses the training
# model with newdata (see the sketch below).
logit_fit3 <- glm(PriceyHome ~ chas + crim + lstat + ptratio + indus + zn + rm + tax + rad + nox,
family = binomial,
data = housing_test)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
preds_df_train <- data.frame(scores_model2 = predict(logit_fit2, type = "response"),
data = housing_train)
preds_df_test <- data.frame(scores_model3test = predict(logit_fit3, type = "response"),
data = housing_test)
Calculate confusion matrices for the test and training sets. Calculate the accuracy, true positives, true negatives, sensitivity, specificity and false positive rate of the model in the test and training set.
# create class predictions at the 0.5 cutoff for the confusion matrices
preds_DFM2 <- data.frame(classM2_preds05 = ifelse(preds_df_train$scores_model2 > 0.5, 1, 0),
preds_df_train)
preds_DFM2_test <- data.frame(classM2t_preds05 = ifelse(preds_df_test$scores_model3test > 0.5, 1, 0),
preds_df_test)
# use table for confusion matrix
table(preds_DFM2$classM2_preds05,preds_DFM2$data.PriceyHome)
##
## 0 1
## 0 339 11
## 1 3 26
table(preds_DFM2_test$classM2t_preds05, preds_DFM2_test$data.PriceyHome)
##
## 0 1
## 0 114 4
## 1 2 7
# calculate sensitivity and specificity for Model 2, training set
# sensitivity: TP / (TP + FN)
(sen <- 26/(26 + 11))
## [1] 0.7027027
# specificity: TN / (TN + FP)
(spe <- 339/(339 + 3))
## [1] 0.9912281
# calculate sensitivity and specificity for the test set
# sensitivity: TP / (TP + FN)
(sent <- 7/(7 + 4))
## [1] 0.6363636
# specificity: TN / (TN + FP)
(spet <- 114/(114 + 2))
## [1] 0.9827586
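The question also asks for accuracy, the true positive and true negative counts, and the false positive rate; these follow directly from the confusion matrices above:
# Training set: TP = 26, TN = 339, FP = 3, FN = 11
(acc_train <- (339 + 26) / (339 + 26 + 3 + 11))  # accuracy, about 0.963
(fpr_train <- 3 / (3 + 339))                     # FPR = 1 - specificity, about 0.009
# Test set: TP = 7, TN = 114, FP = 2, FN = 4
(acc_test <- (114 + 7) / (114 + 7 + 2 + 4))      # accuracy, about 0.953
(fpr_test <- 2 / (2 + 114))                      # FPR, about 0.017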
Based on the accuracy scores above, how should we adjust the probability cutoff? Are there any other factors we should consider when adjusting this cutoff?
We should lower the cutoff. At 0.5 the model misses a substantial share of pricey homes (sensitivity is 0.70 in the training set and 0.64 in the test set) even though specificity is very high, so trading a few extra false positives for higher sensitivity looks worthwhile, as the ROC curves in the next section show. We should also weigh the class imbalance (only about 10% of neighborhoods are pricey) and the relative costs of false positives versus false negatives.
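As an illustration (the 0.3 cutoff here is an arbitrary choice, not one tuned on these data), the training-set confusion matrix can be recomputed at a lower cutoff:
# Re-classify the training scores at a lower cutoff to raise sensitivity.
table(predicted = ifelse(preds_df_train$scores_model2 > 0.3, 1, 0),
      actual = preds_df_train$data.PriceyHome)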
Plot the ROC curve for the test and training sets.
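The chunk that produced these curves is not echoed in the output; a minimal sketch of how the mod2_roc and mod3test_roc objects referenced below could be built, assuming the plotROC package (consistent with the calc_auc calls that follow):
library(plotROC)
# ROC curve for the training-set scores
mod2_roc <- ggplot(preds_DFM2, aes(m = scores_model2, d = data.PriceyHome)) +
  geom_roc() +
  labs(title = "ROC curve, training set")
mod2_roc
# ROC curve for the test-set scores
mod3test_roc <- ggplot(preds_DFM2_test, aes(m = scores_model3test, d = data.PriceyHome)) +
  geom_roc() +
  labs(title = "ROC curve, test set")
mod3test_roc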
(ROC curve plots for the training and test sets appear here.)
Calculate the AUC for the models. What do you conclude regarding whether the model is over or underfit? How would you recommend adjusting the model if it is over or underfit?
# calculate AUC (area under the curve) for both models
calc_auc(mod2_roc)  # best-case scenario is an AUC close to 1
## PANEL group AUC
## 1 1 -1 0.9747116
# an AUC of 0.5 (the 45-degree line) is no better than chance
calc_auc(mod3test_roc)
## PANEL group AUC
## 1 1 -1 0.9835423
# higher AUC -> better discrimination
Both AUCs are close to 1 (0.975 in the training set and 0.984 in the test set), which is close to the best possible outcome. Test-set performance is at least as good as training-set performance, so there is no sign of overfitting and no adjustment to the model seems necessary (though note that the test-set AUC comes from the model refit on the test data, so this comparison is only suggestive).