For this assignment, we were tasked with building a binary logistic regression model from a dataset containing information on crime in various neighborhoods of a major city. Given a vector of predictors, we seek to predict whether the neighborhood crime rate is above the median.
The dataset is composed of 466 observations and 12 predictor variables. The response variable target is binary (0 or 1). A quick look at the distribution of the training dataset reveals some skewed predictors. All observations in this dataset are complete.
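The summary below was produced with the skimr package; a minimal sketch of the call, assuming data_train holds the training set:

library(skimr)
# Summary statistics (missingness, quantiles, and histograms) for every column
skim(data_train)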
| Name | data_train |
|---|---|
| Number of rows | 466 |
| Number of columns | 13 |
| Column type frequency (numeric) | 13 |
| Group variables | None |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
zn | 0 | 1 | 11.58 | 23.36 | 0.00 | 0.00 | 0.00 | 16.25 | 100.00 | ▇▁▁▁▁ |
indus | 0 | 1 | 11.11 | 6.85 | 0.46 | 5.15 | 9.69 | 18.10 | 27.74 | ▇▆▁▇▁ |
chas | 0 | 1 | 0.07 | 0.26 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
nox | 0 | 1 | 0.55 | 0.12 | 0.39 | 0.45 | 0.54 | 0.62 | 0.87 | ▇▇▅▃▁ |
rm | 0 | 1 | 6.29 | 0.70 | 3.86 | 5.89 | 6.21 | 6.63 | 8.78 | ▁▂▇▂▁ |
age | 0 | 1 | 68.37 | 28.32 | 2.90 | 43.88 | 77.15 | 94.10 | 100.00 | ▂▂▂▃▇ |
dis | 0 | 1 | 3.80 | 2.11 | 1.13 | 2.10 | 3.19 | 5.21 | 12.13 | ▇▅▂▁▁ |
rad | 0 | 1 | 9.53 | 8.69 | 1.00 | 4.00 | 5.00 | 24.00 | 24.00 | ▇▂▁▁▃ |
tax | 0 | 1 | 409.50 | 167.90 | 187.00 | 281.00 | 334.50 | 666.00 | 711.00 | ▇▇▅▁▇ |
ptratio | 0 | 1 | 18.40 | 2.20 | 12.60 | 16.90 | 18.90 | 20.20 | 22.00 | ▁▃▅▅▇ |
lstat | 0 | 1 | 12.63 | 7.10 | 1.73 | 7.04 | 11.35 | 16.93 | 37.97 | ▇▇▅▂▁ |
medv | 0 | 1 | 22.59 | 9.24 | 5.00 | 17.02 | 21.20 | 25.00 | 50.00 | ▂▇▅▁▁ |
target | 0 | 1 | 0.49 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
With a closer look at the distribution of the data using density plots, we can identify bimodal distributions for indus, rad and tax, and skew in dis, nox, ptratio and zn.
The density plots of each predictor, and box plots split by the target class, are produced below:
# Density plot of each predictor
data_train %>%
  select(-target) %>%
  gather() %>%
  ggplot(aes(x = value)) +
  geom_density(fill = 'pink') +
  facet_wrap(~key, scales = 'free')
# Box plots of each predictor split by the target class
data_train_long <- gather(data_train, "Variable", "Value", zn:medv)
data_train_long$target <- as.factor(data_train_long$target)
ggplot(data_train_long, aes(x = target, y = Value)) +
  geom_boxplot(varwidth = TRUE, alpha = 0.2, fill = "orange1") +
  facet_wrap(~Variable, scales = "free")
Looking at the correlation plot of our dataset, shown below, we can confirm some of the observations made from the density and box plots.
Highest correlations amongst predictor variables:
- tax|rad (0.91)
- nox|indus (0.76)
- age|nox (0.74)
- tax|indus (0.73)
- medv|rm (0.71)
- dis|indus (-0.70)
- medv|lstat (-0.74)
- dis|age (-0.75)
- dis|nox (-0.77)
# Correlation matrix plot of all variables
q <- cor(data_train)
ggcorrplot(q, type = "upper", outline.color = "white",
           ggtheme = theme_classic,
           colors = c("#6D9EC1", "white", "#E46726"),
           lab = TRUE, show.legend = FALSE, tl.cex = 8, lab_size = 3)
In preparation of the data for the model, we remove the chas variable, as it does not appear to have any impact on target and is not highly correlated with any other predictor. We also convert the target variable from an integer to a factor.
data_train_m2 <- data_train %>% select(-chas)
#convert target to factor
data_train_m2$target <- as.factor(data_train_m2$target)
Before fitting the model, we also check whether the dataset contains any influential points that could have a significant impact on the fit. From the general model below, we identify some outlier observations based on Cook's distance, though not all outlier observations are influential.
# Full logistic model used for influence diagnostics
ilp <- glm(target ~ ., data = data_train_m2, family = binomial(link = "logit"))
# Top 5 outliers by Cook's distance
plot(ilp, which = 4, id.n = 5)
To see whether any of these outlier observations are influential, we plot the standardized residuals and check whether any exceed 3 in absolute value.
augment(ilp) %>%
  mutate(index = 1:n()) %>%
  ggplot(aes(index, .std.resid)) +
  geom_point(aes(color = target)) +
  labs(y = "Standardized Residuals", x = "Observation Index")
From the above plot, one observation (457) has a standardized residual greater than 3, so we remove it from the dataset. We also remove observation 338, whose residual is close to, but not quite above, 3.
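The removal step itself is not echoed above; a minimal sketch, assuming the flagged labels correspond directly to row indices of data_train_m2:

# Drop the two flagged observations (row indices assumed to match the plot labels)
data_train_m2 <- data_train_m2[-c(457, 338), ]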
The models in this section are simply subsets of the full model.
# Initialize a tibble that will store the metrics of the models
models.df <- tibble(id=character(), formula=character(), res.deviance=numeric(), null.deviance=numeric(),
                    aic=numeric(), accuracy=numeric(), sensitivity=numeric(), specificity=numeric(),
                    precision=numeric())
# A function to fit a model, extract the relevant metrics from the summary and confusion matrix,
# and return the fitted model together with a one-row metrics tibble
build_model <- function(id, formula, data) {
  glm.fit <- glm(formula, data = data, family = binomial)
  print(summary(glm.fit))
  glm.probs <- predict(glm.fit, type = "response")
  # Classify using a 0.5 probability threshold
  glm.pred <- ifelse(glm.probs > 0.5, 1, 0)
  results <- tibble(target = data$target, pred = glm.pred) %>%
    mutate(pred.class = as.factor(pred), target.class = as.factor(target))
  # Compute the confusion matrix once and reuse it for all metrics
  cm <- confusionMatrix(results$pred.class, results$target.class, positive = "1")
  acc <- cm$overall['Accuracy']
  sens <- cm$byClass['Sensitivity']
  spec <- cm$byClass['Specificity']
  prec <- cm$byClass['Precision']
  res.deviance <- glm.fit$deviance
  null.deviance <- glm.fit$null.deviance
  aic <- glm.fit$aic
  metrics <- list(res.deviance = res.deviance, null.deviance = null.deviance, aic = aic,
                  accuracy = acc, sensitivity = sens, specificity = spec, precision = prec)
  metrics <- lapply(metrics, round, 3)
  # ROC curve with the AUC printed on the plot
  plot(roc(results$target.class, glm.probs), print.auc = TRUE)
  model.df <- tibble(id = id, formula = formula, res.deviance = metrics$res.deviance,
                     null.deviance = metrics$null.deviance, aic = metrics$aic,
                     accuracy = metrics$accuracy, sensitivity = metrics$sensitivity,
                     specificity = metrics$specificity, precision = metrics$precision)
  model.list <- list(model = glm.fit, df_info = model.df)
  return(model.list)
}
Model 1 starts with the complete model including all predictors and iteratively finds the subset of predictors that are statistically significant.
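The chunk that produced the summaries below was not echoed; a sketch of the backward-elimination calls, with the dropped terms inferred from the coefficient tables that follow and the object names assumed:

# Backward elimination: drop the least significant predictor at each step (object names assumed)
m1_a <- build_model('m1_a', "target ~ .", data = data_train)
m1_b <- build_model('m1_b', "target ~ . - rm", data = data_train)
m1_c <- build_model('m1_c', "target ~ . - rm - chas", data = data_train)
m1_d <- build_model('m1_d', "target ~ . - rm - chas - indus", data = data_train)
m1_e <- build_model('m1_e', "target ~ . - rm - chas - indus - lstat", data = data_train)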
Call:
glm(formula = formula, family = binomial, data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8464 -0.1445 -0.0017 0.0029 3.4665
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -40.822934 6.632913 -6.155 7.53e-10 ***
zn -0.065946 0.034656 -1.903 0.05706 .
indus -0.064614 0.047622 -1.357 0.17485
chas 0.910765 0.755546 1.205 0.22803
nox 49.122297 7.931706 6.193 5.90e-10 ***
rm -0.587488 0.722847 -0.813 0.41637
age 0.034189 0.013814 2.475 0.01333 *
dis 0.738660 0.230275 3.208 0.00134 **
rad 0.666366 0.163152 4.084 4.42e-05 ***
tax -0.006171 0.002955 -2.089 0.03674 *
ptratio 0.402566 0.126627 3.179 0.00148 **
lstat 0.045869 0.054049 0.849 0.39608
medv 0.180824 0.068294 2.648 0.00810 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 645.88 on 465 degrees of freedom
Residual deviance: 192.05 on 453 degrees of freedom
AIC: 218.05
Number of Fisher Scoring iterations: 9
Call:
glm(formula = formula, family = binomial, data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8750 -0.1679 -0.0019 0.0032 3.4391
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -41.649437 6.503781 -6.404 1.51e-10 ***
zn -0.068448 0.034414 -1.989 0.046707 *
indus -0.063406 0.047493 -1.335 0.181854
chas 0.931110 0.757531 1.229 0.219020
nox 48.092158 7.728183 6.223 4.88e-10 ***
age 0.028384 0.011619 2.443 0.014568 *
dis 0.691454 0.218298 3.167 0.001538 **
rad 0.643883 0.159204 4.044 5.25e-05 ***
tax -0.006310 0.002932 -2.152 0.031399 *
ptratio 0.361822 0.113913 3.176 0.001492 **
lstat 0.062518 0.049592 1.261 0.207437
medv 0.137881 0.041423 3.329 0.000873 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 645.88 on 465 degrees of freedom
Residual deviance: 192.71 on 454 degrees of freedom
AIC: 216.71
Number of Fisher Scoring iterations: 9
Call:
glm(formula = formula, family = binomial, data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8779 -0.1646 -0.0015 0.0030 3.4366
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -40.271093 6.335539 -6.356 2.07e-10 ***
zn -0.074693 0.034472 -2.167 0.030252 *
indus -0.050727 0.045716 -1.110 0.267166
nox 46.245279 7.498336 6.167 6.94e-10 ***
age 0.028882 0.011572 2.496 0.012564 *
dis 0.666663 0.215818 3.089 0.002008 **
rad 0.691985 0.156587 4.419 9.91e-06 ***
tax -0.007032 0.002876 -2.445 0.014490 *
ptratio 0.335525 0.110981 3.023 0.002501 **
lstat 0.070234 0.049073 1.431 0.152367
medv 0.138750 0.041554 3.339 0.000841 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 645.88 on 465 degrees of freedom
Residual deviance: 194.24 on 455 degrees of freedom
AIC: 216.24
Number of Fisher Scoring iterations: 9
Call:
glm(formula = formula, family = binomial, data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9279 -0.1640 -0.0016 0.0027 3.3989
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -38.786168 6.161170 -6.295 3.07e-10 ***
zn -0.072933 0.033022 -2.209 0.027199 *
nox 43.014846 6.689945 6.430 1.28e-10 ***
age 0.028379 0.011401 2.489 0.012809 *
dis 0.658786 0.214787 3.067 0.002161 **
rad 0.739415 0.152375 4.853 1.22e-06 ***
tax -0.008045 0.002651 -3.035 0.002408 **
ptratio 0.334196 0.111780 2.990 0.002792 **
lstat 0.064814 0.048427 1.338 0.180769
medv 0.138257 0.041630 3.321 0.000897 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 645.88 on 465 degrees of freedom
Residual deviance: 195.51 on 456 degrees of freedom
AIC: 215.51
Number of Fisher Scoring iterations: 9
Call:
glm(formula = formula, family = binomial, data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8295 -0.1752 -0.0021 0.0032 3.4191
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -37.415922 6.035013 -6.200 5.65e-10 ***
zn -0.068648 0.032019 -2.144 0.03203 *
nox 42.807768 6.678692 6.410 1.46e-10 ***
age 0.032950 0.010951 3.009 0.00262 **
dis 0.654896 0.214050 3.060 0.00222 **
rad 0.725109 0.149788 4.841 1.29e-06 ***
tax -0.007756 0.002653 -2.924 0.00346 **
ptratio 0.323628 0.111390 2.905 0.00367 **
medv 0.110472 0.035445 3.117 0.00183 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 645.88 on 465 degrees of freedom
Residual deviance: 197.32 on 457 degrees of freedom
AIC: 215.32
Number of Fisher Scoring iterations: 9
Here is a summary of the models built in this section.
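A sketch of how such a summary table could be assembled from the df_info elements returned by build_model (object names assumed):

# Collect the per-model metrics into the summary table (object names assumed)
models.df <- bind_rows(models.df, m1_a$df_info, m1_b$df_info, m1_c$df_info, m1_d$df_info, m1_e$df_info)
knitr::kable(models.df)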
Model 2 is fit on the prepared dataset data_train_m2 (chas removed and the two influential observations dropped) and has the below characteristics:
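The fitting call is not echoed; a minimal sketch, with the object name m2_a assumed:

# Full Model 2 on the prepared data (object name assumed)
m2_a <- build_model('m2_a', "target ~ .", data = data_train_m2)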
Call:
glm(formula = formula, family = binomial, data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.91120 -0.07794 -0.00019 0.00068 2.83800
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -48.713108 7.885103 -6.178 6.50e-10 ***
zn -0.101575 0.041487 -2.448 0.014352 *
indus -0.070315 0.050020 -1.406 0.159802
nox 58.567669 9.306145 6.293 3.11e-10 ***
rm -1.059071 0.803827 -1.318 0.187659
age 0.061760 0.016582 3.725 0.000196 ***
dis 1.056492 0.274292 3.852 0.000117 ***
rad 0.865495 0.181865 4.759 1.95e-06 ***
tax -0.007048 0.003184 -2.214 0.026856 *
ptratio 0.483564 0.141717 3.412 0.000644 ***
lstat -0.015129 0.056362 -0.268 0.788376
medv 0.235034 0.078814 2.982 0.002862 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 643.03 on 463 degrees of freedom
Residual deviance: 169.95 on 452 degrees of freedom
AIC: 193.95
Number of Fisher Scoring iterations: 9
zn indus nox rm age dis rad tax
2.004175 2.496646 4.715557 5.641760 3.043744 4.789220 2.071828 2.075384
ptratio lstat medv
2.574732 2.459107 8.961585
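The VIF values above (and those reported later) were presumably computed with car::vif on the fitted model; a minimal sketch, assuming the car package and the object name above:

# Variance inflation factors for the full Model 2 fit (object name assumed)
car::vif(m2_a$model)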
The above model indicates that the predictor medv can be problematic: medv is highly correlated with lstat and rm, both of which are not significant in the model, while medv itself is significant. Removing lstat and rm from the model addresses the collinearity issue, with all VIF values falling under 5.
# remove rm and lstat from the model
m2_b <- build_model('m2_b', "target ~ . - rm - lstat", data = data_train_m2)
Call:
glm(formula = formula, family = binomial, data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.91283 -0.09883 -0.00026 0.00102 2.90788
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -49.291862 7.725211 -6.381 1.76e-10 ***
zn -0.102530 0.040852 -2.510 0.012080 *
indus -0.066558 0.049644 -1.341 0.180015
nox 55.998677 8.903666 6.289 3.19e-10 ***
age 0.051060 0.012897 3.959 7.52e-05 ***
dis 0.953861 0.252702 3.775 0.000160 ***
rad 0.818482 0.175225 4.671 3.00e-06 ***
tax -0.007207 0.003166 -2.276 0.022836 *
ptratio 0.405181 0.123433 3.283 0.001029 **
medv 0.149231 0.041416 3.603 0.000314 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 643.03 on 463 degrees of freedom
Residual deviance: 171.78 on 454 degrees of freedom
AIC: 191.78
Number of Fisher Scoring iterations: 9
zn indus nox age dis rad tax ptratio
1.932216 2.461747 4.305506 1.906550 4.100342 1.928789 2.030718 1.994571
medv
2.466408
Backward elimination produces the final model:
# continue backward elimination: also drop indus
m2_c <- build_model('m2_c', "target ~ . - rm - lstat - indus", data = data_train_m2)
Call:
glm(formula = formula, family = binomial, data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.92152 -0.11310 -0.00030 0.00076 2.90631
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -47.265354 7.511377 -6.293 3.12e-10 ***
zn -0.099818 0.038990 -2.560 0.010465 *
nox 51.545583 8.013193 6.433 1.25e-10 ***
age 0.049284 0.012664 3.892 9.96e-05 ***
dis 0.938995 0.250466 3.749 0.000178 ***
rad 0.877730 0.171836 5.108 3.26e-07 ***
tax -0.008424 0.002841 -2.965 0.003029 **
ptratio 0.402108 0.124753 3.223 0.001267 **
medv 0.151465 0.041492 3.650 0.000262 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 643.03 on 463 degrees of freedom
Residual deviance: 173.70 on 455 degrees of freedom
AIC: 191.7
Number of Fisher Scoring iterations: 9
Here we add log transformations of the predictors with significant skew to the base model and iteratively eliminate features that are not statistically significant.
# Initialize a tibble that will store the metrics of the transformed models
trans_models.df <- tibble(id=character(), formula=character(), res.deviance=numeric(), null.deviance=numeric(),
                          aic=numeric(), accuracy=numeric(), sensitivity=numeric(), specificity=numeric(),
                          precision=numeric())
We also remove the variables identified earlier as having low correlation with the target or introducing significant multicollinearity.
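The fitting chunk is not echoed; a sketch of the first transformed model, with the formula inferred from the coefficient table below and the object name mt1 assumed:

# Base predictors (without chas, rm, lstat) plus log transformations of dis and nox (formula inferred)
mt1 <- build_model('mt1', "target ~ . - chas - rm - lstat + log(dis) + log(nox)", data = data_train)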
Call:
glm(formula = formula, family = binomial, data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9702 -0.2420 -0.0019 0.0004 3.1989
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -3.447e+02 9.629e+01 -3.580 0.000343 ***
zn -6.386e-02 3.319e-02 -1.924 0.054365 .
indus 2.246e-03 6.007e-02 0.037 0.970181
nox 3.904e+02 1.109e+02 3.522 0.000428 ***
age 4.388e-02 1.206e-02 3.637 0.000275 ***
dis -2.965e+00 8.991e-01 -3.298 0.000972 ***
rad 8.151e-01 1.707e-01 4.775 1.80e-06 ***
tax -6.786e-03 3.136e-03 -2.164 0.030493 *
ptratio 4.289e-01 1.299e-01 3.302 0.000962 ***
medv 1.260e-01 3.962e-02 3.180 0.001471 **
log(dis) 1.543e+01 3.816e+00 4.044 5.25e-05 ***
log(nox) -1.778e+02 5.633e+01 -3.156 0.001601 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 645.88 on 465 degrees of freedom
Residual deviance: 176.19 on 454 degrees of freedom
AIC: 200.19
Number of Fisher Scoring iterations: 9
We see that the transformed predictors that were introduced are collinear with the original predictors. To deal with this issue we remove the original predictors and proceed.
zn indus nox age dis rad tax
1.935681 3.682769 793.082457 1.830976 45.039191 1.773397 2.181768
ptratio medv log(dis) log(nox)
1.958810 2.605745 57.709494 743.376530
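The subsequent elimination steps are likewise not echoed; a sketch, with the formulas inferred from the summaries below and the object names assumed:

# Drop the original nox and dis (collinear with their log transforms), then indus, then zn (object names assumed)
mt2 <- build_model('mt2', "target ~ . - chas - rm - lstat - nox - dis + log(dis) + log(nox)", data = data_train)
mt3 <- build_model('mt3', "target ~ . - chas - rm - lstat - nox - dis - indus + log(dis) + log(nox)", data = data_train)
mt4 <- build_model('mt4', "target ~ . - chas - rm - lstat - nox - dis - indus - zn + log(dis) + log(nox)", data = data_train)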
Call:
glm(formula = formula, family = binomial, data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7852 -0.1535 -0.0023 0.0040 3.4072
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.635575 3.473533 -0.471 0.637735
zn -0.045662 0.028819 -1.584 0.113098
indus -0.007702 0.044940 -0.171 0.863930
age 0.035947 0.011302 3.181 0.001470 **
rad 0.682013 0.158560 4.301 1.70e-05 ***
tax -0.005718 0.002838 -2.015 0.043929 *
ptratio 0.354516 0.112526 3.151 0.001630 **
medv 0.128618 0.037983 3.386 0.000709 ***
log(dis) 3.201847 0.875133 3.659 0.000254 ***
log(nox) 25.467685 3.985613 6.390 1.66e-10 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 645.88 on 465 degrees of freedom
Residual deviance: 193.11 on 456 degrees of freedom
AIC: 213.11
Number of Fisher Scoring iterations: 9
Call:
glm(formula = formula, family = binomial, data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8024 -0.1526 -0.0022 0.0037 3.4053
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.850196 3.234215 -0.572 0.567275
zn -0.045716 0.028709 -1.592 0.111298
age 0.035896 0.011292 3.179 0.001478 **
rad 0.689837 0.152274 4.530 5.89e-06 ***
tax -0.005869 0.002686 -2.185 0.028890 *
ptratio 0.354632 0.112751 3.145 0.001659 **
medv 0.129441 0.037771 3.427 0.000610 ***
log(dis) 3.226553 0.864740 3.731 0.000191 ***
log(nox) 25.305601 3.862591 6.551 5.70e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 645.88 on 465 degrees of freedom
Residual deviance: 193.14 on 457 degrees of freedom
AIC: 211.14
Number of Fisher Scoring iterations: 9
Call:
glm(formula = formula, family = binomial, data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.7287 -0.1640 -0.0074 0.0044 3.2107
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.873364 3.250243 -0.576 0.564361
age 0.036544 0.011193 3.265 0.001095 **
rad 0.675544 0.144596 4.672 2.98e-06 ***
tax -0.006190 0.002579 -2.400 0.016407 *
ptratio 0.400345 0.109666 3.651 0.000262 ***
medv 0.123227 0.037168 3.315 0.000915 ***
log(dis) 2.931249 0.826099 3.548 0.000388 ***
log(nox) 25.875044 3.876728 6.674 2.48e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 645.88 on 465 degrees of freedom
Residual deviance: 196.15 on 458 degrees of freedom
AIC: 212.15
Number of Fisher Scoring iterations: 9
When comparing the full model to its subsets, we see that residual deviance increases slightly as predictors are dropped, so the smaller models explain slightly less of the deviance. However, AIC decreases, meaning that less information is lost when using the smaller models. There is a minor accuracy penalty of 0.04% on the testing data when using the smaller model, which is acceptable.
However, the models that used additional transformed features had different results compared to the observations above. This time, both residual deviance and AIC seem to increase with smaller models.
However, we see that mt1, which added the log transformations of the predictors nox and dis, has the highest accuracy (93.1%) and the lowest AIC (200.19) of the models tested so far. In addition, model mt1 delivers 1.5% more accuracy than the best model subset without transformations. While similar in accuracy to models mt2 and mt3, mt1 has the highest specificity and precision scores, which outweigh the minor decrease in sensitivity in our criteria for model selection. Even though we are conscious that adding the transformation of a predictor to a model already containing that predictor introduces multicollinearity, mt1 still delivers the best performance.
Our preference for this model is also confirmed by a likelihood ratio test comparing mt1 (the larger model) with mt4 (the smaller model). When comparing a larger model to a smaller one, the null hypothesis is that the extra coefficients in the larger model are 0. In our case, the small p-value suggests that the extra coefficients are not 0 and that we should prefer the larger model.
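The test itself is not echoed; a minimal sketch using lmtest::lrtest (package assumed; anova with test = "Chisq" would work equally well):

# Likelihood ratio test: mt4 (reduced model) vs. mt1 (full transformed model)
lmtest::lrtest(mt4$model, mt1$model)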
Therefore, our selected final model is mt1 which yields an accuracy on the test set of 93.1%.
Taking a look at the coefficients and marginal effects, we can now estimate the relative effect of each predictor on the odds of the target variable, holding the other predictors constant. The predictors nox and log(dis) have large exponents and seem to have a significant impact on the model. Interestingly and counterintuitively, the predictor log(nox) seems to have the largest single effect on whether a neighborhood has a crime rate above the median.
We can now interpret the marginal effects as follows. For example, with each unit increase in distance to employment centers, neighborhoods are 17% less likely to be above the median crime rate. Note that the nox relative likelihood is a very large percentage, but this is due to the scale of the units: nitrogen oxides concentrations are measured in parts per 10 million, so a unit change of 1 in this predictor is very large and an extrapolation beyond the domain of the nox variable. The effect is also modulated by the log(nox) variable, which decreases its impact to some extent.
# Marginal effects
LogitScalar <- mean(dlogis(predict(mt1$model, type="link")))
marg <- LogitScalar * coef(mt1$model)
marg <- as.data.frame(marg)
marg[] <- lapply(marg, function(x) sprintf("%.4f", x))
marg
Looking at the marginal model plots, we see that for all variables the data follow the expected values of the model, with only a slight, non-significant deviation for indus and age. We can therefore conclude that the model is an adequate fit for the given data.
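The plots referenced above were presumably generated with the car package's marginal model plots; a minimal sketch (assuming car is loaded):

# Marginal model plots: compare the data smooth to the model smooth for each predictor
car::mmps(mt1$model)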
Based on model mt1, below is the prediction distribution for the evaluation dataset; we see that the predictions are fairly evenly split between the two classes of the binary target variable.
data_test_trans <- data_test %>% select(-chas, -rm, -lstat)
# Predict on the probability scale and classify at the 0.5 threshold
test_predict <- predict(mt1$model, newdata = data_test_trans, type = "response")
test_predict <- ifelse(test_predict < 0.5, 0, 1)
data_test$target <- test_predict
ggplot(data_test, aes(x = seq_len(nrow(data_test)), y = target, color = factor(target))) +
  geom_point() +
  labs(x = "Observation", y = "target", title = "Model mt1 Prediction", colour = "target")
test_predict
0 1
22 18