In this homework assignment, you will explore, analyze, and model a data set containing crime information for various neighborhoods of a major city. Each record has a response variable indicating whether the crime rate is above the median (1) or not (0).
Your objective is to build a binary logistic regression model on the training data set to predict whether a neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You may only use the variables given to you (or variables derived from them). Below is a short description of the variables of interest in the data set.
In this section, we explore the data and familiarize ourselves with the features available for model building. The training data has 466 observations, 1 target variable, and 12 predictor variables. First, we look at the distribution of each variable so we can assess the steps needed to generate the best model fit later on.
# Plot the distribution of every variable in a grid of histograms
par(mfrow = c(4, 4), mar = c(3, 3, 1, 1))
for (col_name in names(train_df)) {
    hist(train_df[[col_name]], main = col_name, xlab = "Value")
}
par(mfrow = c(1, 1))
| column | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| zn | 466 | 11.5772532 | 23.3646511 | 0.00000 | 5.3542781 | 0.0000 | 0.0000 | 100.0000 | 100.0000 | 2.1838409 | 6.842914 | 1.0823466 |
| indus | 466 | 11.1050215 | 6.8458549 | 9.69000 | 10.9082353 | 6.3000 | 0.4600 | 27.7400 | 27.2800 | 0.2894763 | 1.764351 | 0.3171281 |
| chas | 466 | 0.0708155 | 0.2567920 | 0.00000 | 0.0000000 | 0.0000 | 0.0000 | 1.0000 | 1.0000 | 3.3462553 | 12.197425 | 0.0118957 |
| nox | 466 | 0.5543105 | 0.1166667 | 0.53800 | 0.5442684 | 0.0900 | 0.3890 | 0.8710 | 0.4820 | 0.7487369 | 2.976990 | 0.0054045 |
| rm | 466 | 6.2906738 | 0.7048513 | 6.21000 | 6.2570615 | 0.3485 | 3.8630 | 8.7800 | 4.9170 | 0.4808673 | 4.561996 | 0.0326516 |
| age | 466 | 68.3675966 | 28.3213784 | 77.15000 | 70.9553476 | 20.2500 | 2.9000 | 100.0000 | 97.1000 | -0.5795721 | 1.998687 | 1.3119625 |
| dis | 466 | 3.7956929 | 2.1069496 | 3.19095 | 3.5443647 | 1.2913 | 1.1296 | 12.1265 | 10.9969 | 1.0021166 | 3.486917 | 0.0976026 |
| rad | 466 | 9.5300429 | 8.6859272 | 5.00000 | 8.6978610 | 1.0000 | 1.0000 | 24.0000 | 23.0000 | 1.0135395 | 2.147295 | 0.4023678 |
| tax | 466 | 409.5021459 | 167.9000887 | 334.50000 | 401.5080214 | 70.5000 | 187.0000 | 711.0000 | 524.0000 | 0.6614416 | 1.859928 | 7.7778214 |
| ptratio | 466 | 18.3984979 | 2.1968447 | 18.90000 | 18.5970588 | 1.3000 | 12.6000 | 22.0000 | 9.4000 | -0.7567025 | 2.610831 | 0.1017669 |
| lstat | 466 | 12.6314592 | 7.1018907 | 11.35000 | 11.8809626 | 4.7700 | 1.7300 | 37.9700 | 36.2400 | 0.9085092 | 3.518453 | 0.3289887 |
| medv | 466 | 22.5892704 | 9.2396814 | 21.20000 | 21.6304813 | 4.0500 | 5.0000 | 50.0000 | 45.0000 | 1.0801670 | 4.392615 | 0.4280200 |
| target | 466 | 0.4914163 | 0.5004636 | 0.00000 | 0.4893048 | 0.0000 | 0.0000 | 1.0000 | 1.0000 | 0.0343398 | 1.001179 | 0.0231835 |
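A sketch of how a summary table like the one above can be generated (assuming the psych package, whose describe() produces exactly these columns):
# Per-variable descriptive statistics, rendered as a table
library(psych)
describe(train_df) |> knitr::kable()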
# Print a quick overview of a data frame: dimensions, column names,
# column classes, and per-column missing-value counts
show_summary <- function(df) {
    cat(rep("+", 50), "\n")
    cat(paste("DIMENSIONS : (", nrow(df), ", ", ncol(df), ")\n", sep = ""), "\n")
    cat(rep("+", 50), "\n")
    cat("COLUMNS:\n", "\n")
    cat(names(df), "\n")
    cat(rep("+", 50), "\n")
    cat("DATA INFO:\n", "\n")
    cat(sapply(df, class), "\n")
    cat(rep("+", 50), "\n")
    cat("MISSING VALUES:\n", "\n")
    cat(colSums(is.na(df)), "\n")
}
show_summary(train_df)
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
## DIMENSIONS : (466, 13)
##
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
## COLUMNS:
##
## zn indus chas nox rm age dis rad tax ptratio lstat medv target
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
## DATA INFO:
##
## numeric numeric integer numeric numeric numeric numeric integer integer numeric numeric numeric integer
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
## MISSING VALUES:
##
## 0 0 0 0 0 0 0 0 0 0 0 0 0
The data is already fairly clean: there are no missing values, and all the explanatory variables have appropriate data types. The next step is to check for multicollinearity.
# correlation plot
correlation_Matrix <- cor(train_df[, 1:12])
corrplot(correlation_Matrix, method = "color", type = "upper", addCoef.col = "black",
    number.cex = 0.7)
We will remove any predictor involved in a pairwise correlation above 0.7 in absolute value, to reduce multicollinearity. The variable dis, for example, has a high negative correlation with indus, nox, and age. Based on this correlation plot, we remove the following variables: nox, dis, tax, and medv.
# Add a readable class label so ggpairs can color by crime level
train_df <- train_df |>
    mutate(crime = ifelse(target == 1, "high", "low"))
# Pairwise plots of the highly correlated predictors before dropping them
df_2 <- train_df |>
    select(indus, nox, age, dis, crime)
ggpairs(data = df_2, columns = 1:4, ggplot2::aes(color = crime))
After removing the highly correlated variables, the updated correlation plot below shows minimal collinearity between the remaining predictor variables, which is what we want.
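A sketch of the drop itself (assuming dplyr; note that nox is removed as well, consistent with the model fits that follow):
# Remove the collinear predictors identified above
train_df <- train_df |>
    select(-nox, -dis, -tax, -medv)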
correlation_Matrix <- cor(train_df[, c("zn", "indus", "chas", "rm", "age", "rad", "ptratio", "lstat")])
corrplot(correlation_Matrix, method = "color", type = "upper", addCoef.col = "black",
    number.cex = 0.7)
We start off with a simple logistic model and then work our way up to the best-fitting model.
With this first model, we establish a baseline AIC of 522.46 using a single predictor, zn. We expect this model to have the worst AIC of the three we fit.
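The fit behind the summary below (the call is taken from the output itself):
# Baseline: a single-predictor logistic regression on zn
simple_model <- glm(target ~ zn, family = "binomial", data = train_df)
summary(simple_model)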
##
## Call:
## glm(formula = target ~ zn, family = "binomial", data = train_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.54503 0.11170 4.879 1.06e-06 ***
## zn -0.09176 0.01349 -6.804 1.02e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 518.46 on 464 degrees of freedom
## AIC: 522.46
##
## Number of Fisher Scoring iterations: 6
For the second model, we include all of the remaining, linearly independent predictor variables. This gives a much better AIC of 294.49, though it may still not be the best.
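A sketch of this fit (assuming the temporary crime label is dropped first, which the coefficient list below implies):
# Full model on all retained predictors
train_df <- train_df |> select(-crime)  # remove the plotting label (assumption)
everything_model <- glm(target ~ ., family = "binomial", data = train_df)
summary(everything_model)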
##
## Call:
## glm(formula = target ~ ., family = "binomial", data = train_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.640771 3.035549 -2.847 0.00442 **
## zn -0.056399 0.019159 -2.944 0.00324 **
## indus 0.049239 0.027236 1.808 0.07063 .
## chas 0.168888 0.584261 0.289 0.77253
## rm 0.693054 0.349282 1.984 0.04723 *
## age 0.035598 0.008583 4.148 3.36e-05 ***
## rad 0.483674 0.113763 4.252 2.12e-05 ***
## ptratio -0.102886 0.071228 -1.444 0.14861
## lstat 0.043899 0.041213 1.065 0.28680
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 276.49 on 457 degrees of freedom
## AIC: 294.49
##
## Number of Fisher Scoring iterations: 8
After creating the everything_model, we double-check collinearity with the vif() function. The table below shows relatively low multicollinearity among the predictors: every VIF is below 3.
## vif.everything_model.
## zn 1.464757
## indus 1.494374
## chas 1.063072
## rm 2.501189
## age 1.560749
## rad 1.142231
## ptratio 1.289813
## lstat 2.698502
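The VIF values above can be reproduced with something like the following (assuming the car package, whose vif() accepts glm fits):
# Variance inflation factors for the full model; values near 1 indicate
# little collinearity, and all of ours are below 3
library(car)
vif(everything_model)
We then run stepwise selection; the backward-elimination trace from the full model follows.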
## Start: AIC=294.49
## target ~ zn + indus + chas + rm + age + rad + ptratio + lstat
##
## Df Deviance AIC
## - chas 1 276.57 292.57
## - lstat 1 277.62 293.62
## <none> 276.49 294.49
## - ptratio 1 278.61 294.61
## - indus 1 279.81 295.81
## - rm 1 280.49 296.49
## - zn 1 288.67 304.67
## - age 1 297.28 313.28
## - rad 1 367.79 383.79
##
## Step: AIC=292.57
## target ~ zn + indus + rm + age + rad + ptratio + lstat
##
## Df Deviance AIC
## - lstat 1 277.73 291.73
## <none> 276.57 292.57
## - ptratio 1 278.95 292.95
## - indus 1 279.99 293.99
## - rm 1 280.57 294.57
## - zn 1 289.10 303.10
## - age 1 297.37 311.37
## - rad 1 368.22 382.22
##
## Step: AIC=291.73
## target ~ zn + indus + rm + age + rad + ptratio
##
## Df Deviance AIC
## <none> 277.73 291.73
## - ptratio 1 280.04 292.04
## - rm 1 280.74 292.74
## - indus 1 281.64 293.64
## - zn 1 289.48 301.48
## - age 1 312.25 324.25
## - rad 1 373.55 385.55
##
## Call: glm(formula = target ~ zn + indus + rm + age + rad + ptratio,
## family = "binomial", data = train_df)
##
## Coefficients:
## (Intercept) zn indus rm age rad
## -6.83783 -0.05417 0.05294 0.43959 0.03994 0.48680
## ptratio
## -0.10514
##
## Degrees of Freedom: 465 Total (i.e. Null); 459 Residual
## Null Deviance: 645.9
## Residual Deviance: 277.7 AIC: 291.7
For comparison, forward selection starting from the zn-only model arrives at the same place:
## Start: AIC=522.46
## target ~ zn
##
## Df Deviance AIC
## + rad 1 344.89 350.89
## + age 1 407.45 413.45
## + indus 1 432.03 438.03
## + lstat 1 473.70 479.70
## + chas 1 516.06 522.06
## <none> 518.46 522.46
## + ptratio 1 517.42 523.42
## + rm 1 518.37 524.37
##
## Step: AIC=350.89
## target ~ zn + rad
##
## Df Deviance AIC
## + age 1 286.93 294.93
## + indus 1 325.11 333.11
## + ptratio 1 334.65 342.65
## + lstat 1 336.31 344.31
## + chas 1 342.50 350.50
## + rm 1 342.80 350.80
## <none> 344.89 350.89
##
## Step: AIC=294.93
## target ~ zn + rad + age
##
## Df Deviance AIC
## + ptratio 1 283.11 293.11
## + rm 1 284.19 294.19
## + indus 1 284.80 294.80
## <none> 286.93 294.93
## + chas 1 286.28 296.28
## + lstat 1 286.80 296.80
##
## Step: AIC=293.11
## target ~ zn + rad + age + ptratio
##
## Df Deviance AIC
## + indus 1 280.74 292.74
## <none> 283.11 293.11
## + rm 1 281.64 293.64
## + chas 1 282.94 294.94
## + lstat 1 283.11 295.11
##
## Step: AIC=292.74
## target ~ zn + rad + age + ptratio + indus
##
## Df Deviance AIC
## + rm 1 277.73 291.73
## <none> 280.74 292.74
## + lstat 1 280.57 294.57
## + chas 1 280.67 294.67
##
## Step: AIC=291.73
## target ~ zn + rad + age + ptratio + indus + rm
##
## Df Deviance AIC
## <none> 277.73 291.73
## + lstat 1 276.57 292.57
## + chas 1 277.62 293.62
##
## Call: glm(formula = target ~ zn + rad + age + ptratio + indus + rm,
## family = "binomial", data = train_df)
##
## Coefficients:
## (Intercept) zn rad age ptratio indus
## -6.83783 -0.05417 0.48680 0.03994 -0.10514 0.05294
## rm
## 0.43959
##
## Degrees of Freedom: 465 Total (i.e. Null); 459 Residual
## Null Deviance: 645.9
## Residual Deviance: 277.7 AIC: 291.7
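The traces above come from step(); a sketch of both searches (the scope and object names are assumptions consistent with the output):
# Backward elimination from the full model
backward_mod <- step(everything_model, direction = "backward")
# Forward selection from the zn-only baseline, allowed to grow to the full formula
forward_mod <- step(simple_model,
    scope = list(lower = target ~ zn,
        upper = target ~ zn + indus + chas + rm + age + rad + ptratio + lstat),
    direction = "forward")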
With both forward and backward selection, the resulting logit models perform identically, each with an AIC of 291.7. Both directions also converge on the same formula: target ~ zn + rad + age + ptratio + indus + rm.
The step model has the lowest AIC, meaning it has the lowest estimated prediction error of the models considered. The first model, as expected, has the worst AIC, since a single predictor captures little of the signal. The second model, with all the retained variables, improves the AIC substantially but is still beaten by the third: the stepwise model, for which forward selection and backward elimination (somewhat surprisingly) yield the same result. We therefore pick the step model to predict the target variable in eval_df.
# Refit the selected model and score the evaluation set
step_mod <- glm(target ~ zn + rad + age + ptratio + indus + rm, family = binomial,
    data = train_df)
predictions <- predict(step_mod, eval_df, type = "response")  # predicted probabilities
threshold <- 0.5
binary_predictions <- ifelse(predictions >= threshold, 1, 0)  # classify at the 0.5 cutoff
eval_df$target_pred <- binary_predictions
model_names <- c("simple", "everything", "step")
aic_values <- c(simple_model$aic, everything_model$aic, step_mod$aic)
kable(cbind(model_names, round(aic_values, 2)), col.names = c("Model Name", "AIC Value")) |>
    kable_styling(full_width = T)
| Model Name | AIC Value |
|---|---|
| simple | 522.46 |
| everything | 294.49 |
| step | 291.73 |
Finally, the evaluation data with the predicted classification (target_pred) appended:
| zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | medv | target_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 4.03 | 34.7 | 0 |
| 0 | 8.14 | 0 | 0.538 | 6.096 | 84.5 | 4.4619 | 4 | 307 | 21.0 | 10.26 | 18.2 | 0 |
| 0 | 8.14 | 0 | 0.538 | 6.495 | 94.4 | 4.4547 | 4 | 307 | 21.0 | 12.80 | 18.4 | 0 |
| 0 | 8.14 | 0 | 0.538 | 5.950 | 82.0 | 3.9900 | 4 | 307 | 21.0 | 27.71 | 13.2 | 0 |
| 0 | 5.96 | 0 | 0.499 | 5.850 | 41.5 | 3.9342 | 5 | 279 | 19.2 | 8.77 | 21.0 | 0 |
| 25 | 5.13 | 0 | 0.453 | 5.741 | 66.2 | 7.2254 | 8 | 284 | 19.7 | 13.15 | 18.7 | 0 |
| 25 | 5.13 | 0 | 0.453 | 5.966 | 93.4 | 6.8185 | 8 | 284 | 19.7 | 14.44 | 16.0 | 1 |
| 0 | 4.49 | 0 | 0.449 | 6.630 | 56.1 | 4.4377 | 3 | 247 | 18.5 | 6.53 | 26.6 | 0 |
| 0 | 4.49 | 0 | 0.449 | 6.121 | 56.8 | 3.7476 | 3 | 247 | 18.5 | 8.44 | 22.2 | 0 |
| 0 | 2.89 | 0 | 0.445 | 6.163 | 69.6 | 3.4952 | 2 | 276 | 18.0 | 11.34 | 21.4 | 0 |
| 0 | 25.65 | 0 | 0.581 | 5.856 | 97.0 | 1.9444 | 2 | 188 | 19.1 | 25.41 | 17.3 | 0 |
| 0 | 25.65 | 0 | 0.581 | 5.613 | 95.6 | 1.7572 | 2 | 188 | 19.1 | 27.26 | 15.7 | 0 |
| 0 | 21.89 | 0 | 0.624 | 5.637 | 94.7 | 1.9799 | 4 | 437 | 21.2 | 18.34 | 14.3 | 1 |
| 0 | 19.58 | 0 | 0.605 | 6.101 | 93.0 | 2.2834 | 5 | 403 | 14.7 | 9.81 | 25.0 | 1 |
| 0 | 19.58 | 0 | 0.605 | 5.880 | 97.3 | 2.3887 | 5 | 403 | 14.7 | 12.03 | 19.1 | 1 |
| 0 | 10.59 | 1 | 0.489 | 5.960 | 92.1 | 3.8771 | 4 | 277 | 18.6 | 17.27 | 21.7 | 1 |
| 0 | 6.20 | 0 | 0.504 | 6.552 | 21.4 | 3.3751 | 8 | 307 | 17.4 | 3.76 | 31.5 | 0 |
| 0 | 6.20 | 0 | 0.507 | 8.247 | 70.4 | 3.6519 | 8 | 307 | 17.4 | 3.95 | 48.3 | 1 |
| 22 | 5.86 | 0 | 0.431 | 6.957 | 6.8 | 8.9067 | 7 | 330 | 19.1 | 3.53 | 29.6 | 0 |
| 90 | 2.97 | 0 | 0.400 | 7.088 | 20.8 | 7.3073 | 1 | 285 | 15.3 | 7.85 | 32.2 | 0 |
| 80 | 1.76 | 0 | 0.385 | 6.230 | 31.5 | 9.0892 | 1 | 241 | 18.2 | 12.93 | 20.1 | 0 |
| 33 | 2.18 | 0 | 0.472 | 6.616 | 58.1 | 3.3700 | 7 | 222 | 18.4 | 8.93 | 28.4 | 0 |
| 0 | 9.90 | 0 | 0.544 | 6.122 | 52.8 | 2.6403 | 4 | 304 | 18.4 | 5.98 | 22.1 | 0 |
| 0 | 7.38 | 0 | 0.493 | 6.415 | 40.1 | 4.7211 | 5 | 287 | 19.6 | 6.12 | 25.0 | 0 |
| 0 | 7.38 | 0 | 0.493 | 6.312 | 28.9 | 5.4159 | 5 | 287 | 19.6 | 6.15 | 23.0 | 0 |
| 0 | 5.19 | 0 | 0.515 | 5.895 | 59.6 | 5.6150 | 5 | 224 | 20.2 | 10.56 | 18.5 | 0 |
| 80 | 2.01 | 0 | 0.435 | 6.635 | 29.7 | 8.3440 | 4 | 280 | 17.0 | 5.99 | 24.5 | 0 |
| 0 | 18.10 | 0 | 0.718 | 3.561 | 87.9 | 1.6132 | 24 | 666 | 20.2 | 7.12 | 27.5 | 1 |
| 0 | 18.10 | 1 | 0.631 | 7.016 | 97.5 | 1.2024 | 24 | 666 | 20.2 | 2.96 | 50.0 | 1 |
| 0 | 18.10 | 0 | 0.584 | 6.348 | 86.1 | 2.0527 | 24 | 666 | 20.2 | 17.64 | 14.5 | 1 |
| 0 | 18.10 | 0 | 0.740 | 5.935 | 87.9 | 1.8206 | 24 | 666 | 20.2 | 34.02 | 8.4 | 1 |
| 0 | 18.10 | 0 | 0.740 | 5.627 | 93.9 | 1.8172 | 24 | 666 | 20.2 | 22.88 | 12.8 | 1 |
| 0 | 18.10 | 0 | 0.740 | 5.818 | 92.4 | 1.8662 | 24 | 666 | 20.2 | 22.11 | 10.5 | 1 |
| 0 | 18.10 | 0 | 0.740 | 6.219 | 100.0 | 2.0048 | 24 | 666 | 20.2 | 16.59 | 18.4 | 1 |
| 0 | 18.10 | 0 | 0.740 | 5.854 | 96.6 | 1.8956 | 24 | 666 | 20.2 | 23.79 | 10.8 | 1 |
| 0 | 18.10 | 0 | 0.713 | 6.525 | 86.5 | 2.4358 | 24 | 666 | 20.2 | 18.13 | 14.1 | 1 |
| 0 | 18.10 | 0 | 0.713 | 6.376 | 88.4 | 2.5671 | 24 | 666 | 20.2 | 14.65 | 17.7 | 1 |
| 0 | 18.10 | 0 | 0.655 | 6.209 | 65.4 | 2.9634 | 24 | 666 | 20.2 | 13.22 | 21.4 | 1 |
| 0 | 9.69 | 0 | 0.585 | 5.794 | 70.6 | 2.8927 | 6 | 391 | 19.2 | 14.10 | 18.3 | 0 |
| 0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273 | 21.0 | 5.64 | 23.9 | 0 |