Overview

In this homework assignment, you will explore, analyze, and model a data set containing crime information for various neighborhoods of a major city. Each record has a binary response variable indicating whether the neighborhood's crime rate is above the median crime rate (1) or not (0).

Your objective is to build a binary logistic regression model on the training data set to predict whether a neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You may only use the variables given to you (or variables that you derive from them). Below is a short description of the variables of interest in the data set.

Data Exploration

In this section, we explore the data and familiarize ourselves with the features we can use during model building. The training data has 466 observations, 1 target variable, and 12 predictor variables. First, we look at the distribution of each predictor variable so we can assess the steps needed to generate a good model fit later on.

Distribution of Each Predictor Variable

# histogram of every column (target included), laid out in a 4 x 4 grid
par(mfrow = c(4, 4), mar = c(3, 3, 1, 1))

for (col_name in names(train_df)) {
    hist(train_df[[col_name]], main = paste(col_name), xlab = "Value")
}

par(mfrow = c(1, 1))

kable(tidy(train_df), "pipe")
| column  |   n | mean        | sd          | median    | trimmed     | mad     | min      | max      | range    | skew       | kurtosis  | se        |
|---------|-----|-------------|-------------|-----------|-------------|---------|----------|----------|----------|------------|-----------|-----------|
| zn      | 466 | 11.5772532  | 23.3646511  | 0.00000   | 5.3542781   | 0.0000  | 0.0000   | 100.0000 | 100.0000 | 2.1838409  | 6.842914  | 1.0823466 |
| indus   | 466 | 11.1050215  | 6.8458549   | 9.69000   | 10.9082353  | 6.3000  | 0.4600   | 27.7400  | 27.2800  | 0.2894763  | 1.764351  | 0.3171281 |
| chas    | 466 | 0.0708155   | 0.2567920   | 0.00000   | 0.0000000   | 0.0000  | 0.0000   | 1.0000   | 1.0000   | 3.3462553  | 12.197425 | 0.0118957 |
| nox     | 466 | 0.5543105   | 0.1166667   | 0.53800   | 0.5442684   | 0.0900  | 0.3890   | 0.8710   | 0.4820   | 0.7487369  | 2.976990  | 0.0054045 |
| rm      | 466 | 6.2906738   | 0.7048513   | 6.21000   | 6.2570615   | 0.3485  | 3.8630   | 8.7800   | 4.9170   | 0.4808673  | 4.561996  | 0.0326516 |
| age     | 466 | 68.3675966  | 28.3213784  | 77.15000  | 70.9553476  | 20.2500 | 2.9000   | 100.0000 | 97.1000  | -0.5795721 | 1.998687  | 1.3119625 |
| dis     | 466 | 3.7956929   | 2.1069496   | 3.19095   | 3.5443647   | 1.2913  | 1.1296   | 12.1265  | 10.9969  | 1.0021166  | 3.486917  | 0.0976026 |
| rad     | 466 | 9.5300429   | 8.6859272   | 5.00000   | 8.6978610   | 1.0000  | 1.0000   | 24.0000  | 23.0000  | 1.0135395  | 2.147295  | 0.4023678 |
| tax     | 466 | 409.5021459 | 167.9000887 | 334.50000 | 401.5080214 | 70.5000 | 187.0000 | 711.0000 | 524.0000 | 0.6614416  | 1.859928  | 7.7778214 |
| ptratio | 466 | 18.3984979  | 2.1968447   | 18.90000  | 18.5970588  | 1.3000  | 12.6000  | 22.0000  | 9.4000   | -0.7567025 | 2.610831  | 0.1017669 |
| lstat   | 466 | 12.6314592  | 7.1018907   | 11.35000  | 11.8809626  | 4.7700  | 1.7300   | 37.9700  | 36.2400  | 0.9085092  | 3.518453  | 0.3289887 |
| medv    | 466 | 22.5892704  | 9.2396814   | 21.20000  | 21.6304813  | 4.0500  | 5.0000   | 50.0000  | 45.0000  | 1.0801670  | 4.392615  | 0.4280200 |
| target  | 466 | 0.4914163   | 0.5004636   | 0.00000   | 0.4893048   | 0.0000  | 0.0000   | 1.0000   | 1.0000   | 0.0343398  | 1.001179  | 0.0231835 |
# print the dimensions, column names, column classes, and missing-value counts of a data frame
show_summary <- function(df) {
    cat(rep("+", 50), "\n")
    cat(paste("DIMENSIONS : (", nrow(df), ", ", ncol(df), ")\n", sep = ""), "\n")
    cat(rep("+", 50), "\n")
    cat("COLUMNS:\n", "\n")
    cat(names(df), "\n")
    cat(rep("+", 50), "\n")
    cat("DATA INFO:\n", "\n")
    cat(sapply(df, class), "\n")
    cat(rep("+", 50), "\n")
    cat("MISSING VALUES:\n", "\n")
    cat(colSums(is.na(df)), "\n")
}

show_summary(train_df)
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
## DIMENSIONS : (466, 13)
##  
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
## COLUMNS:
##  
## zn indus chas nox rm age dis rad tax ptratio lstat medv target 
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
## DATA INFO:
##  
## numeric numeric integer numeric numeric numeric numeric integer integer numeric numeric numeric integer 
## + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
## MISSING VALUES:
##  
## 0 0 0 0 0 0 0 0 0 0 0 0 0

The data are already fairly clean: there are no missing values, and all of the explanatory variables have appropriate data types. The next step is to check for multicollinearity.

# correlation plot
correlation_Matrix <- cor(train_df[, 1:12])

corrplot(correlation_Matrix, method = "color", type = "upper", addCoef.col = "black",
    number.cex = 0.7)

We want to remove highly correlated predictors; any pairwise correlation above 0.7 in absolute value is treated as a sign of multicollinearity. For example, dis has strong negative correlations with indus, nox, and age. Based on this correlation plot, we will remove the following variables in order to minimize multicollinearity: dis, tax, medv, and nox.
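Rather than reading the pairs off the plot by eye, the same |r| > 0.7 rule can be checked programmatically. A minimal base-R sketch reusing the correlation_Matrix computed above (the 0.7 cutoff is our own choice of threshold, not a fixed rule):

# list predictor pairs whose absolute correlation exceeds 0.7
high_corr <- which(abs(correlation_Matrix) > 0.7 & upper.tri(correlation_Matrix),
    arr.ind = TRUE)
data.frame(var1 = rownames(correlation_Matrix)[high_corr[, 1]],
    var2 = colnames(correlation_Matrix)[high_corr[, 2]],
    r = correlation_Matrix[high_corr])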

train_df <- train_df |>
    mutate(crime = ifelse(target == 1, "high", "low"))

df_2 <- train_df |>
    select(indus, nox, age, dis, crime)
ggpairs(data = df_2, columns = 1:4, ggplot2::aes(color = crime))

train_df <- train_df |>
    select(-dis, -tax, -medv, -nox)

Data Preparation

After removing the highly correlated variables, the updated correlation plot shows minimal collinearity among the remaining independent predictor variables, which is what we want to achieve.

correlation_Matrix <- cor(train_df[, 1:8])

corrplot(correlation_Matrix, method = "color", type = "upper", addCoef.col = "black",
    number.cex = 0.7)

train_df <- train_df |>
    select(-crime)

Model Building

We start with a simple logistic model and then work our way up to the best-fitting model.

Simple/Base Logistic Model

This model establishes a baseline AIC of 522.46 using a single predictor, the zn variable. We expect this first model to have the worst AIC value.

simple_model <- glm(target ~ zn, data = train_df, family = "binomial")
summary(simple_model)
## 
## Call:
## glm(formula = target ~ zn, family = "binomial", data = train_df)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.54503    0.11170   4.879 1.06e-06 ***
## zn          -0.09176    0.01349  -6.804 1.02e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 518.46  on 464  degrees of freedom
## AIC: 522.46
## 
## Number of Fisher Scoring iterations: 6

Everything Model

For the second model, we include all of the predictor variables that remain after dropping the highly correlated ones. This gives an improved AIC of 294.49, which may still not be the best we can achieve.

everything_model <- glm(target ~ ., data = train_df, family = "binomial")
summary(everything_model)
## 
## Call:
## glm(formula = target ~ ., family = "binomial", data = train_df)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -8.640771   3.035549  -2.847  0.00442 ** 
## zn          -0.056399   0.019159  -2.944  0.00324 ** 
## indus        0.049239   0.027236   1.808  0.07063 .  
## chas         0.168888   0.584261   0.289  0.77253    
## rm           0.693054   0.349282   1.984  0.04723 *  
## age          0.035598   0.008583   4.148 3.36e-05 ***
## rad          0.483674   0.113763   4.252 2.12e-05 ***
## ptratio     -0.102886   0.071228  -1.444  0.14861    
## lstat        0.043899   0.041213   1.065  0.28680    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 276.49  on 457  degrees of freedom
## AIC: 294.49
## 
## Number of Fisher Scoring iterations: 8

After fitting everything_model, we double-check collinearity with the vif() function. The table below shows relatively low variance inflation factors for every predictor, so little collinearity remains among them.

data.frame(vif(everything_model))
##         vif.everything_model.
## zn                   1.464757
## indus                1.494374
## chas                 1.063072
## rm                   2.501189
## age                  1.560749
## rad                  1.142231
## ptratio              1.289813
## lstat                2.698502
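A common rule of thumb (though not a hard cutoff) is to treat a VIF above 5 as a sign of problematic collinearity. A quick check on the fitted model, which flags nothing here:

# predictors whose VIF exceeds 5 (returns character(0) for this model)
vif_values <- vif(everything_model)
names(vif_values)[vif_values > 5]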

Backward Selection

step(everything_model, direction = "backward", scope = formula(everything_model))
## Start:  AIC=294.49
## target ~ zn + indus + chas + rm + age + rad + ptratio + lstat
## 
##           Df Deviance    AIC
## - chas     1   276.57 292.57
## - lstat    1   277.62 293.62
## <none>         276.49 294.49
## - ptratio  1   278.61 294.61
## - indus    1   279.81 295.81
## - rm       1   280.49 296.49
## - zn       1   288.67 304.67
## - age      1   297.28 313.28
## - rad      1   367.79 383.79
## 
## Step:  AIC=292.57
## target ~ zn + indus + rm + age + rad + ptratio + lstat
## 
##           Df Deviance    AIC
## - lstat    1   277.73 291.73
## <none>         276.57 292.57
## - ptratio  1   278.95 292.95
## - indus    1   279.99 293.99
## - rm       1   280.57 294.57
## - zn       1   289.10 303.10
## - age      1   297.37 311.37
## - rad      1   368.22 382.22
## 
## Step:  AIC=291.73
## target ~ zn + indus + rm + age + rad + ptratio
## 
##           Df Deviance    AIC
## <none>         277.73 291.73
## - ptratio  1   280.04 292.04
## - rm       1   280.74 292.74
## - indus    1   281.64 293.64
## - zn       1   289.48 301.48
## - age      1   312.25 324.25
## - rad      1   373.55 385.55
## 
## Call:  glm(formula = target ~ zn + indus + rm + age + rad + ptratio, 
##     family = "binomial", data = train_df)
## 
## Coefficients:
## (Intercept)           zn        indus           rm          age          rad  
##    -6.83783     -0.05417      0.05294      0.43959      0.03994      0.48680  
##     ptratio  
##    -0.10514  
## 
## Degrees of Freedom: 465 Total (i.e. Null);  459 Residual
## Null Deviance:       645.9 
## Residual Deviance: 277.7     AIC: 291.7

Forward Selection

step(simple_model, direction = "forward", scope = formula(everything_model))
## Start:  AIC=522.46
## target ~ zn
## 
##           Df Deviance    AIC
## + rad      1   344.89 350.89
## + age      1   407.45 413.45
## + indus    1   432.03 438.03
## + lstat    1   473.70 479.70
## + chas     1   516.06 522.06
## <none>         518.46 522.46
## + ptratio  1   517.42 523.42
## + rm       1   518.37 524.37
## 
## Step:  AIC=350.89
## target ~ zn + rad
## 
##           Df Deviance    AIC
## + age      1   286.93 294.93
## + indus    1   325.11 333.11
## + ptratio  1   334.65 342.65
## + lstat    1   336.31 344.31
## + chas     1   342.50 350.50
## + rm       1   342.80 350.80
## <none>         344.89 350.89
## 
## Step:  AIC=294.93
## target ~ zn + rad + age
## 
##           Df Deviance    AIC
## + ptratio  1   283.11 293.11
## + rm       1   284.19 294.19
## + indus    1   284.80 294.80
## <none>         286.93 294.93
## + chas     1   286.28 296.28
## + lstat    1   286.80 296.80
## 
## Step:  AIC=293.11
## target ~ zn + rad + age + ptratio
## 
##         Df Deviance    AIC
## + indus  1   280.74 292.74
## <none>       283.11 293.11
## + rm     1   281.64 293.64
## + chas   1   282.94 294.94
## + lstat  1   283.11 295.11
## 
## Step:  AIC=292.74
## target ~ zn + rad + age + ptratio + indus
## 
##         Df Deviance    AIC
## + rm     1   277.73 291.73
## <none>       280.74 292.74
## + lstat  1   280.57 294.57
## + chas   1   280.67 294.67
## 
## Step:  AIC=291.73
## target ~ zn + rad + age + ptratio + indus + rm
## 
##         Df Deviance    AIC
## <none>       277.73 291.73
## + lstat  1   276.57 292.57
## + chas   1   277.62 293.62
## 
## Call:  glm(formula = target ~ zn + rad + age + ptratio + indus + rm, 
##     family = "binomial", data = train_df)
## 
## Coefficients:
## (Intercept)           zn          rad          age      ptratio        indus  
##    -6.83783     -0.05417      0.48680      0.03994     -0.10514      0.05294  
##          rm  
##     0.43959  
## 
## Degrees of Freedom: 465 Total (i.e. Null);  459 Residual
## Null Deviance:       645.9 
## Residual Deviance: 277.7     AIC: 291.7

With both forward and backward selection, the resulting logit models perform identically, each with an AIC of 291.7. Both directions also arrive at the same formula: target ~ zn + rad + age + ptratio + indus + rm.
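As a sanity check, step() can also search in both directions at once. A minimal sketch, which we would expect to land on the same formula given the agreement above (trace = 0 simply suppresses the step-by-step printout):

both_model <- step(everything_model, direction = "both", scope = formula(everything_model),
    trace = 0)
formula(both_model)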


Model Selection

The step model has the lowest AIC value, meaning it has the lowest estimated prediction error relative to the other models. As expected, the first model had the worst AIC, since a single predictor cannot explain most of the variation in the target. The second model, which used all of the remaining variables, had a substantially better AIC, but it is still not the best we can do: the third model, built through stepwise selection, improved on it further and, somewhat surprisingly, forward selection and backward elimination converged on the same result. The step model is therefore the best model we obtained from the predictor variables in this data set, so we will use it to predict the target variable in eval_df.

step_mod <- glm(target ~ zn + rad + age + ptratio + indus + rm, family = binomial,
    data = train_df)

predictions <- predict(step_mod, eval_df, type = "response")

threshold <- 0.5

binary_predictions <- ifelse(predictions >= threshold, 1, 0)

eval_df$target_pred <- binary_predictions
model_names <- c("simple", "everything", "step")
aic_values <- c(simple_model$aic, everything_model$aic, step_mod$aic)

kable(cbind(model_names, aic_values), col.names = c("Model Name", "AIC Value")) |>
    kable_styling(full_width = T)
| Model Name | AIC Value        |
|------------|------------------|
| simple     | 522.464452017651 |
| everything | 294.486574605442 |
| step       | 291.726687224588 |
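Before applying the 0.5 cutoff to the evaluation data, it is worth seeing how the chosen step model classifies the training data it was fit on. A minimal sketch; the confusion matrix and in-sample accuracy below are diagnostics of our own, not part of the required output:

# in-sample confusion matrix and accuracy at the 0.5 threshold
train_probs <- predict(step_mod, type = "response")
train_preds <- ifelse(train_probs >= threshold, 1, 0)
table(predicted = train_preds, actual = train_df$target)
mean(train_preds == train_df$target)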

Predictions

kable(eval_df) |>
    kable_styling(full_width = T)
| zn | indus | chas | nox   | rm    | age   | dis    | rad | tax | ptratio | lstat | medv | target_pred |
|----|-------|------|-------|-------|-------|--------|-----|-----|---------|-------|------|-------------|
| 0  | 7.07  | 0    | 0.469 | 7.185 | 61.1  | 4.9671 | 2   | 242 | 17.8    | 4.03  | 34.7 | 0 |
| 0  | 8.14  | 0    | 0.538 | 6.096 | 84.5  | 4.4619 | 4   | 307 | 21.0    | 10.26 | 18.2 | 0 |
| 0  | 8.14  | 0    | 0.538 | 6.495 | 94.4  | 4.4547 | 4   | 307 | 21.0    | 12.80 | 18.4 | 0 |
| 0  | 8.14  | 0    | 0.538 | 5.950 | 82.0  | 3.9900 | 4   | 307 | 21.0    | 27.71 | 13.2 | 0 |
| 0  | 5.96  | 0    | 0.499 | 5.850 | 41.5  | 3.9342 | 5   | 279 | 19.2    | 8.77  | 21.0 | 0 |
| 25 | 5.13  | 0    | 0.453 | 5.741 | 66.2  | 7.2254 | 8   | 284 | 19.7    | 13.15 | 18.7 | 0 |
| 25 | 5.13  | 0    | 0.453 | 5.966 | 93.4  | 6.8185 | 8   | 284 | 19.7    | 14.44 | 16.0 | 1 |
| 0  | 4.49  | 0    | 0.449 | 6.630 | 56.1  | 4.4377 | 3   | 247 | 18.5    | 6.53  | 26.6 | 0 |
| 0  | 4.49  | 0    | 0.449 | 6.121 | 56.8  | 3.7476 | 3   | 247 | 18.5    | 8.44  | 22.2 | 0 |
| 0  | 2.89  | 0    | 0.445 | 6.163 | 69.6  | 3.4952 | 2   | 276 | 18.0    | 11.34 | 21.4 | 0 |
| 0  | 25.65 | 0    | 0.581 | 5.856 | 97.0  | 1.9444 | 2   | 188 | 19.1    | 25.41 | 17.3 | 0 |
| 0  | 25.65 | 0    | 0.581 | 5.613 | 95.6  | 1.7572 | 2   | 188 | 19.1    | 27.26 | 15.7 | 0 |
| 0  | 21.89 | 0    | 0.624 | 5.637 | 94.7  | 1.9799 | 4   | 437 | 21.2    | 18.34 | 14.3 | 1 |
| 0  | 19.58 | 0    | 0.605 | 6.101 | 93.0  | 2.2834 | 5   | 403 | 14.7    | 9.81  | 25.0 | 1 |
| 0  | 19.58 | 0    | 0.605 | 5.880 | 97.3  | 2.3887 | 5   | 403 | 14.7    | 12.03 | 19.1 | 1 |
| 0  | 10.59 | 1    | 0.489 | 5.960 | 92.1  | 3.8771 | 4   | 277 | 18.6    | 17.27 | 21.7 | 1 |
| 0  | 6.20  | 0    | 0.504 | 6.552 | 21.4  | 3.3751 | 8   | 307 | 17.4    | 3.76  | 31.5 | 0 |
| 0  | 6.20  | 0    | 0.507 | 8.247 | 70.4  | 3.6519 | 8   | 307 | 17.4    | 3.95  | 48.3 | 1 |
| 22 | 5.86  | 0    | 0.431 | 6.957 | 6.8   | 8.9067 | 7   | 330 | 19.1    | 3.53  | 29.6 | 0 |
| 90 | 2.97  | 0    | 0.400 | 7.088 | 20.8  | 7.3073 | 1   | 285 | 15.3    | 7.85  | 32.2 | 0 |
| 80 | 1.76  | 0    | 0.385 | 6.230 | 31.5  | 9.0892 | 1   | 241 | 18.2    | 12.93 | 20.1 | 0 |
| 33 | 2.18  | 0    | 0.472 | 6.616 | 58.1  | 3.3700 | 7   | 222 | 18.4    | 8.93  | 28.4 | 0 |
| 0  | 9.90  | 0    | 0.544 | 6.122 | 52.8  | 2.6403 | 4   | 304 | 18.4    | 5.98  | 22.1 | 0 |
| 0  | 7.38  | 0    | 0.493 | 6.415 | 40.1  | 4.7211 | 5   | 287 | 19.6    | 6.12  | 25.0 | 0 |
| 0  | 7.38  | 0    | 0.493 | 6.312 | 28.9  | 5.4159 | 5   | 287 | 19.6    | 6.15  | 23.0 | 0 |
| 0  | 5.19  | 0    | 0.515 | 5.895 | 59.6  | 5.6150 | 5   | 224 | 20.2    | 10.56 | 18.5 | 0 |
| 80 | 2.01  | 0    | 0.435 | 6.635 | 29.7  | 8.3440 | 4   | 280 | 17.0    | 5.99  | 24.5 | 0 |
| 0  | 18.10 | 0    | 0.718 | 3.561 | 87.9  | 1.6132 | 24  | 666 | 20.2    | 7.12  | 27.5 | 1 |
| 0  | 18.10 | 1    | 0.631 | 7.016 | 97.5  | 1.2024 | 24  | 666 | 20.2    | 2.96  | 50.0 | 1 |
| 0  | 18.10 | 0    | 0.584 | 6.348 | 86.1  | 2.0527 | 24  | 666 | 20.2    | 17.64 | 14.5 | 1 |
| 0  | 18.10 | 0    | 0.740 | 5.935 | 87.9  | 1.8206 | 24  | 666 | 20.2    | 34.02 | 8.4  | 1 |
| 0  | 18.10 | 0    | 0.740 | 5.627 | 93.9  | 1.8172 | 24  | 666 | 20.2    | 22.88 | 12.8 | 1 |
| 0  | 18.10 | 0    | 0.740 | 5.818 | 92.4  | 1.8662 | 24  | 666 | 20.2    | 22.11 | 10.5 | 1 |
| 0  | 18.10 | 0    | 0.740 | 6.219 | 100.0 | 2.0048 | 24  | 666 | 20.2    | 16.59 | 18.4 | 1 |
| 0  | 18.10 | 0    | 0.740 | 5.854 | 96.6  | 1.8956 | 24  | 666 | 20.2    | 23.79 | 10.8 | 1 |
| 0  | 18.10 | 0    | 0.713 | 6.525 | 86.5  | 2.4358 | 24  | 666 | 20.2    | 18.13 | 14.1 | 1 |
| 0  | 18.10 | 0    | 0.713 | 6.376 | 88.4  | 2.5671 | 24  | 666 | 20.2    | 14.65 | 17.7 | 1 |
| 0  | 18.10 | 0    | 0.655 | 6.209 | 65.4  | 2.9634 | 24  | 666 | 20.2    | 13.22 | 21.4 | 1 |
| 0  | 9.69  | 0    | 0.585 | 5.794 | 70.6  | 2.8927 | 6   | 391 | 19.2    | 14.10 | 18.3 | 0 |
| 0  | 11.93 | 0    | 0.573 | 6.976 | 91.0  | 2.1675 | 1   | 273 | 21.0    | 5.64  | 23.9 | 0 |
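
Finally, since the assignment asks for both classifications and probabilities, the predicted probabilities can be attached alongside target_pred and the results written out. A minimal sketch; the column name target_prob and the output file name are our own choices:

# attach predicted probabilities and export the evaluation predictions
eval_df$target_prob <- predictions
write.csv(eval_df, "crime_eval_predictions.csv", row.names = FALSE)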