In this assignment, you will explore, analyze, and model a dataset containing crime information for neighborhoods in a major city. Each observation includes a response variable indicating whether the neighborhood’s crime rate is above the median (1) or not (0).
Your goal is to develop a binary logistic regression model using the training dataset to predict whether a neighborhood is at high risk for crime. You will then use this model to generate both predicted classifications and probabilities for the evaluation dataset. Only the variables provided (or those derived from them) may be used in your analysis. A brief description of the key variables is provided below.
In this section, we explore the dataset and become familiar with the features that will be used in the model-building process. The training dataset consists of 466 observations on 13 variables: one binary target and 12 predictors. To begin, we examine the distribution of each predictor to identify patterns or potential issues and to determine the preprocessing steps needed to build an effective model.
Distribution of Predictor Variables
train_df <- read.csv("crime-training-data_modified.csv")
eval_df <- read.csv("crime-evaluation-data_modified.csv")
# Summary statistics: mean, median, sd
summary_stats <- train_df %>%
summarise(across(where(is.numeric),
list(
mean = ~mean(.),
median = ~median(.),
sd = ~sd(.)
),
.names = "{.col}_{.fn}"))
kable(t(summary_stats), col.names = "Value") %>%
kable_styling(full_width = TRUE)

| Statistic | Value |
|---|---|
| zn_mean | 11.5772532 |
| zn_median | 0.0000000 |
| zn_sd | 23.3646511 |
| indus_mean | 11.1050215 |
| indus_median | 9.6900000 |
| indus_sd | 6.8458549 |
| chas_mean | 0.0708155 |
| chas_median | 0.0000000 |
| chas_sd | 0.2567920 |
| nox_mean | 0.5543105 |
| nox_median | 0.5380000 |
| nox_sd | 0.1166667 |
| rm_mean | 6.2906738 |
| rm_median | 6.2100000 |
| rm_sd | 0.7048513 |
| age_mean | 68.3675966 |
| age_median | 77.1500000 |
| age_sd | 28.3213784 |
| dis_mean | 3.7956929 |
| dis_median | 3.1909500 |
| dis_sd | 2.1069496 |
| rad_mean | 9.5300429 |
| rad_median | 5.0000000 |
| rad_sd | 8.6859272 |
| tax_mean | 409.5021459 |
| tax_median | 334.5000000 |
| tax_sd | 167.9000887 |
| ptratio_mean | 18.3984979 |
| ptratio_median | 18.9000000 |
| ptratio_sd | 2.1968447 |
| lstat_mean | 12.6314592 |
| lstat_median | 11.3500000 |
| lstat_sd | 7.1018907 |
| medv_mean | 22.5892704 |
| medv_median | 21.2000000 |
| medv_sd | 9.2396814 |
| target_mean | 0.4914163 |
| target_median | 0.0000000 |
| target_sd | 0.5004636 |
To better understand the distribution of the predictor variables, we computed summary statistics including the mean, median, and standard deviation for each numeric variable. The results indicate that several variables, such as tax and rad, exhibit higher variability, suggesting potential skewness or the presence of outliers. Additionally, differences between mean and median values for some predictors indicate that the data may not be perfectly symmetric. These insights help guide preprocessing decisions and model selection by highlighting variables that may require transformation or careful interpretation.
par(mfrow = c(4, 4), mar = c(3, 3, 1, 1))
for (col_name in names(train_df)) {
hist(train_df[[col_name]], main = paste(col_name), xlab = "Value")
}
par(mfrow = c(1, 1))

| column | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| zn | 466 | 11.5772532 | 23.3646511 | 0.00000 | 5.3542781 | 0.0000 | 0.0000 | 100.0000 | 100.0000 | 2.1838409 | 6.842914 | 1.0823466 |
| indus | 466 | 11.1050215 | 6.8458549 | 9.69000 | 10.9082353 | 6.3000 | 0.4600 | 27.7400 | 27.2800 | 0.2894763 | 1.764351 | 0.3171281 |
| chas | 466 | 0.0708155 | 0.2567920 | 0.00000 | 0.0000000 | 0.0000 | 0.0000 | 1.0000 | 1.0000 | 3.3462553 | 12.197425 | 0.0118957 |
| nox | 466 | 0.5543105 | 0.1166667 | 0.53800 | 0.5442684 | 0.0900 | 0.3890 | 0.8710 | 0.4820 | 0.7487369 | 2.976990 | 0.0054045 |
| rm | 466 | 6.2906738 | 0.7048513 | 6.21000 | 6.2570615 | 0.3485 | 3.8630 | 8.7800 | 4.9170 | 0.4808673 | 4.561996 | 0.0326516 |
| age | 466 | 68.3675966 | 28.3213784 | 77.15000 | 70.9553476 | 20.2500 | 2.9000 | 100.0000 | 97.1000 | -0.5795721 | 1.998687 | 1.3119625 |
| dis | 466 | 3.7956929 | 2.1069496 | 3.19095 | 3.5443647 | 1.2913 | 1.1296 | 12.1265 | 10.9969 | 1.0021166 | 3.486917 | 0.0976026 |
| rad | 466 | 9.5300429 | 8.6859272 | 5.00000 | 8.6978610 | 1.0000 | 1.0000 | 24.0000 | 23.0000 | 1.0135395 | 2.147295 | 0.4023678 |
| tax | 466 | 409.5021459 | 167.9000887 | 334.50000 | 401.5080214 | 70.5000 | 187.0000 | 711.0000 | 524.0000 | 0.6614416 | 1.859928 | 7.7778214 |
| ptratio | 466 | 18.3984979 | 2.1968447 | 18.90000 | 18.5970588 | 1.3000 | 12.6000 | 22.0000 | 9.4000 | -0.7567025 | 2.610831 | 0.1017669 |
| lstat | 466 | 12.6314592 | 7.1018907 | 11.35000 | 11.8809626 | 4.7700 | 1.7300 | 37.9700 | 36.2400 | 0.9085092 | 3.518453 | 0.3289887 |
| medv | 466 | 22.5892704 | 9.2396814 | 21.20000 | 21.6304813 | 4.0500 | 5.0000 | 50.0000 | 45.0000 | 1.0801670 | 4.392615 | 0.4280200 |
| target | 466 | 0.4914163 | 0.5004636 | 0.00000 | 0.4893048 | 0.0000 | 0.0000 | 1.0000 | 1.0000 | 0.0343398 | 1.001179 | 0.0231835 |
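The table above (n, mean, sd, trimmed mean, skew, kurtosis, and so on) matches the output of psych::describe(); a minimal sketch of producing it, assuming the psych package is installed:

library(psych)
# Extended descriptive statistics (including skew and kurtosis)
# for every column of the training data
describe(train_df) %>%
  kable() %>%
  kable_styling(full_width = TRUE)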
show_summary <- function(df) {
cat(rep("+", 50), "\n")
cat(paste("DIMENSIONS : (", nrow(df), ", ", ncol(df), ")\n", sep = ""), "\n")
cat(rep("+", 50), "\n")
cat("COLUMNS:\n", "\n")
cat(names(df), "\n")
cat(rep("+", 50), "\n")
cat("DATA INFO:\n", "\n")
cat(sapply(df, class), "\n")
cat(rep("+", 50), "\n")
cat("MISSING VALUES:\n", "\n")
cat(colSums(is.na(df)), "\n")
}
show_summary(train_df)

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
DIMENSIONS : (466, 13)
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
COLUMNS:
zn indus chas nox rm age dis rad tax ptratio lstat medv target
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
DATA INFO:
numeric numeric integer numeric numeric numeric numeric integer integer numeric numeric numeric integer
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
MISSING VALUES:
0 0 0 0 0 0 0 0 0 0 0 0 0
The dataset appears to be relatively clean, with no missing values observed. All explanatory variables are in the appropriate data types. The next step is to examine the data for multicollinearity among the predictors.
# correlation plot
correlation_Matrix <- cor(train_df[, 1:12])
corrplot(correlation_Matrix, method = "color", type = "upper", addCoef.col = "black",
number.cex = 0.7)

To address multicollinearity, we will remove variables with high correlations, using a threshold of 0.7. The correlation analysis shows that dis is highly negatively correlated with indus, nox, and age, and that nox, tax, and medv are also strongly correlated (above 0.7) with other predictors. Based on the correlation matrix, we will remove the variables dis, nox, tax, and medv to reduce multicollinearity and improve model stability.
train_df <- train_df |>
mutate(crime = ifelse(target == 1, "high", "low"))
df_2 <- train_df |>
select(indus, nox, age, dis, crime)
ggpairs(data = df_2, columns = 1:4, ggplot2::aes(color = crime))

During preprocessing, we examined correlations among predictor variables to identify multicollinearity. Variables such as dis, tax, medv, and nox showed high correlations (above 0.7) with other predictors, which can lead to unstable coefficient estimates in logistic regression. Therefore, these variables were removed to improve model stability and interpretability.
After removal, the correlation structure was rechecked and showed reduced multicollinearity. Additionally, no missing values were found in the dataset, so no imputation was required, and all variables were already in appropriate numeric format.
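The removal step itself does not appear in the chunks above; a minimal sketch of how it might be done (the dplyr call and the decision to also drop the helper crime label are assumptions, but the resulting column set matches the models fitted below):

# Drop the highly correlated predictors flagged above, plus the
# temporary crime label, before rechecking correlations and fitting models
train_df <- train_df %>%
  select(-dis, -nox, -tax, -medv, -crime)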
correlation_Matrix <- cor(train_df[, 1:8])
corrplot(correlation_Matrix, method = "color", type = "upper", addCoef.col = "black",
number.cex = 0.7)

We start with a simple logistic regression model and then work our way up to the best-fitting model.
Baseline Logistic Regression Model
This model establishes a baseline AIC of 522.46 using a single predictor, zn. We expect this first model to have the worst AIC of the models considered.
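The fitting call is not shown alongside the output below; a minimal sketch (the object name simple_model is taken from the AIC comparison later in the report):

# Baseline logistic regression with a single predictor (zn)
simple_model <- glm(target ~ zn, family = "binomial", data = train_df)
summary(simple_model)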
Call:
glm(formula = target ~ zn, family = "binomial", data = train_df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.54503 0.11170 4.879 1.06e-06 ***
zn -0.09176 0.01349 -6.804 1.02e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 645.88 on 465 degrees of freedom
Residual deviance: 518.46 on 464 degrees of freedom
AIC: 522.46
Number of Fisher Scoring iterations: 6
For the second model, we include all of the predictors retained after the multicollinearity screening. This yields a noticeably better AIC of 294.49, though it may still not be the best achievable.
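As with the baseline, the fitting call is not shown; a sketch (the name everything_model follows the VIF output and AIC table later on):

# Full logistic regression using all retained predictors
everything_model <- glm(target ~ ., family = "binomial", data = train_df)
summary(everything_model)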
Call:
glm(formula = target ~ ., family = "binomial", data = train_df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -8.640771 3.035549 -2.847 0.00442 **
zn -0.056399 0.019159 -2.944 0.00324 **
indus 0.049239 0.027236 1.808 0.07063 .
chas 0.168888 0.584261 0.289 0.77253
rm 0.693054 0.349282 1.984 0.04723 *
age 0.035598 0.008583 4.148 3.36e-05 ***
rad 0.483674 0.113763 4.252 2.12e-05 ***
ptratio -0.102886 0.071228 -1.444 0.14861
lstat 0.043899 0.041213 1.065 0.28680
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 645.88 on 465 degrees of freedom
Residual deviance: 276.49 on 457 degrees of freedom
AIC: 294.49
Number of Fisher Scoring iterations: 8
After building the full model, we used the vif() function to assess multicollinearity among the predictors. The results indicate that the predictor variables exhibit relatively low multicollinearity.
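A sketch of that check, assuming the car package (whose vif() works directly on glm objects):

library(car)
# Variance inflation factors for the full model's predictors
vif(everything_model)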
| Predictor | VIF |
|---|---|
| zn | 1.464757 |
| indus | 1.494374 |
| chas | 1.063072 |
| rm | 2.501189 |
| age | 1.560749 |
| rad | 1.142231 |
| ptratio | 1.289813 |
| lstat | 2.698502 |
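The selection traces below, backward elimination from the full model followed by forward selection from the zn-only model, come from AIC-based stepwise search; a sketch of the calls, assuming base R's step():

# Backward elimination starting from the full model
step_back <- step(everything_model, direction = "backward")

# Forward selection starting from the single-predictor model,
# allowed to add any predictor that appears in the full model
step_fwd <- step(simple_model, direction = "forward",
                 scope = list(lower = target ~ zn,
                              upper = target ~ zn + indus + chas + rm + age + rad + ptratio + lstat))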
Start: AIC=294.49
target ~ zn + indus + chas + rm + age + rad + ptratio + lstat
Df Deviance AIC
- chas 1 276.57 292.57
- lstat 1 277.62 293.62
<none> 276.49 294.49
- ptratio 1 278.61 294.61
- indus 1 279.81 295.81
- rm 1 280.49 296.49
- zn 1 288.67 304.67
- age 1 297.28 313.28
- rad 1 367.79 383.79
Step: AIC=292.57
target ~ zn + indus + rm + age + rad + ptratio + lstat
Df Deviance AIC
- lstat 1 277.73 291.73
<none> 276.57 292.57
- ptratio 1 278.95 292.95
- indus 1 279.99 293.99
- rm 1 280.57 294.57
- zn 1 289.10 303.10
- age 1 297.37 311.37
- rad 1 368.22 382.22
Step: AIC=291.73
target ~ zn + indus + rm + age + rad + ptratio
Df Deviance AIC
<none> 277.73 291.73
- ptratio 1 280.04 292.04
- rm 1 280.74 292.74
- indus 1 281.64 293.64
- zn 1 289.48 301.48
- age 1 312.25 324.25
- rad 1 373.55 385.55
Call: glm(formula = target ~ zn + indus + rm + age + rad + ptratio,
family = "binomial", data = train_df)
Coefficients:
(Intercept) zn indus rm age rad
-6.83783 -0.05417 0.05294 0.43959 0.03994 0.48680
ptratio
-0.10514
Degrees of Freedom: 465 Total (i.e. Null); 459 Residual
Null Deviance: 645.9
Residual Deviance: 277.7 AIC: 291.7
Start: AIC=522.46
target ~ zn
Df Deviance AIC
+ rad 1 344.89 350.89
+ age 1 407.45 413.45
+ indus 1 432.03 438.03
+ lstat 1 473.70 479.70
+ chas 1 516.06 522.06
<none> 518.46 522.46
+ ptratio 1 517.42 523.42
+ rm 1 518.37 524.37
Step: AIC=350.89
target ~ zn + rad
Df Deviance AIC
+ age 1 286.93 294.93
+ indus 1 325.11 333.11
+ ptratio 1 334.65 342.65
+ lstat 1 336.31 344.31
+ chas 1 342.50 350.50
+ rm 1 342.80 350.80
<none> 344.89 350.89
Step: AIC=294.93
target ~ zn + rad + age
Df Deviance AIC
+ ptratio 1 283.11 293.11
+ rm 1 284.19 294.19
+ indus 1 284.80 294.80
<none> 286.93 294.93
+ chas 1 286.28 296.28
+ lstat 1 286.80 296.80
Step: AIC=293.11
target ~ zn + rad + age + ptratio
Df Deviance AIC
+ indus 1 280.74 292.74
<none> 283.11 293.11
+ rm 1 281.64 293.64
+ chas 1 282.94 294.94
+ lstat 1 283.11 295.11
Step: AIC=292.74
target ~ zn + rad + age + ptratio + indus
Df Deviance AIC
+ rm 1 277.73 291.73
<none> 280.74 292.74
+ lstat 1 280.57 294.57
+ chas 1 280.67 294.67
Step: AIC=291.73
target ~ zn + rad + age + ptratio + indus + rm
Df Deviance AIC
<none> 277.73 291.73
+ lstat 1 276.57 292.57
+ chas 1 277.62 293.62
Call: glm(formula = target ~ zn + rad + age + ptratio + indus + rm,
family = "binomial", data = train_df)
Coefficients:
(Intercept) zn rad age ptratio indus
-6.83783 -0.05417 0.48680 0.03994 -0.10514 0.05294
rm
0.43959
Degrees of Freedom: 465 Total (i.e. Null); 459 Residual
Null Deviance: 645.9
Residual Deviance: 277.7 AIC: 291.7
Forward and backward selection arrive at the same logit model, with an AIC of 291.7. Both directions yield the formula target ~ zn + rad + age + ptratio + indus + rm.
Model Evaluation and Selection
The stepwise model has the lowest AIC, indicating the best trade-off between goodness of fit and model complexity among the models considered. As expected, the first model performed the worst, as it captured little of the variability in the data. The second model, which included all retained predictors, showed a noticeable improvement in AIC but was still not optimal. The third model, developed using stepwise selection, achieved the best performance; both forward selection and backward elimination led to the same final model, which was somewhat surprising. Based on these results, the stepwise model is the most appropriate choice, and it will be used to predict the target variable in the evaluation dataset.
step_mod <- glm(target ~ zn + rad + age + ptratio + indus + rm, family = binomial,
data = train_df)
summary(step_mod)
Call:
glm(formula = target ~ zn + rad + age + ptratio + indus + rm,
family = binomial, data = train_df)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.837830 2.547577 -2.684 0.00727 **
zn -0.054166 0.018564 -2.918 0.00353 **
rad 0.486803 0.111965 4.348 1.38e-05 ***
age 0.039941 0.007694 5.191 2.09e-07 ***
ptratio -0.105136 0.069678 -1.509 0.13133
indus 0.052943 0.027031 1.959 0.05016 .
rm 0.439592 0.257184 1.709 0.08740 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 645.88 on 465 degrees of freedom
Residual deviance: 277.73 on 459 degrees of freedom
AIC: 291.73
Number of Fisher Scoring iterations: 8
The final logistic regression model includes the predictors zn, rad, age, ptratio, indus, and rm. The coefficients represent changes in the log-odds of a neighborhood being classified as high crime.
The zn and ptratio coefficients are negative, indicating that higher values are associated with lower odds of a neighborhood being classified as high crime. In contrast, indus, rad, age, and rm have positive coefficients, suggesting that higher values increase the likelihood of high crime.
Most of the coefficient signs align with expected relationships.
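Because the coefficients are on the log-odds scale, exponentiating them yields odds ratios, which are often easier to communicate; for example:

# Odds ratios for the final model: exp(0.4868) is roughly 1.63, so each
# additional unit of rad multiplies the odds of "high crime" by about 1.6
exp(coef(step_mod))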
predictions <- predict(step_mod, eval_df, type = "response")
threshold <- 0.5
binary_predictions <- ifelse(predictions >= threshold, 1, 0)
eval_df$target_pred <- binary_predictions
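The assignment asks for both predicted probabilities and classifications on the evaluation data; a minimal sketch of saving both (the output file name is an assumption):

# Keep the probability alongside the 0/1 classification for each row
eval_df$target_prob <- predictions
write.csv(eval_df, "crime-evaluation-predictions.csv", row.names = FALSE)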
model_names <- c("simple", "everything", "step")
aic_values <- c(simple_model$aic, everything_model$aic, step_mod$aic)
kable(cbind(model_names, aic_values), col.names = c("Model Name", "AIC Value")) |>
kable_styling(full_width = T)

| Model Name | AIC Value |
|---|---|
| simple | 522.464452017651 |
| everything | 294.486574605442 |
| step | 291.726687224588 |
library(caret)
library(pROC)
# Convert to factor for evaluation
actual <- as.factor(eval_df$target)
predicted <- as.factor(binary_predictions)
# Confusion Matrix
conf_mat <- confusionMatrix(predicted, actual)
conf_mat

Confusion Matrix and Statistics
Reference
Prediction 0 1
0 23 0
1 0 17
Accuracy : 1
95% CI : (0.9119, 1)
No Information Rate : 0.575
P-Value [Acc > NIR] : 2.436e-10
Kappa : 1
Mcnemar's Test P-Value : NA
Sensitivity : 1.000
Specificity : 1.000
Pos Pred Value : 1.000
Neg Pred Value : 1.000
Prevalence : 0.575
Detection Rate : 0.575
Detection Prevalence : 0.575
Balanced Accuracy : 1.000
'Positive' Class : 0
# Extract key metrics
accuracy <- conf_mat$overall['Accuracy']
precision <- conf_mat$byClass['Precision']
recall <- conf_mat$byClass['Sensitivity']
specificity <- conf_mat$byClass['Specificity']
f1_score <- conf_mat$byClass['F1']
# ROC and AUC
roc_obj <- roc(actual, predictions)
auc_value <- auc(roc_obj)
# Print results
metrics <- data.frame(
Metric = c("Accuracy", "Precision", "Recall (Sensitivity)",
"Specificity", "F1 Score", "AUC"),
Value = c(accuracy, precision, recall, specificity, f1_score, auc_value)
)
kable(metrics) %>%
kable_styling(full_width = TRUE)

| Metric | Value |
|---|---|
| Accuracy | 1 |
| Precision | 1 |
| Recall (Sensitivity) | 1 |
| Specificity | 1 |
| F1 Score | 1 |
| AUC | 1 |
To evaluate the performance of the final logistic regression model, we used several classification metrics. The confusion matrix compares predicted and actual values: on this 40-observation evaluation sample, the model classified every neighborhood correctly, so accuracy, precision, recall (sensitivity), specificity, and the F1 score all equal 1.
The ROC curve and an AUC of 1 likewise indicate that the predicted probabilities separate high- and low-crime neighborhoods perfectly. Although perfect scores on such a small sample should be read with some caution, the evaluation metrics suggest that the selected model performs well and is suitable for prediction.
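The ROC curve referenced above can be drawn from the roc object created earlier; a short sketch using pROC:

# Plot the ROC curve, annotated with the AUC
plot(roc_obj, main = paste0("ROC Curve (AUC = ", round(auc_value, 3), ")"))

The table that follows lists the evaluation observations together with their predicted classifications (target_pred).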
| zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | medv | target_pred |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.07 | 0 | 0.469 | 7.185 | 61.1 | 4.9671 | 2 | 242 | 17.8 | 4.03 | 34.7 | 0 |
| 0 | 8.14 | 0 | 0.538 | 6.096 | 84.5 | 4.4619 | 4 | 307 | 21.0 | 10.26 | 18.2 | 0 |
| 0 | 8.14 | 0 | 0.538 | 6.495 | 94.4 | 4.4547 | 4 | 307 | 21.0 | 12.80 | 18.4 | 0 |
| 0 | 8.14 | 0 | 0.538 | 5.950 | 82.0 | 3.9900 | 4 | 307 | 21.0 | 27.71 | 13.2 | 0 |
| 0 | 5.96 | 0 | 0.499 | 5.850 | 41.5 | 3.9342 | 5 | 279 | 19.2 | 8.77 | 21.0 | 0 |
| 25 | 5.13 | 0 | 0.453 | 5.741 | 66.2 | 7.2254 | 8 | 284 | 19.7 | 13.15 | 18.7 | 0 |
| 25 | 5.13 | 0 | 0.453 | 5.966 | 93.4 | 6.8185 | 8 | 284 | 19.7 | 14.44 | 16.0 | 1 |
| 0 | 4.49 | 0 | 0.449 | 6.630 | 56.1 | 4.4377 | 3 | 247 | 18.5 | 6.53 | 26.6 | 0 |
| 0 | 4.49 | 0 | 0.449 | 6.121 | 56.8 | 3.7476 | 3 | 247 | 18.5 | 8.44 | 22.2 | 0 |
| 0 | 2.89 | 0 | 0.445 | 6.163 | 69.6 | 3.4952 | 2 | 276 | 18.0 | 11.34 | 21.4 | 0 |
| 0 | 25.65 | 0 | 0.581 | 5.856 | 97.0 | 1.9444 | 2 | 188 | 19.1 | 25.41 | 17.3 | 0 |
| 0 | 25.65 | 0 | 0.581 | 5.613 | 95.6 | 1.7572 | 2 | 188 | 19.1 | 27.26 | 15.7 | 0 |
| 0 | 21.89 | 0 | 0.624 | 5.637 | 94.7 | 1.9799 | 4 | 437 | 21.2 | 18.34 | 14.3 | 1 |
| 0 | 19.58 | 0 | 0.605 | 6.101 | 93.0 | 2.2834 | 5 | 403 | 14.7 | 9.81 | 25.0 | 1 |
| 0 | 19.58 | 0 | 0.605 | 5.880 | 97.3 | 2.3887 | 5 | 403 | 14.7 | 12.03 | 19.1 | 1 |
| 0 | 10.59 | 1 | 0.489 | 5.960 | 92.1 | 3.8771 | 4 | 277 | 18.6 | 17.27 | 21.7 | 1 |
| 0 | 6.20 | 0 | 0.504 | 6.552 | 21.4 | 3.3751 | 8 | 307 | 17.4 | 3.76 | 31.5 | 0 |
| 0 | 6.20 | 0 | 0.507 | 8.247 | 70.4 | 3.6519 | 8 | 307 | 17.4 | 3.95 | 48.3 | 1 |
| 22 | 5.86 | 0 | 0.431 | 6.957 | 6.8 | 8.9067 | 7 | 330 | 19.1 | 3.53 | 29.6 | 0 |
| 90 | 2.97 | 0 | 0.400 | 7.088 | 20.8 | 7.3073 | 1 | 285 | 15.3 | 7.85 | 32.2 | 0 |
| 80 | 1.76 | 0 | 0.385 | 6.230 | 31.5 | 9.0892 | 1 | 241 | 18.2 | 12.93 | 20.1 | 0 |
| 33 | 2.18 | 0 | 0.472 | 6.616 | 58.1 | 3.3700 | 7 | 222 | 18.4 | 8.93 | 28.4 | 0 |
| 0 | 9.90 | 0 | 0.544 | 6.122 | 52.8 | 2.6403 | 4 | 304 | 18.4 | 5.98 | 22.1 | 0 |
| 0 | 7.38 | 0 | 0.493 | 6.415 | 40.1 | 4.7211 | 5 | 287 | 19.6 | 6.12 | 25.0 | 0 |
| 0 | 7.38 | 0 | 0.493 | 6.312 | 28.9 | 5.4159 | 5 | 287 | 19.6 | 6.15 | 23.0 | 0 |
| 0 | 5.19 | 0 | 0.515 | 5.895 | 59.6 | 5.6150 | 5 | 224 | 20.2 | 10.56 | 18.5 | 0 |
| 80 | 2.01 | 0 | 0.435 | 6.635 | 29.7 | 8.3440 | 4 | 280 | 17.0 | 5.99 | 24.5 | 0 |
| 0 | 18.10 | 0 | 0.718 | 3.561 | 87.9 | 1.6132 | 24 | 666 | 20.2 | 7.12 | 27.5 | 1 |
| 0 | 18.10 | 1 | 0.631 | 7.016 | 97.5 | 1.2024 | 24 | 666 | 20.2 | 2.96 | 50.0 | 1 |
| 0 | 18.10 | 0 | 0.584 | 6.348 | 86.1 | 2.0527 | 24 | 666 | 20.2 | 17.64 | 14.5 | 1 |
| 0 | 18.10 | 0 | 0.740 | 5.935 | 87.9 | 1.8206 | 24 | 666 | 20.2 | 34.02 | 8.4 | 1 |
| 0 | 18.10 | 0 | 0.740 | 5.627 | 93.9 | 1.8172 | 24 | 666 | 20.2 | 22.88 | 12.8 | 1 |
| 0 | 18.10 | 0 | 0.740 | 5.818 | 92.4 | 1.8662 | 24 | 666 | 20.2 | 22.11 | 10.5 | 1 |
| 0 | 18.10 | 0 | 0.740 | 6.219 | 100.0 | 2.0048 | 24 | 666 | 20.2 | 16.59 | 18.4 | 1 |
| 0 | 18.10 | 0 | 0.740 | 5.854 | 96.6 | 1.8956 | 24 | 666 | 20.2 | 23.79 | 10.8 | 1 |
| 0 | 18.10 | 0 | 0.713 | 6.525 | 86.5 | 2.4358 | 24 | 666 | 20.2 | 18.13 | 14.1 | 1 |
| 0 | 18.10 | 0 | 0.713 | 6.376 | 88.4 | 2.5671 | 24 | 666 | 20.2 | 14.65 | 17.7 | 1 |
| 0 | 18.10 | 0 | 0.655 | 6.209 | 65.4 | 2.9634 | 24 | 666 | 20.2 | 13.22 | 21.4 | 1 |
| 0 | 9.69 | 0 | 0.585 | 5.794 | 70.6 | 2.8927 | 6 | 391 | 19.2 | 14.10 | 18.3 | 0 |
| 0 | 11.93 | 0 | 0.573 | 6.976 | 91.0 | 2.1675 | 1 | 273 | 21.0 | 5.64 | 23.9 | 0 |
In this analysis, we developed and compared multiple logistic regression models to predict whether a neighborhood has a high crime rate. The exploratory data analysis showed that the dataset was clean and suitable for modeling, with no missing values and manageable correlations among variables after preprocessing.
To improve model stability, highly correlated predictors were removed, reducing multicollinearity and ensuring more reliable coefficient estimates. We then built three models: a baseline model with a single predictor, a full model with all predictors, and stepwise models using both forward and backward selection.
The results demonstrated that the stepwise selection approach provided the best-performing model, achieving the lowest AIC value while maintaining a balance between model complexity and predictive power. Interestingly, both forward and backward methods converged to the same final set of predictors (zn, rad, age, ptratio, indus, and rm), reinforcing the robustness of the selected model.
Using this final model, we generated predicted probabilities and classifications for the evaluation dataset. The model effectively distinguishes between high- and low-crime neighborhoods based on the selected features, making it a practical tool for classification.
Overall, this study highlights the importance of proper data preprocessing, careful variable selection, and model comparison. The final logistic regression model provides a reliable and interpretable approach for predicting crime risk, though future improvements could include testing alternative models or validating performance using additional datasets.