DATA 621 - Homework 3 - 11 Apr 2026

Bikash Bhowmik, Rupendra Shrestha, Anthony Josue Roman, Jerald Melukkaran

Introduction

In this assignment, you will explore, analyze, and model a dataset containing crime information for neighborhoods in a major city. Each observation includes a response variable indicating whether the neighborhood’s crime rate is above the median (1) or not (0).

Your goal is to develop a binary logistic regression model using the training dataset to predict whether a neighborhood is at high risk for crime. You will then use this model to generate both predicted classifications and probabilities for the evaluation dataset. Only the variables provided (or those derived from them) may be used in your analysis. A brief description of the key variables is provided below.

Exploratory Data Analysis

In this section, we explore the dataset and become familiar with the features that will be used in the model-building process. The training dataset consists of 466 observations, including 1 target variable and 12 predictor variables. To begin, we examine the distribution of each predictor variable to identify any patterns or potential issues and determine the appropriate steps needed to build an effective model.

Distribution of Predictor Variables

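library(dplyr)       # %>%, summarise(), across(), mutate(), select()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
library(corrplot)    # corrplot()
library(GGally)      # ggpairs()
library(car)         # vif()

train_df <- read.csv("crime-training-data_modified.csv")
eval_df  <- read.csv("crime-evaluation-data_modified.csv")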

# Summary statistics: mean, median, sd
summary_stats <- train_df %>%
  summarise(across(where(is.numeric),
                   list(
                     mean = ~mean(.),
                     median = ~median(.),
                     sd = ~sd(.)
                   ),
                   .names = "{.col}_{.fn}"))

kable(t(summary_stats), col.names = "Value") %>%
  kable_styling(full_width = TRUE)
Value
zn_mean 11.5772532
zn_median 0.0000000
zn_sd 23.3646511
indus_mean 11.1050215
indus_median 9.6900000
indus_sd 6.8458549
chas_mean 0.0708155
chas_median 0.0000000
chas_sd 0.2567920
nox_mean 0.5543105
nox_median 0.5380000
nox_sd 0.1166667
rm_mean 6.2906738
rm_median 6.2100000
rm_sd 0.7048513
age_mean 68.3675966
age_median 77.1500000
age_sd 28.3213784
dis_mean 3.7956929
dis_median 3.1909500
dis_sd 2.1069496
rad_mean 9.5300429
rad_median 5.0000000
rad_sd 8.6859272
tax_mean 409.5021459
tax_median 334.5000000
tax_sd 167.9000887
ptratio_mean 18.3984979
ptratio_median 18.9000000
ptratio_sd 2.1968447
lstat_mean 12.6314592
lstat_median 11.3500000
lstat_sd 7.1018907
medv_mean 22.5892704
medv_median 21.2000000
medv_sd 9.2396814
target_mean 0.4914163
target_median 0.0000000
target_sd 0.5004636

To better understand the distribution of the predictor variables, we computed the mean, median, and standard deviation of each numeric variable. Several variables show large gaps between mean and median: zn has a mean of 11.58 but a median of 0, rad a mean of 9.53 against a median of 5, and tax a mean of 409.5 against a median of 334.5, all pointing to right skew or a cluster of high values. In contrast, age has a mean (68.37) below its median (77.15), which suggests left skew. These asymmetries guide preprocessing decisions and model selection by flagging variables that may require transformation or careful interpretation.
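The mean-median comparison described above can be put on a common scale by dividing each gap by the standard deviation. The following is a minimal sketch using values copied from the summary table; the `gap` column and the choice of four variables are illustrative assumptions, not part of the original analysis.

```r
# Summary statistics copied from the table above (four illustrative variables)
stats <- data.frame(
  variable = c("zn", "rad", "tax", "age"),
  mean     = c(11.5772532, 9.5300429, 409.5021459, 68.3675966),
  median   = c(0, 5, 334.5, 77.15),
  sd       = c(23.3646511, 8.6859272, 167.9000887, 28.3213784)
)

# Standardized mean-median gap: positive suggests right skew, negative left skew
stats$gap <- (stats$mean - stats$median) / stats$sd
print(stats[order(-abs(stats$gap)), c("variable", "gap")])
```

A positive gap for zn, rad, and tax and a negative gap for age would agree in sign with the skew column reported in the descriptive table further down.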

par(mfrow = c(4, 4), mar = c(3, 3, 1, 1))

for (col_name in names(train_df)) {
    hist(train_df[[col_name]], main = paste(col_name), xlab = "Value")
}

par(mfrow = c(1, 1))

kable(tidy(train_df), "pipe")
column n mean sd median trimmed mad min max range skew kurtosis se
zn 466 11.5772532 23.3646511 0.00000 5.3542781 0.0000 0.0000 100.0000 100.0000 2.1838409 6.842914 1.0823466
indus 466 11.1050215 6.8458549 9.69000 10.9082353 6.3000 0.4600 27.7400 27.2800 0.2894763 1.764351 0.3171281
chas 466 0.0708155 0.2567920 0.00000 0.0000000 0.0000 0.0000 1.0000 1.0000 3.3462553 12.197425 0.0118957
nox 466 0.5543105 0.1166667 0.53800 0.5442684 0.0900 0.3890 0.8710 0.4820 0.7487369 2.976990 0.0054045
rm 466 6.2906738 0.7048513 6.21000 6.2570615 0.3485 3.8630 8.7800 4.9170 0.4808673 4.561996 0.0326516
age 466 68.3675966 28.3213784 77.15000 70.9553476 20.2500 2.9000 100.0000 97.1000 -0.5795721 1.998687 1.3119625
dis 466 3.7956929 2.1069496 3.19095 3.5443647 1.2913 1.1296 12.1265 10.9969 1.0021166 3.486917 0.0976026
rad 466 9.5300429 8.6859272 5.00000 8.6978610 1.0000 1.0000 24.0000 23.0000 1.0135395 2.147295 0.4023678
tax 466 409.5021459 167.9000887 334.50000 401.5080214 70.5000 187.0000 711.0000 524.0000 0.6614416 1.859928 7.7778214
ptratio 466 18.3984979 2.1968447 18.90000 18.5970588 1.3000 12.6000 22.0000 9.4000 -0.7567025 2.610831 0.1017669
lstat 466 12.6314592 7.1018907 11.35000 11.8809626 4.7700 1.7300 37.9700 36.2400 0.9085092 3.518453 0.3289887
medv 466 22.5892704 9.2396814 21.20000 21.6304813 4.0500 5.0000 50.0000 45.0000 1.0801670 4.392615 0.4280200
target 466 0.4914163 0.5004636 0.00000 0.4893048 0.0000 0.0000 1.0000 1.0000 0.0343398 1.001179 0.0231835
show_summary <- function(df) {
    cat(rep("+", 50), "\n")
    cat(paste("DIMENSIONS : (", nrow(df), ", ", ncol(df), ")\n", sep = ""), "\n")
    cat(rep("+", 50), "\n")
    cat("COLUMNS:\n", "\n")
    cat(names(df), "\n")
    cat(rep("+", 50), "\n")
    cat("DATA INFO:\n", "\n")
    cat(sapply(df, class), "\n")
    cat(rep("+", 50), "\n")
    cat("MISSING VALUES:\n", "\n")
    cat(colSums(is.na(df)), "\n")
}

show_summary(train_df)
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
DIMENSIONS : (466, 13)
 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
COLUMNS:
 
zn indus chas nox rm age dis rad tax ptratio lstat medv target 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
DATA INFO:
 
numeric numeric integer numeric numeric numeric numeric integer integer numeric numeric numeric integer 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
MISSING VALUES:
 
0 0 0 0 0 0 0 0 0 0 0 0 0 

The dataset is clean, with no missing values in any of the 13 columns. Every variable is stored as numeric or integer, so no type conversion is needed. The next step is to examine the predictors for multicollinearity.

# correlation plot
correlation_Matrix <- cor(train_df[, 1:12])

corrplot(correlation_Matrix, method = "color", type = "upper", addCoef.col = "black",
    number.cex = 0.7)

To address multicollinearity, we remove variables involved in high pairwise correlations, using a threshold of 0.7 in absolute value. The correlation analysis shows that dis is highly negatively correlated with indus, nox, and age. Based on the correlation matrix, we remove dis, tax, medv, and nox to reduce multicollinearity and improve model stability.
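Reading pairs off a correlation plot by eye is error-prone, so the threshold rule can also be applied programmatically. Below is a rough sketch; `high_cor_pairs` is a hypothetical helper introduced here for illustration, and the data are simulated rather than the crime dataset.

```r
# Hypothetical helper: list predictor pairs whose |correlation| exceeds a threshold
high_cor_pairs <- function(df, threshold = 0.7) {
  cm  <- cor(df)
  # upper.tri() keeps each pair once and skips the diagonal
  idx <- which(abs(cm) > threshold & upper.tri(cm), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
}

# Simulated stand-in data (not the crime dataset): b tracks a, c is independent
set.seed(1)
a   <- rnorm(200)
sim <- data.frame(a = a, b = a + rnorm(200, sd = 0.3), c = rnorm(200))
pairs_found <- high_cor_pairs(sim)
print(pairs_found)
```

On the real data the same helper could be called as `high_cor_pairs(train_df[, 1:12])`, assuming the first 12 columns are the predictors as in the correlation plot code.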

train_df <- train_df |>
    mutate(crime = ifelse(target == 1, "high", "low"))

df_2 <- train_df |>
    select(indus, nox, age, dis, crime)
ggpairs(data = df_2, columns = 1:4, ggplot2::aes(color = crime))

train_df <- train_df |>
    select(-dis, -tax, -medv, -nox)

Data Preprocessing

During preprocessing, we examined correlations among predictor variables to identify multicollinearity. Variables such as dis, tax, medv, and nox showed high correlations (above 0.7) with other predictors, which can lead to unstable coefficient estimates in logistic regression. Therefore, these variables were removed to improve model stability and interpretability.

After removal, the correlation structure was rechecked and showed reduced multicollinearity. Additionally, no missing values were found in the dataset, so no imputation was required, and all variables were already in appropriate numeric format.

correlation_Matrix <- cor(train_df[, 1:8])

corrplot(correlation_Matrix, method = "color", type = "upper", addCoef.col = "black",
    number.cex = 0.7)

train_df <- train_df |>
    select(-crime)

Model Development

We start with a simple logistic model and then build toward the best-fitting one.

Baseline Logistic Regression Model

This model uses a single predictor, zn, and sets a baseline AIC of 522.46. We expect it to have the worst (highest) AIC of the models considered.

simple_model <- glm(target ~ zn, data = train_df, family = "binomial")
summary(simple_model)

Call:
glm(formula = target ~ zn, family = "binomial", data = train_df)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.54503    0.11170   4.879 1.06e-06 ***
zn          -0.09176    0.01349  -6.804 1.02e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 645.88  on 465  degrees of freedom
Residual deviance: 518.46  on 464  degrees of freedom
AIC: 522.46

Number of Fisher Scoring iterations: 6
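To make the baseline coefficients concrete, the fitted log-odds can be converted to probabilities. This is a small sketch that plugs in the estimates printed above; the three zn values are arbitrary illustrative choices.

```r
# Coefficients copied from the summary(simple_model) output above
b0 <- 0.54503
b1 <- -0.09176

# plogis() is the inverse logit: 1 / (1 + exp(-x))
zn_values <- c(0, 25, 80)
p_high    <- plogis(b0 + b1 * zn_values)
print(round(data.frame(zn = zn_values, p_high_crime = p_high), 3))
```

Because the zn coefficient is negative, the estimated probability of an above-median crime rate falls as the share of land zoned for large lots rises.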

Full Model (All Predictors)

For the second model, we include all predictors that remain after removing the highly correlated ones. Its AIC of 294.49 is far lower than the baseline, though not necessarily the best achievable.

everything_model <- glm(target ~ ., data = train_df, family = "binomial")
summary(everything_model)

Call:
glm(formula = target ~ ., family = "binomial", data = train_df)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -8.640771   3.035549  -2.847  0.00442 ** 
zn          -0.056399   0.019159  -2.944  0.00324 ** 
indus        0.049239   0.027236   1.808  0.07063 .  
chas         0.168888   0.584261   0.289  0.77253    
rm           0.693054   0.349282   1.984  0.04723 *  
age          0.035598   0.008583   4.148 3.36e-05 ***
rad          0.483674   0.113763   4.252 2.12e-05 ***
ptratio     -0.102886   0.071228  -1.444  0.14861    
lstat        0.043899   0.041213   1.065  0.28680    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 645.88  on 465  degrees of freedom
Residual deviance: 276.49  on 457  degrees of freedom
AIC: 294.49

Number of Fisher Scoring iterations: 8

After building the full model, we used the vif() function to assess multicollinearity among the predictors. All values fall below 3 (the largest are lstat at 2.70 and rm at 2.50), indicating relatively low multicollinearity.

data.frame(vif(everything_model))
        vif.everything_model.
zn                   1.464757
indus                1.494374
chas                 1.063072
rm                   2.501189
age                  1.560749
rad                  1.142231
ptratio              1.289813
lstat                2.698502
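For intuition, the classic variance inflation factor for predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on the others. The sketch below implements this textbook definition on simulated data; `manual_vif` is a hypothetical helper, and for a glm the values reported by vif() are computed from the fitted model, so they need not match this unweighted calculation exactly.

```r
# Hypothetical helper: VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from
# regressing predictor j on all the other predictors
manual_vif <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    fit    <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Simulated stand-in data: x2 is built from x1, x3 is independent
set.seed(42)
x1   <- rnorm(300)
sim  <- data.frame(x1 = x1, x2 = x1 + rnorm(300, sd = 0.5), x3 = rnorm(300))
vifs <- manual_vif(sim)
print(round(vifs, 2))
```

The correlated pair x1 and x2 should show clearly inflated values while x3 stays close to 1, which is the pattern a low-multicollinearity model avoids.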

Backward Stepwise Selection

step(everything_model, direction = "backward", scope = formula(everything_model))
Start:  AIC=294.49
target ~ zn + indus + chas + rm + age + rad + ptratio + lstat

          Df Deviance    AIC
- chas     1   276.57 292.57
- lstat    1   277.62 293.62
<none>         276.49 294.49
- ptratio  1   278.61 294.61
- indus    1   279.81 295.81
- rm       1   280.49 296.49
- zn       1   288.67 304.67
- age      1   297.28 313.28
- rad      1   367.79 383.79

Step:  AIC=292.57
target ~ zn + indus + rm + age + rad + ptratio + lstat

          Df Deviance    AIC
- lstat    1   277.73 291.73
<none>         276.57 292.57
- ptratio  1   278.95 292.95
- indus    1   279.99 293.99
- rm       1   280.57 294.57
- zn       1   289.10 303.10
- age      1   297.37 311.37
- rad      1   368.22 382.22

Step:  AIC=291.73
target ~ zn + indus + rm + age + rad + ptratio

          Df Deviance    AIC
<none>         277.73 291.73
- ptratio  1   280.04 292.04
- rm       1   280.74 292.74
- indus    1   281.64 293.64
- zn       1   289.48 301.48
- age      1   312.25 324.25
- rad      1   373.55 385.55

Call:  glm(formula = target ~ zn + indus + rm + age + rad + ptratio, 
    family = "binomial", data = train_df)

Coefficients:
(Intercept)           zn        indus           rm          age          rad  
   -6.83783     -0.05417      0.05294      0.43959      0.03994      0.48680  
    ptratio  
   -0.10514  

Degrees of Freedom: 465 Total (i.e. Null);  459 Residual
Null Deviance:      645.9 
Residual Deviance: 277.7    AIC: 291.7
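The deviance column in the first backward step also supports a likelihood-ratio reading of each candidate removal. The following sketch uses the deviances printed above and assumes each term costs one degree of freedom, as the Df column shows.

```r
# Deviances copied from the first backward step above
dev_full <- 276.49
dev_drop <- c(chas = 276.57, lstat = 277.62, ptratio = 278.61, indus = 279.81,
              rm = 280.49, zn = 288.67, age = 297.28, rad = 367.79)

lr_stat    <- dev_drop - dev_full      # likelihood-ratio statistic, 1 df each
p_value    <- pchisq(lr_stat, df = 1, lower.tail = FALSE)
aic_change <- lr_stat - 2              # AIC after dropping minus AIC before

print(round(data.frame(lr_stat, p_value, aic_change), 4))
```

A term lowers the AIC when dropped only if its likelihood-ratio statistic is below 2, which is why the step table lists just chas and lstat above the `<none>` row.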

Forward Stepwise Selection

step(simple_model, direction = "forward", scope = formula(everything_model))
Start:  AIC=522.46
target ~ zn

          Df Deviance    AIC
+ rad      1   344.89 350.89
+ age      1   407.45 413.45
+ indus    1   432.03 438.03
+ lstat    1   473.70 479.70
+ chas     1   516.06 522.06
<none>         518.46 522.46
+ ptratio  1   517.42 523.42
+ rm       1   518.37 524.37

Step:  AIC=350.89
target ~ zn + rad

          Df Deviance    AIC
+ age      1   286.93 294.93
+ indus    1   325.11 333.11
+ ptratio  1   334.65 342.65
+ lstat    1   336.31 344.31
+ chas     1   342.50 350.50
+ rm       1   342.80 350.80
<none>         344.89 350.89

Step:  AIC=294.93
target ~ zn + rad + age

          Df Deviance    AIC
+ ptratio  1   283.11 293.11
+ rm       1   284.19 294.19
+ indus    1   284.80 294.80
<none>         286.93 294.93
+ chas     1   286.28 296.28
+ lstat    1   286.80 296.80

Step:  AIC=293.11
target ~ zn + rad + age + ptratio

        Df Deviance    AIC
+ indus  1   280.74 292.74
<none>       283.11 293.11
+ rm     1   281.64 293.64
+ chas   1   282.94 294.94
+ lstat  1   283.11 295.11

Step:  AIC=292.74
target ~ zn + rad + age + ptratio + indus

        Df Deviance    AIC
+ rm     1   277.73 291.73
<none>       280.74 292.74
+ lstat  1   280.57 294.57
+ chas   1   280.67 294.67

Step:  AIC=291.73
target ~ zn + rad + age + ptratio + indus + rm

        Df Deviance    AIC
<none>       277.73 291.73
+ lstat  1   276.57 292.57
+ chas   1   277.62 293.62

Call:  glm(formula = target ~ zn + rad + age + ptratio + indus + rm, 
    family = "binomial", data = train_df)

Coefficients:
(Intercept)           zn          rad          age      ptratio        indus  
   -6.83783     -0.05417      0.48680      0.03994     -0.10514      0.05294  
         rm  
    0.43959  

Degrees of Freedom: 465 Total (i.e. Null);  459 Residual
Null Deviance:      645.9 
Residual Deviance: 277.7    AIC: 291.7

Forward and backward selection arrive at the same logit model, target ~ zn + rad + age + ptratio + indus + rm, with an AIC of 291.73.
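The AIC values reported for the three models can be checked by hand. For ungrouped binary responses the saturated log-likelihood is zero, so the residual deviance D equals minus twice the maximized log-likelihood, and with k estimated coefficients (intercept included):

```latex
\mathrm{AIC} = -2\log\hat{L} + 2k = D + 2k

\begin{aligned}
\text{baseline } (k = 2):&\quad 518.46 + 2(2) = 522.46\\
\text{full } (k = 9):&\quad 276.49 + 2(9) = 294.49\\
\text{stepwise } (k = 7):&\quad 277.73 + 2(7) = 291.73
\end{aligned}
```

The stepwise model gives up 1.24 units of deviance relative to the full model but saves two parameters, which is what produces its lower AIC.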

Model Evaluation and Selection

The stepwise model has the lowest AIC (291.73), indicating the best trade-off between goodness of fit and model complexity among the models considered. As expected, the baseline model performed worst (AIC 522.46), since a single predictor captures little of the variation in the target. The full model (AIC 294.49) was a large improvement but carried predictors that add little. The third model, developed using stepwise selection, achieved the lowest AIC; forward selection and backward elimination led to the same final model, which is not guaranteed in general. Based on these results, the stepwise model is used to predict the target variable in the evaluation dataset.

step_mod <- glm(target ~ zn + rad + age + ptratio + indus + rm, family = binomial,
    data = train_df)

summary(step_mod)

Call:
glm(formula = target ~ zn + rad + age + ptratio + indus + rm, 
    family = binomial, data = train_df)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -6.837830   2.547577  -2.684  0.00727 ** 
zn          -0.054166   0.018564  -2.918  0.00353 ** 
rad          0.486803   0.111965   4.348 1.38e-05 ***
age          0.039941   0.007694   5.191 2.09e-07 ***
ptratio     -0.105136   0.069678  -1.509  0.13133    
indus        0.052943   0.027031   1.959  0.05016 .  
rm           0.439592   0.257184   1.709  0.08740 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 645.88  on 465  degrees of freedom
Residual deviance: 277.73  on 459  degrees of freedom
AIC: 291.73

Number of Fisher Scoring iterations: 8

The final logistic regression model includes the predictors zn, rad, age, ptratio, indus, and rm. The coefficients represent changes in the log-odds of a neighborhood being classified as high crime.
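Written out with the estimates from the summary above (rounded to three decimals), the fitted model for the probability p-hat that a neighborhood is above the median crime rate is:

```latex
\log\frac{\hat{p}}{1-\hat{p}} =
  -6.838
  - 0.054\,\mathrm{zn}
  + 0.487\,\mathrm{rad}
  + 0.040\,\mathrm{age}
  - 0.105\,\mathrm{ptratio}
  + 0.053\,\mathrm{indus}
  + 0.440\,\mathrm{rm}
```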

zn and ptratio have negative coefficients, indicating that higher values are associated with lower odds of high crime. In contrast, rad, age, indus, and rm have positive coefficients, so higher values increase the estimated odds. Only zn, rad, and age are significant at the 0.05 level; indus (p = 0.050), rm (p = 0.087), and ptratio (p = 0.131) are not, so their signs should be read with caution.

Overall, the signs of the significant coefficients are intuitive and align with expected relationships.
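Log-odds are hard to read directly, so it can help to exponentiate the estimates into odds ratios. This is a minimal sketch using the coefficients copied from the summary output; it omits confidence intervals, which would need the fitted model object.

```r
# Estimates copied from the summary(step_mod) output above
coefs <- c(zn = -0.054166, rad = 0.486803, age = 0.039941,
           ptratio = -0.105136, indus = 0.052943, rm = 0.439592)

# exp(coefficient) is the multiplicative change in the odds of target = 1
# for a one-unit increase in the predictor, other predictors held fixed
odds_ratio <- exp(coefs)
print(round(odds_ratio, 3))
```

An odds ratio above 1 means the odds of the high-crime class rise with the predictor, and a value below 1 means they fall. With the fitted model in memory, `exp(coef(step_mod))` would give the same quantities.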

predictions <- predict(step_mod, eval_df, type = "response")

threshold <- 0.5

binary_predictions <- ifelse(predictions >= threshold, 1, 0)

eval_df$target_pred <- binary_predictions
model_names <- c("simple", "everything", "step")
aic_values <- c(simple_model$aic, everything_model$aic, step_mod$aic)

kable(cbind(model_names, aic_values), col.names = c("Model Name", "AIC Value")) |>
    kable_styling(full_width = T)
Model Name AIC Value
simple 522.464452017651
everything 294.486574605442
step 291.726687224588
library(caret)
library(pROC)

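# Convert to factor for evaluation
# NOTE: eval_df has no column named `target`; `$` partially matches
# `target_pred`, so `actual` is a copy of the predictions themselves
actual <- as.factor(eval_df$target)
predicted <- as.factor(binary_predictions)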

# Confusion Matrix
conf_mat <- confusionMatrix(predicted, actual)
conf_mat
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 23  0
         1  0 17
                                     
               Accuracy : 1          
                 95% CI : (0.9119, 1)
    No Information Rate : 0.575      
    P-Value [Acc > NIR] : 2.436e-10  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.000      
            Specificity : 1.000      
         Pos Pred Value : 1.000      
         Neg Pred Value : 1.000      
             Prevalence : 0.575      
         Detection Rate : 0.575      
   Detection Prevalence : 0.575      
      Balanced Accuracy : 1.000      
                                     
       'Positive' Class : 0          
                                     
# Extract key metrics
accuracy  <- conf_mat$overall['Accuracy']
precision <- conf_mat$byClass['Precision']
recall    <- conf_mat$byClass['Sensitivity']
specificity <- conf_mat$byClass['Specificity']
f1_score  <- conf_mat$byClass['F1']

# ROC and AUC
roc_obj <- roc(actual, predictions)
auc_value <- auc(roc_obj)

# Print results
metrics <- data.frame(
  Metric = c("Accuracy", "Precision", "Recall (Sensitivity)", 
             "Specificity", "F1 Score", "AUC"),
  Value = c(accuracy, precision, recall, specificity, f1_score, auc_value)
)

kable(metrics) %>%
  kable_styling(full_width = TRUE)
Metric Value
Accuracy Accuracy 1
Precision Precision 1
Sensitivity Recall (Sensitivity) 1
Specificity Specificity 1
F1 F1 Score 1
AUC 1
# Plot ROC Curve
plot(roc_obj, main = "ROC Curve")

To evaluate the final logistic regression model, we computed a confusion matrix, precision, recall, F1 score, and the ROC curve with its AUC. However, every metric above equals exactly 1, and this is an artifact rather than evidence of performance. The evaluation dataset has no target column (see the Prediction Results table), so eval_df$target resolves through R's partial name matching to target_pred. The confusion matrix therefore compares the 40 predictions with themselves (23 zeros and 17 ones), which guarantees perfect agreement, and the AUC of 1 follows because the 0.5 threshold separates the predicted probabilities by construction. Note also that confusionMatrix() treated class 0 as the positive class, so precision and recall as reported describe the low-crime class.

A valid assessment requires labeled data that the model did not see during fitting, for example a holdout split of the training set or cross-validation. Until that is done, the AIC comparison is the only evidence supporting the selected model.
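Since the evaluation file carries no observed target, metrics of this kind are only meaningful on labeled rows held out from fitting. Below is a rough base-R sketch of that workflow on simulated data; the variable names, the 75/25 split, and the seed are all illustrative assumptions. Its first lines also show how `$` partial matching can silently return a different column than the one named.

```r
# `$` on a data frame partially matches column names; `[[` does not
demo <- data.frame(target_pred = c(0, 1, 1))
print(demo$target)       # returns the target_pred column
print(demo[["target"]])  # NULL: no column is named exactly "target"

# Simulated labeled data standing in for the training set (not the crime data)
set.seed(621)
n   <- 400
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$target <- rbinom(n, 1, plogis(-0.5 + 1.5 * sim$x1 - 1.0 * sim$x2))

# Hold out 25% of the labeled rows for testing
test_idx <- sample(n, size = n / 4)
fit <- glm(target ~ x1 + x2, data = sim[-test_idx, ], family = binomial)

test   <- sim[test_idx, ]
prob   <- predict(fit, newdata = test, type = "response")
pred   <- factor(ifelse(prob >= 0.5, 1, 0), levels = c(0, 1))
actual <- factor(test[["target"]], levels = c(0, 1))

cm <- table(Predicted = pred, Actual = actual)
accuracy    <- sum(diag(cm)) / sum(cm)
sensitivity <- cm["1", "1"] / sum(cm[, "1"])   # class 1 treated as positive
specificity <- cm["0", "0"] / sum(cm[, "0"])

# AUC via the rank (Mann-Whitney) formula
r   <- rank(prob)
n1  <- sum(actual == "1")
n0  <- sum(actual == "0")
auc <- (sum(r[actual == "1"]) - n1 * (n1 + 1) / 2) / (n1 * n0)

print(cm)
print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, auc = auc), 3))
```

With caret, the equivalent call would pass `positive = "1"` to `confusionMatrix()` so that sensitivity and precision describe the high-crime class.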

Prediction Results

kable(eval_df) |>
    kable_styling(full_width = T)
zn indus chas nox rm age dis rad tax ptratio lstat medv target_pred
0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 4.03 34.7 0
0 8.14 0 0.538 6.096 84.5 4.4619 4 307 21.0 10.26 18.2 0
0 8.14 0 0.538 6.495 94.4 4.4547 4 307 21.0 12.80 18.4 0
0 8.14 0 0.538 5.950 82.0 3.9900 4 307 21.0 27.71 13.2 0
0 5.96 0 0.499 5.850 41.5 3.9342 5 279 19.2 8.77 21.0 0
25 5.13 0 0.453 5.741 66.2 7.2254 8 284 19.7 13.15 18.7 0
25 5.13 0 0.453 5.966 93.4 6.8185 8 284 19.7 14.44 16.0 1
0 4.49 0 0.449 6.630 56.1 4.4377 3 247 18.5 6.53 26.6 0
0 4.49 0 0.449 6.121 56.8 3.7476 3 247 18.5 8.44 22.2 0
0 2.89 0 0.445 6.163 69.6 3.4952 2 276 18.0 11.34 21.4 0
0 25.65 0 0.581 5.856 97.0 1.9444 2 188 19.1 25.41 17.3 0
0 25.65 0 0.581 5.613 95.6 1.7572 2 188 19.1 27.26 15.7 0
0 21.89 0 0.624 5.637 94.7 1.9799 4 437 21.2 18.34 14.3 1
0 19.58 0 0.605 6.101 93.0 2.2834 5 403 14.7 9.81 25.0 1
0 19.58 0 0.605 5.880 97.3 2.3887 5 403 14.7 12.03 19.1 1
0 10.59 1 0.489 5.960 92.1 3.8771 4 277 18.6 17.27 21.7 1
0 6.20 0 0.504 6.552 21.4 3.3751 8 307 17.4 3.76 31.5 0
0 6.20 0 0.507 8.247 70.4 3.6519 8 307 17.4 3.95 48.3 1
22 5.86 0 0.431 6.957 6.8 8.9067 7 330 19.1 3.53 29.6 0
90 2.97 0 0.400 7.088 20.8 7.3073 1 285 15.3 7.85 32.2 0
80 1.76 0 0.385 6.230 31.5 9.0892 1 241 18.2 12.93 20.1 0
33 2.18 0 0.472 6.616 58.1 3.3700 7 222 18.4 8.93 28.4 0
0 9.90 0 0.544 6.122 52.8 2.6403 4 304 18.4 5.98 22.1 0
0 7.38 0 0.493 6.415 40.1 4.7211 5 287 19.6 6.12 25.0 0
0 7.38 0 0.493 6.312 28.9 5.4159 5 287 19.6 6.15 23.0 0
0 5.19 0 0.515 5.895 59.6 5.6150 5 224 20.2 10.56 18.5 0
80 2.01 0 0.435 6.635 29.7 8.3440 4 280 17.0 5.99 24.5 0
0 18.10 0 0.718 3.561 87.9 1.6132 24 666 20.2 7.12 27.5 1
0 18.10 1 0.631 7.016 97.5 1.2024 24 666 20.2 2.96 50.0 1
0 18.10 0 0.584 6.348 86.1 2.0527 24 666 20.2 17.64 14.5 1
0 18.10 0 0.740 5.935 87.9 1.8206 24 666 20.2 34.02 8.4 1
0 18.10 0 0.740 5.627 93.9 1.8172 24 666 20.2 22.88 12.8 1
0 18.10 0 0.740 5.818 92.4 1.8662 24 666 20.2 22.11 10.5 1
0 18.10 0 0.740 6.219 100.0 2.0048 24 666 20.2 16.59 18.4 1
0 18.10 0 0.740 5.854 96.6 1.8956 24 666 20.2 23.79 10.8 1
0 18.10 0 0.713 6.525 86.5 2.4358 24 666 20.2 18.13 14.1 1
0 18.10 0 0.713 6.376 88.4 2.5671 24 666 20.2 14.65 17.7 1
0 18.10 0 0.655 6.209 65.4 2.9634 24 666 20.2 13.22 21.4 1
0 9.69 0 0.585 5.794 70.6 2.8927 6 391 19.2 14.10 18.3 0
0 11.93 0 0.573 6.976 91.0 2.1675 1 273 21.0 5.64 23.9 0

Final Conclusions

In this analysis, we developed and compared multiple logistic regression models to predict whether a neighborhood has a high crime rate. The exploratory data analysis showed that the dataset was clean and suitable for modeling, with no missing values and manageable correlations among variables after preprocessing.

To improve model stability, highly correlated predictors were removed, reducing multicollinearity and ensuring more reliable coefficient estimates. We then built three models: a baseline model with a single predictor, a full model with all predictors, and stepwise models using both forward and backward selection.

The results demonstrated that the stepwise selection approach provided the best-performing model, achieving the lowest AIC value while maintaining a balance between model complexity and predictive power. Interestingly, both forward and backward methods converged to the same final set of predictors (zn, rad, age, ptratio, indus, and rm), reinforcing the robustness of the selected model.

Using this final model, we generated predicted probabilities and classifications for the evaluation dataset; at a 0.5 threshold, 17 of the 40 neighborhoods are classified as high crime. Because the evaluation data contain no observed target, these predictions could not be scored against true labels, so the model's out-of-sample accuracy remains unmeasured.

Overall, this study highlights the importance of proper data preprocessing, careful variable selection, and model comparison. The final logistic regression model provides an interpretable approach for predicting crime risk; future work should validate it on held-out labeled data and could test alternative models.