Homework-3-Group-2.knit

Introduction

In this assignment, you will explore, analyze, and model a dataset containing crime information for neighborhoods in a major city. Each observation includes a response variable indicating whether the neighborhood’s crime rate is above the median (1) or not (0).

Your goal is to develop a binary logistic regression model using the training dataset to predict whether a neighborhood is at high risk for crime. You will then use this model to generate both predicted classifications and probabilities for the evaluation dataset. Only the variables provided (or those derived from them) may be used in your analysis. A brief description of the key variables is provided below.

Exploratory Data Analysis

In this section, we explore the dataset and become familiar with the features that will be used in the model-building process. The training dataset consists of 466 observations, including 1 target variable and 12 predictor variables. To begin, we examine the distribution of each predictor variable to identify any patterns or potential issues and determine the appropriate steps needed to build an effective model.

Distribution of Predictor Variables

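# Packages assumed from the functions called in this report
library(dplyr)       # %>%, summarise(), across(), mutate(), select()
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
library(broom)       # tidy()
library(corrplot)    # corrplot()
library(GGally)      # ggpairs()
library(car)         # vif()

# Load the training and evaluation datasets
train_df <- read.csv("crime-training-data_modified.csv")
eval_df  <- read.csv("crime-evaluation-data_modified.csv")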

# Summary statistics: mean, median, sd
summary_stats <- train_df %>%
  summarise(across(where(is.numeric),
                   list(
                     mean = ~mean(.),
                     median = ~median(.),
                     sd = ~sd(.)
                   ),
                   .names = "{.col}_{.fn}"))

kable(t(summary_stats), col.names = "Value") %>%
  kable_styling(full_width = TRUE)

	Value
zn_mean	11.5772532
zn_median	0.0000000
zn_sd	23.3646511
indus_mean	11.1050215
indus_median	9.6900000
indus_sd	6.8458549
chas_mean	0.0708155
chas_median	0.0000000
chas_sd	0.2567920
nox_mean	0.5543105
nox_median	0.5380000
nox_sd	0.1166667
rm_mean	6.2906738
rm_median	6.2100000
rm_sd	0.7048513
age_mean	68.3675966
age_median	77.1500000
age_sd	28.3213784
dis_mean	3.7956929
dis_median	3.1909500
dis_sd	2.1069496
rad_mean	9.5300429
rad_median	5.0000000
rad_sd	8.6859272
tax_mean	409.5021459
tax_median	334.5000000
tax_sd	167.9000887
ptratio_mean	18.3984979
ptratio_median	18.9000000
ptratio_sd	2.1968447
lstat_mean	12.6314592
lstat_median	11.3500000
lstat_sd	7.1018907
medv_mean	22.5892704
medv_median	21.2000000
medv_sd	9.2396814
target_mean	0.4914163
target_median	0.0000000
target_sd	0.5004636

To better understand the distribution of the predictor variables, we computed the mean, median, and standard deviation of each numeric variable. Several predictors have means well above their medians, which points to right skew: zn (mean 11.58, median 0), rad (9.53 versus 5), and tax (409.5 versus 334.5). For age the mean (68.37) falls below the median (77.15), indicating left skew. Standard deviations are not directly comparable across variables because the predictors are measured on different scales. These observations guide preprocessing and model selection by flagging variables that may need transformation or careful interpretation.

par(mfrow = c(4, 4), mar = c(3, 3, 1, 1))

for (col_name in names(train_df)) {
    hist(train_df[[col_name]], main = paste(col_name), xlab = "Value")
}

par(mfrow = c(1, 1))

kable(tidy(train_df), "pipe")

column	n	mean	sd	median	trimmed	mad	min	max	range	skew	kurtosis	se
zn	466	11.5772532	23.3646511	0.00000	5.3542781	0.0000	0.0000	100.0000	100.0000	2.1838409	6.842914	1.0823466
indus	466	11.1050215	6.8458549	9.69000	10.9082353	6.3000	0.4600	27.7400	27.2800	0.2894763	1.764351	0.3171281
chas	466	0.0708155	0.2567920	0.00000	0.0000000	0.0000	0.0000	1.0000	1.0000	3.3462553	12.197425	0.0118957
nox	466	0.5543105	0.1166667	0.53800	0.5442684	0.0900	0.3890	0.8710	0.4820	0.7487369	2.976990	0.0054045
rm	466	6.2906738	0.7048513	6.21000	6.2570615	0.3485	3.8630	8.7800	4.9170	0.4808673	4.561996	0.0326516
age	466	68.3675966	28.3213784	77.15000	70.9553476	20.2500	2.9000	100.0000	97.1000	-0.5795721	1.998687	1.3119625
dis	466	3.7956929	2.1069496	3.19095	3.5443647	1.2913	1.1296	12.1265	10.9969	1.0021166	3.486917	0.0976026
rad	466	9.5300429	8.6859272	5.00000	8.6978610	1.0000	1.0000	24.0000	23.0000	1.0135395	2.147295	0.4023678
tax	466	409.5021459	167.9000887	334.50000	401.5080214	70.5000	187.0000	711.0000	524.0000	0.6614416	1.859928	7.7778214
ptratio	466	18.3984979	2.1968447	18.90000	18.5970588	1.3000	12.6000	22.0000	9.4000	-0.7567025	2.610831	0.1017669
lstat	466	12.6314592	7.1018907	11.35000	11.8809626	4.7700	1.7300	37.9700	36.2400	0.9085092	3.518453	0.3289887
medv	466	22.5892704	9.2396814	21.20000	21.6304813	4.0500	5.0000	50.0000	45.0000	1.0801670	4.392615	0.4280200
target	466	0.4914163	0.5004636	0.00000	0.4893048	0.0000	0.0000	1.0000	1.0000	0.0343398	1.001179	0.0231835

show_summary <- function(df) {
    cat(rep("+", 50), "\n")
    cat(paste("DIMENSIONS : (", nrow(df), ", ", ncol(df), ")\n", sep = ""), "\n")
    cat(rep("+", 50), "\n")
    cat("COLUMNS:\n", "\n")
    cat(names(df), "\n")
    cat(rep("+", 50), "\n")
    cat("DATA INFO:\n", "\n")
    cat(sapply(df, class), "\n")
    cat(rep("+", 50), "\n")
    cat("MISSING VALUES:\n", "\n")
    cat(colSums(is.na(df)), "\n")
}

show_summary(train_df)

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
DIMENSIONS : (466, 13)
 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
COLUMNS:
 
zn indus chas nox rm age dis rad tax ptratio lstat medv target 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
DATA INFO:
 
numeric numeric integer numeric numeric numeric numeric integer integer numeric numeric numeric integer 
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
MISSING VALUES:
 
0 0 0 0 0 0 0 0 0 0 0 0 0

The dataset is clean, with no missing values in any column. All explanatory variables are stored as numeric or integer types, which suits the modeling functions used below. The next step is to examine the data for multicollinearity among the predictors.

# correlation plot
correlation_Matrix <- cor(train_df[, 1:12])

corrplot(correlation_Matrix, method = "color", type = "upper", addCoef.col = "black",
    number.cex = 0.7)

To address multicollinearity, we remove variables whose absolute correlation with another predictor exceeds 0.7. The correlation analysis shows that dis is highly negatively correlated with indus, nox, and age. Based on the correlation matrix, we remove dis, tax, medv, and nox to reduce multicollinearity and improve model stability.
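The threshold rule can be applied programmatically rather than by reading the plot. The following is a minimal sketch, not the code used in this report: the helper name high_cor_pairs is hypothetical, and it is demonstrated on R's built-in mtcars data so that it runs without the CSV files.

```r
# Hypothetical helper: list variable pairs whose absolute correlation
# exceeds a cutoff (0.7 here, matching the threshold in the text).
high_cor_pairs <- function(df, threshold = 0.7) {
  cm <- cor(df)
  # Look only at the upper triangle so each pair appears once
  idx <- which(abs(cm) > threshold & upper.tri(cm), arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r    = cm[idx],
    stringsAsFactors = FALSE
  )
  # Strongest correlations first
  out[order(-abs(out$r)), ]
}

# Demonstration on built-in data; in this report the input would be
# the twelve predictor columns, e.g. train_df[, 1:12].
pairs <- high_cor_pairs(mtcars, threshold = 0.7)
print(pairs)
```

Listing the pairs makes it easier to justify which member of each correlated pair is dropped.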

train_df <- train_df |>
    mutate(crime = ifelse(target == 1, "high", "low"))

df_2 <- train_df |>
    select(indus, nox, age, dis, crime)
ggpairs(data = df_2, columns = 1:4, ggplot2::aes(color = crime))

train_df <- train_df |>
    select(-dis, -tax, -medv, -nox)

Data Preprocessing

During preprocessing, we examined correlations among predictor variables to identify multicollinearity. The variables dis, tax, medv, and nox showed high correlations (absolute value above 0.7) with other predictors, which can lead to unstable coefficient estimates in logistic regression. We therefore removed them to improve model stability and interpretability.

After removal, the correlation structure was rechecked and showed reduced multicollinearity. Additionally, no missing values were found in the dataset, so no imputation was required, and all variables were already in appropriate numeric format.

correlation_Matrix <- cor(train_df[, 1:8])

corrplot(correlation_Matrix, method = "color", type = "upper", addCoef.col = "black",
    number.cex = 0.7)

train_df <- train_df |>
    select(-crime)

Model Development

We start with a simple logistic regression model and then add complexity to find the best-fitting model.

Baseline Logistic Regression Model

This model sets a baseline AIC of 522.46 using a single predictor, zn. We expect it to have the worst (highest) AIC of the models considered.

simple_model <- glm(target ~ zn, data = train_df, family = "binomial")
summary(simple_model)


Call:
glm(formula = target ~ zn, family = "binomial", data = train_df)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  0.54503    0.11170   4.879 1.06e-06 ***
zn          -0.09176    0.01349  -6.804 1.02e-11 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 645.88  on 465  degrees of freedom
Residual deviance: 518.46  on 464  degrees of freedom
AIC: 522.46

Number of Fisher Scoring iterations: 6

Full Model (All Predictors)

For the second model, we include all predictors that remain after the correlation screen. Its AIC of 294.49 is much better than the baseline, although it may not be the best achievable.

everything_model <- glm(target ~ ., data = train_df, family = "binomial")
summary(everything_model)


Call:
glm(formula = target ~ ., family = "binomial", data = train_df)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -8.640771   3.035549  -2.847  0.00442 ** 
zn          -0.056399   0.019159  -2.944  0.00324 ** 
indus        0.049239   0.027236   1.808  0.07063 .  
chas         0.168888   0.584261   0.289  0.77253    
rm           0.693054   0.349282   1.984  0.04723 *  
age          0.035598   0.008583   4.148 3.36e-05 ***
rad          0.483674   0.113763   4.252 2.12e-05 ***
ptratio     -0.102886   0.071228  -1.444  0.14861    
lstat        0.043899   0.041213   1.065  0.28680    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 645.88  on 465  degrees of freedom
Residual deviance: 276.49  on 457  degrees of freedom
AIC: 294.49

Number of Fisher Scoring iterations: 8

After building the full model, we used the vif() function to assess multicollinearity among the remaining predictors. All variance inflation factors are below 3 (the largest is 2.70, for lstat), indicating relatively low multicollinearity.

data.frame(vif(everything_model))

        vif.everything_model.
zn                   1.464757
indus                1.494374
chas                 1.063072
rm                   2.501189
age                  1.560749
rad                  1.142231
ptratio              1.289813
lstat                2.698502

Backward Stepwise Selection

step(everything_model, direction = "backward", scope = formula(everything_model))

Start:  AIC=294.49
target ~ zn + indus + chas + rm + age + rad + ptratio + lstat

          Df Deviance    AIC
- chas     1   276.57 292.57
- lstat    1   277.62 293.62
<none>         276.49 294.49
- ptratio  1   278.61 294.61
- indus    1   279.81 295.81
- rm       1   280.49 296.49
- zn       1   288.67 304.67
- age      1   297.28 313.28
- rad      1   367.79 383.79

Step:  AIC=292.57
target ~ zn + indus + rm + age + rad + ptratio + lstat

          Df Deviance    AIC
- lstat    1   277.73 291.73
<none>         276.57 292.57
- ptratio  1   278.95 292.95
- indus    1   279.99 293.99
- rm       1   280.57 294.57
- zn       1   289.10 303.10
- age      1   297.37 311.37
- rad      1   368.22 382.22

Step:  AIC=291.73
target ~ zn + indus + rm + age + rad + ptratio

          Df Deviance    AIC
<none>         277.73 291.73
- ptratio  1   280.04 292.04
- rm       1   280.74 292.74
- indus    1   281.64 293.64
- zn       1   289.48 301.48
- age      1   312.25 324.25
- rad      1   373.55 385.55


Call:  glm(formula = target ~ zn + indus + rm + age + rad + ptratio, 
    family = "binomial", data = train_df)

Coefficients:
(Intercept)           zn        indus           rm          age          rad  
   -6.83783     -0.05417      0.05294      0.43959      0.03994      0.48680  
    ptratio  
   -0.10514  

Degrees of Freedom: 465 Total (i.e. Null);  459 Residual
Null Deviance:      645.9 
Residual Deviance: 277.7    AIC: 291.7

Forward Stepwise Selection

step(simple_model, direction = "forward", scope = formula(everything_model))

Start:  AIC=522.46
target ~ zn

          Df Deviance    AIC
+ rad      1   344.89 350.89
+ age      1   407.45 413.45
+ indus    1   432.03 438.03
+ lstat    1   473.70 479.70
+ chas     1   516.06 522.06
<none>         518.46 522.46
+ ptratio  1   517.42 523.42
+ rm       1   518.37 524.37

Step:  AIC=350.89
target ~ zn + rad

          Df Deviance    AIC
+ age      1   286.93 294.93
+ indus    1   325.11 333.11
+ ptratio  1   334.65 342.65
+ lstat    1   336.31 344.31
+ chas     1   342.50 350.50
+ rm       1   342.80 350.80
<none>         344.89 350.89

Step:  AIC=294.93
target ~ zn + rad + age

          Df Deviance    AIC
+ ptratio  1   283.11 293.11
+ rm       1   284.19 294.19
+ indus    1   284.80 294.80
<none>         286.93 294.93
+ chas     1   286.28 296.28
+ lstat    1   286.80 296.80

Step:  AIC=293.11
target ~ zn + rad + age + ptratio

        Df Deviance    AIC
+ indus  1   280.74 292.74
<none>       283.11 293.11
+ rm     1   281.64 293.64
+ chas   1   282.94 294.94
+ lstat  1   283.11 295.11

Step:  AIC=292.74
target ~ zn + rad + age + ptratio + indus

        Df Deviance    AIC
+ rm     1   277.73 291.73
<none>       280.74 292.74
+ lstat  1   280.57 294.57
+ chas   1   280.67 294.67

Step:  AIC=291.73
target ~ zn + rad + age + ptratio + indus + rm

        Df Deviance    AIC
<none>       277.73 291.73
+ lstat  1   276.57 292.57
+ chas   1   277.62 293.62


Call:  glm(formula = target ~ zn + rad + age + ptratio + indus + rm, 
    family = "binomial", data = train_df)

Coefficients:
(Intercept)           zn          rad          age      ptratio        indus  
   -6.83783     -0.05417      0.48680      0.03994     -0.10514      0.05294  
         rm  
    0.43959  

Degrees of Freedom: 465 Total (i.e. Null);  459 Residual
Null Deviance:      645.9 
Residual Deviance: 277.7    AIC: 291.7

Forward and backward selection arrive at the same logistic regression model, with an AIC of 291.73. Both directions yield the formula target ~ zn + rad + age + ptratio + indus + rm.

Model Evaluation and Selection

The stepwise model has the lowest AIC (291.73), indicating the best trade-off between fit and complexity among the models considered. As expected, the baseline model performed worst (AIC 522.46), since a single predictor leaves most of the deviance unexplained. The full model improved substantially (AIC 294.49) but retains predictors that add little. The stepwise model drops chas and lstat; somewhat surprisingly, forward selection and backward elimination arrived at the same final model. The AIC gap between the full and stepwise models is small (about 2.8), so the preference rests mainly on parsimony. Based on these results, we use the stepwise model to predict the target variable in the evaluation dataset.

step_mod <- glm(target ~ zn + rad + age + ptratio + indus + rm, family = binomial,
    data = train_df)

summary(step_mod)


Call:
glm(formula = target ~ zn + rad + age + ptratio + indus + rm, 
    family = binomial, data = train_df)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -6.837830   2.547577  -2.684  0.00727 ** 
zn          -0.054166   0.018564  -2.918  0.00353 ** 
rad          0.486803   0.111965   4.348 1.38e-05 ***
age          0.039941   0.007694   5.191 2.09e-07 ***
ptratio     -0.105136   0.069678  -1.509  0.13133    
indus        0.052943   0.027031   1.959  0.05016 .  
rm           0.439592   0.257184   1.709  0.08740 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 645.88  on 465  degrees of freedom
Residual deviance: 277.73  on 459  degrees of freedom
AIC: 291.73

Number of Fisher Scoring iterations: 8

The final logistic regression model includes the predictors zn, rad, age, ptratio, indus, and rm. The coefficients represent changes in the log-odds of a neighborhood being classified as high crime.

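zn and ptratio have negative coefficients, indicating that higher values are associated with lower odds of high crime. In contrast, rad, age, indus, and rm have positive coefficients, so higher values are associated with higher odds. Only zn, rad, and age are significant at the 0.05 level; indus (p ≈ 0.050), rm (p = 0.087), and ptratio (p = 0.131) are not.

The signs for zn, rad, age, and indus align with expected relationships. The positive sign on rm and the negative sign on ptratio are less intuitive, and because neither is statistically significant they should be interpreted with caution.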
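Log-odds coefficients are easier to read on the odds scale. The sketch below exponentiates the estimates copied from the summary(step_mod) output; the vector names coefs and odds_ratios are introduced here for illustration. With the fitted object in memory, exp(coef(step_mod)) would give the same values without retyping.

```r
# Estimates copied from the summary(step_mod) output (intercept omitted)
coefs <- c(zn = -0.054166, rad = 0.486803, age = 0.039941,
           ptratio = -0.105136, indus = 0.052943, rm = 0.439592)

# exp(coefficient) is the multiplicative change in the odds of high crime
# for a one-unit increase in the predictor, holding the others fixed
odds_ratios <- exp(coefs)
print(round(odds_ratios, 3))

# Values above 1 raise the odds; values below 1 lower them
print(names(odds_ratios)[odds_ratios > 1])
print(names(odds_ratios)[odds_ratios < 1])
```

An odds ratio is per unit of the predictor, so its size depends on the predictor's scale: a one-unit change in rad is a much larger step than a one-unit change in age.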

predictions <- predict(step_mod, eval_df, type = "response")

threshold <- 0.5

binary_predictions <- ifelse(predictions >= threshold, 1, 0)

eval_df$target_pred <- binary_predictions
model_names <- c("simple", "everything", "step")
aic_values <- c(simple_model$aic, everything_model$aic, step_mod$aic)

kable(cbind(model_names, aic_values), col.names = c("Model Name", "AIC Value")) |>
    kable_styling(full_width = T)

Model Name	AIC Value
simple	522.464452017651
everything	294.486574605442
step	291.726687224588
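The AIC values in this table can be checked by hand. For a logistic regression fitted to ungrouped 0/1 responses, the saturated log-likelihood is zero, so the residual deviance equals minus twice the log-likelihood and the AIC is the residual deviance plus twice the number of estimated coefficients. The symbols D (residual deviance) and k (number of coefficients, intercept included) are notation introduced here; the deviances come from the model summaries.

```latex
\begin{align*}
\mathrm{AIC} &= -2\log L + 2k = D + 2k \\
\mathrm{AIC}_{\text{simple}}     &= 518.46 + 2 \cdot 2 = 522.46 \\
\mathrm{AIC}_{\text{everything}} &= 276.49 + 2 \cdot 9 = 294.49 \\
\mathrm{AIC}_{\text{step}}       &= 277.73 + 2 \cdot 7 = 291.73
\end{align*}
```

The stepwise model gives up 1.24 units of deviance relative to the full model but saves two parameters (a penalty of 4), which is why its AIC is lower.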

library(caret)
library(pROC)

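# Convert to factor for evaluation
# NOTE: eval_df has no `target` column; `$` partial-matches `target_pred`,
# so `actual` below is identical to `predicted`
actual <- as.factor(eval_df$target)
predicted <- as.factor(binary_predictions)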

# Confusion Matrix
conf_mat <- confusionMatrix(predicted, actual)
conf_mat

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 23  0
         1  0 17
                                     
               Accuracy : 1          
                 95% CI : (0.9119, 1)
    No Information Rate : 0.575      
    P-Value [Acc > NIR] : 2.436e-10  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : 1.000      
            Specificity : 1.000      
         Pos Pred Value : 1.000      
         Neg Pred Value : 1.000      
             Prevalence : 0.575      
         Detection Rate : 0.575      
   Detection Prevalence : 0.575      
      Balanced Accuracy : 1.000      
                                     
       'Positive' Class : 0

# Extract key metrics
accuracy  <- conf_mat$overall['Accuracy']
precision <- conf_mat$byClass['Precision']
recall    <- conf_mat$byClass['Sensitivity']
specificity <- conf_mat$byClass['Specificity']
f1_score  <- conf_mat$byClass['F1']

# ROC and AUC
roc_obj <- roc(actual, predictions)
auc_value <- auc(roc_obj)

# Print results
metrics <- data.frame(
  Metric = c("Accuracy", "Precision", "Recall (Sensitivity)", 
             "Specificity", "F1 Score", "AUC"),
  Value = c(accuracy, precision, recall, specificity, f1_score, auc_value)
)

kable(metrics) %>%
  kable_styling(full_width = TRUE)

Metric	Value
Accuracy	1
Precision	1
Recall (Sensitivity)	1
Specificity	1
F1 Score	1
AUC	1

# Plot ROC Curve
plot(roc_obj, main = "ROC Curve")

To evaluate the final logistic regression model, we computed a confusion matrix, precision, recall, F1 score, and the ROC curve with its AUC. These results need careful reading. The evaluation dataset has no target column, as the Prediction Results table shows, so eval_df$target partially matches target_pred and the "actual" labels are the model's own predictions. The perfect scores (accuracy, precision, recall, F1, and AUC all equal to 1) therefore reflect predictions compared with themselves, not predictive performance.

A valid assessment requires labeled data, for example a held-out portion of the training set or cross-validation. Until that is done, the evidence for the selected model is its in-sample fit: the lowest AIC of the three candidates.
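Because the evaluation file carries no target column (the Prediction Results table lists only the predictors and target_pred), one way to estimate out-of-sample performance is to hold out part of the labeled data. The following is a minimal sketch using base R only and synthetic data, so it runs without the CSV files; the column names x1, x2, and y, the seed, and the 80/20 split are all assumptions made for illustration.

```r
set.seed(621)  # arbitrary seed so the split is reproducible

# Synthetic stand-in for labeled data (hypothetical columns x1, x2, y)
n  <- 400
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- rbinom(n, 1, plogis(-0.5 + 1.5 * df$x1 - 1.0 * df$x2))

# Hold out 20% of the labeled rows for evaluation
test_idx <- sample(seq_len(n), size = n %/% 5)
train    <- df[-test_idx, ]
test     <- df[test_idx, ]

# Fit on the training part, predict on the held-out part
fit  <- glm(y ~ x1 + x2, data = train, family = binomial)
prob <- predict(fit, newdata = test, type = "response")
pred <- as.integer(prob >= 0.5)

# [[ ]] requires an exact column name, unlike `$`, which partial-matches
conf <- table(Predicted = factor(pred, levels = 0:1),
              Actual    = factor(test[["y"]], levels = 0:1))
accuracy <- sum(diag(conf)) / sum(conf)
print(conf)
print(accuracy)
```

Applied to this report, the same pattern would split train_df before fitting and score the stepwise formula on the held-out rows.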

Prediction Results

kable(eval_df) |>
    kable_styling(full_width = T)

zn	indus	chas	nox	rm	age	dis	rad	tax	ptratio	lstat	medv	target_pred
0	7.07	0	0.469	7.185	61.1	4.9671	2	242	17.8	4.03	34.7	0
0	8.14	0	0.538	6.096	84.5	4.4619	4	307	21.0	10.26	18.2	0
0	8.14	0	0.538	6.495	94.4	4.4547	4	307	21.0	12.80	18.4	0
0	8.14	0	0.538	5.950	82.0	3.9900	4	307	21.0	27.71	13.2	0
0	5.96	0	0.499	5.850	41.5	3.9342	5	279	19.2	8.77	21.0	0
25	5.13	0	0.453	5.741	66.2	7.2254	8	284	19.7	13.15	18.7	0
25	5.13	0	0.453	5.966	93.4	6.8185	8	284	19.7	14.44	16.0	1
0	4.49	0	0.449	6.630	56.1	4.4377	3	247	18.5	6.53	26.6	0
0	4.49	0	0.449	6.121	56.8	3.7476	3	247	18.5	8.44	22.2	0
0	2.89	0	0.445	6.163	69.6	3.4952	2	276	18.0	11.34	21.4	0
0	25.65	0	0.581	5.856	97.0	1.9444	2	188	19.1	25.41	17.3	0
0	25.65	0	0.581	5.613	95.6	1.7572	2	188	19.1	27.26	15.7	0
0	21.89	0	0.624	5.637	94.7	1.9799	4	437	21.2	18.34	14.3	1
0	19.58	0	0.605	6.101	93.0	2.2834	5	403	14.7	9.81	25.0	1
0	19.58	0	0.605	5.880	97.3	2.3887	5	403	14.7	12.03	19.1	1
0	10.59	1	0.489	5.960	92.1	3.8771	4	277	18.6	17.27	21.7	1
0	6.20	0	0.504	6.552	21.4	3.3751	8	307	17.4	3.76	31.5	0
0	6.20	0	0.507	8.247	70.4	3.6519	8	307	17.4	3.95	48.3	1
22	5.86	0	0.431	6.957	6.8	8.9067	7	330	19.1	3.53	29.6	0
90	2.97	0	0.400	7.088	20.8	7.3073	1	285	15.3	7.85	32.2	0
80	1.76	0	0.385	6.230	31.5	9.0892	1	241	18.2	12.93	20.1	0
33	2.18	0	0.472	6.616	58.1	3.3700	7	222	18.4	8.93	28.4	0
0	9.90	0	0.544	6.122	52.8	2.6403	4	304	18.4	5.98	22.1	0
0	7.38	0	0.493	6.415	40.1	4.7211	5	287	19.6	6.12	25.0	0
0	7.38	0	0.493	6.312	28.9	5.4159	5	287	19.6	6.15	23.0	0
0	5.19	0	0.515	5.895	59.6	5.6150	5	224	20.2	10.56	18.5	0
80	2.01	0	0.435	6.635	29.7	8.3440	4	280	17.0	5.99	24.5	0
0	18.10	0	0.718	3.561	87.9	1.6132	24	666	20.2	7.12	27.5	1
0	18.10	1	0.631	7.016	97.5	1.2024	24	666	20.2	2.96	50.0	1
0	18.10	0	0.584	6.348	86.1	2.0527	24	666	20.2	17.64	14.5	1
0	18.10	0	0.740	5.935	87.9	1.8206	24	666	20.2	34.02	8.4	1
0	18.10	0	0.740	5.627	93.9	1.8172	24	666	20.2	22.88	12.8	1
0	18.10	0	0.740	5.818	92.4	1.8662	24	666	20.2	22.11	10.5	1
0	18.10	0	0.740	6.219	100.0	2.0048	24	666	20.2	16.59	18.4	1
0	18.10	0	0.740	5.854	96.6	1.8956	24	666	20.2	23.79	10.8	1
0	18.10	0	0.713	6.525	86.5	2.4358	24	666	20.2	18.13	14.1	1
0	18.10	0	0.713	6.376	88.4	2.5671	24	666	20.2	14.65	17.7	1
0	18.10	0	0.655	6.209	65.4	2.9634	24	666	20.2	13.22	21.4	1
0	9.69	0	0.585	5.794	70.6	2.8927	6	391	19.2	14.10	18.3	0
0	11.93	0	0.573	6.976	91.0	2.1675	1	273	21.0	5.64	23.9	0
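The first row of this table can be reproduced by hand from the fitted coefficients. This sketch copies the estimates from the summary(step_mod) output and the predictor values from the first evaluation row; the vector names beta and x are introduced here for illustration.

```r
# Coefficients copied from the summary(step_mod) output
beta <- c(intercept = -6.837830, zn = -0.054166, rad = 0.486803,
          age = 0.039941, ptratio = -0.105136, indus = 0.052943,
          rm = 0.439592)

# First evaluation row, with a leading 1 for the intercept
x <- c(intercept = 1, zn = 0, rad = 2, age = 61.1,
       ptratio = 17.8, indus = 7.07, rm = 7.185)

eta  <- sum(beta * x[names(beta)])  # linear predictor (log-odds)
prob <- plogis(eta)                 # inverse logit: 1 / (1 + exp(-eta))
pred <- as.integer(prob >= 0.5)     # same 0.5 threshold as the report

print(c(eta = eta, prob = prob, pred = pred))
```

The linear predictor is negative, so the probability falls below 0.5 and the row is classified as 0, in agreement with target_pred in the table.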

Final Conclusions

In this analysis, we developed and compared multiple logistic regression models to predict whether a neighborhood has a high crime rate. The exploratory data analysis showed that the dataset was clean and suitable for modeling, with no missing values and manageable correlations among variables after preprocessing.

To improve model stability, highly correlated predictors were removed, reducing multicollinearity and ensuring more reliable coefficient estimates. We then built three models: a baseline model with a single predictor, a full model with all predictors, and stepwise models using both forward and backward selection.

The results demonstrated that the stepwise selection approach provided the best-performing model, achieving the lowest AIC value while maintaining a balance between model complexity and predictive power. Interestingly, both forward and backward methods converged to the same final set of predictors (zn, rad, age, ptratio, indus, and rm), reinforcing the robustness of the selected model.

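Using this final model, we generated predicted probabilities and, at a 0.5 threshold, classifications for the 40 neighborhoods in the evaluation dataset, 17 of which were classified as high crime. Because the evaluation data are unlabeled, these predictions cannot be scored directly; the model's ability to distinguish high- from low-crime neighborhoods still needs to be confirmed on labeled data.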

Overall, this study highlights the importance of proper data preprocessing, careful variable selection, and model comparison. The final logistic regression model provides a reliable and interpretable approach for predicting crime risk, though future improvements could include testing alternative models or validating performance using additional datasets.

DATA 621 - Homework 3 - 11 Apr 2026

Bikash Bhowmik, Rupendra Shrestha, Anthony Josue Roman, Jerald Melukkaran
