DATA 621 HW 3

SECTION 1 — DATA EXPLORATION

Data Exploration

The training dataset contains information on neighborhood characteristics and a binary response variable indicating whether the crime rate is above the median (1) or not (0). The dataset includes 13 predictor variables and the target variable. All predictors are numeric except chas, which is a binary indicator for whether the neighborhood borders the Charles River.

A review of the dataset structure shows that all variables were successfully imported with appropriate data types. Summary statistics (mean, median, and standard deviation) indicate meaningful variation across predictors such as lstat (lower-status population percentage), rm (average rooms per dwelling), and nox (nitrogen oxide concentration), all of which have historically been associated with socioeconomic conditions and neighborhood quality.

A check for missing values revealed no missing observations, so no imputation is required at this stage.

The target variable is moderately imbalanced: approximately X% of neighborhoods fall into the high-crime category (target = 1), while Y% fall into the low-crime category (target = 0). This level of imbalance is not severe enough to require resampling techniques, but it does reinforce the importance of evaluating multiple classification metrics (e.g., precision, recall, AUC) when selecting the final model.

A correlation analysis of the numeric predictors reveals several noteworthy relationships. Variables such as lstat, rm, medv, and nox show substantial correlation with each other, suggesting the presence of multicollinearity—common in housing and demographic datasets. The correlation matrix demonstrates that lstat (lower-status population) has one of the strongest correlations with the crime target, which is consistent with economic theory and prior research. Similarly, variables such as rad (highway access index) and tax also show meaningful associations.

Boxplots comparing each predictor against the crime target provide additional insight. Neighborhoods labeled as high-crime tend to have:

Higher lstat values (lower socioeconomic status)

Higher nox levels (poorer air quality, more urbanized areas)

Lower rm values (smaller homes)

Lower medv values (lower property values)

These relationships align with expectations and help justify their inclusion in predictive modeling.

Overall, the dataset appears clean, complete, and suitable for logistic regression modeling. The exploratory analysis suggests that socioeconomic and environmental indicators are meaningful predictors of neighborhood crime levels.

# --- Load libraries ---
library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(corrplot)

## corrplot 0.95 loaded

library(pROC)

## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## 
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

library(caret)

## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift

# --- Load data ---
train <- read.csv("crime-training-data_modified.csv")
eval  <- read.csv("crime-evaluation-data_modified.csv")

# --- Basic structure ---
str(train)

## 'data.frame':    466 obs. of  13 variables:
##  $ zn     : num  0 0 0 30 0 0 0 0 0 80 ...
##  $ indus  : num  19.58 19.58 18.1 4.93 2.46 ...
##  $ chas   : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.605 0.871 0.74 0.428 0.488 0.52 0.693 0.693 0.515 0.392 ...
##  $ rm     : num  7.93 5.4 6.49 6.39 7.16 ...
##  $ age    : num  96.2 100 100 7.8 92.2 71.3 100 100 38.1 19.1 ...
##  $ dis    : num  2.05 1.32 1.98 7.04 2.7 ...
##  $ rad    : int  5 5 24 6 3 5 24 24 5 1 ...
##  $ tax    : int  403 403 666 300 193 384 666 666 224 315 ...
##  $ ptratio: num  14.7 14.7 20.2 16.6 17.8 20.9 20.2 20.2 20.2 16.4 ...
##  $ lstat  : num  3.7 26.82 18.85 5.19 4.82 ...
##  $ medv   : num  50 13.4 15.4 23.7 37.9 26.5 5 7 22.2 20.9 ...
##  $ target : int  1 1 1 0 0 0 1 1 0 0 ...

summary(train)

##        zn             indus             chas              nox        
##  Min.   :  0.00   Min.   : 0.460   Min.   :0.00000   Min.   :0.3890  
##  1st Qu.:  0.00   1st Qu.: 5.145   1st Qu.:0.00000   1st Qu.:0.4480  
##  Median :  0.00   Median : 9.690   Median :0.00000   Median :0.5380  
##  Mean   : 11.58   Mean   :11.105   Mean   :0.07082   Mean   :0.5543  
##  3rd Qu.: 16.25   3rd Qu.:18.100   3rd Qu.:0.00000   3rd Qu.:0.6240  
##  Max.   :100.00   Max.   :27.740   Max.   :1.00000   Max.   :0.8710  
##        rm             age              dis              rad       
##  Min.   :3.863   Min.   :  2.90   Min.   : 1.130   Min.   : 1.00  
##  1st Qu.:5.887   1st Qu.: 43.88   1st Qu.: 2.101   1st Qu.: 4.00  
##  Median :6.210   Median : 77.15   Median : 3.191   Median : 5.00  
##  Mean   :6.291   Mean   : 68.37   Mean   : 3.796   Mean   : 9.53  
##  3rd Qu.:6.630   3rd Qu.: 94.10   3rd Qu.: 5.215   3rd Qu.:24.00  
##  Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :24.00  
##       tax           ptratio         lstat             medv      
##  Min.   :187.0   Min.   :12.6   Min.   : 1.730   Min.   : 5.00  
##  1st Qu.:281.0   1st Qu.:16.9   1st Qu.: 7.043   1st Qu.:17.02  
##  Median :334.5   Median :18.9   Median :11.350   Median :21.20  
##  Mean   :409.5   Mean   :18.4   Mean   :12.631   Mean   :22.59  
##  3rd Qu.:666.0   3rd Qu.:20.2   3rd Qu.:16.930   3rd Qu.:25.00  
##  Max.   :711.0   Max.   :22.0   Max.   :37.970   Max.   :50.00  
##      target      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4914  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

# --- Check missing values ---
colSums(is.na(train))

##      zn   indus    chas     nox      rm     age     dis     rad     tax ptratio 
##       0       0       0       0       0       0       0       0       0       0 
##   lstat    medv  target 
##       0       0       0

# --- Target variable balance ---
table(train$target)

## 
##   0   1 
## 237 229

prop.table(table(train$target))

## 
##         0         1 
## 0.5085837 0.4914163

# --- Correlation matrix (numeric predictors only) ---
numeric_vars <- train %>% select(-target)
corr_matrix <- cor(numeric_vars)
corrplot(corr_matrix, method = "color", tl.cex = 0.8)

# --- Boxplots of predictors vs target ---
train_long <- train %>%
  pivot_longer(cols = -target, names_to = "variable", values_to = "value")

ggplot(train_long, aes(x = factor(target), y = value)) +
  geom_boxplot() +
  facet_wrap(~variable, scales = "free", ncol = 4) +
  labs(x = "Target (0 = Low Crime, 1 = High Crime)", y = "Value")

SECTION 2 - Data Preparation

Data Preparation

Based on the exploratory analysis, several predictors exhibited skewed distributions, nonlinear relationships with the target variable, or interaction effects with meaningful interpretation. To improve model performance, interpretability, and stability, I applied a series of purposeful transformations to the training data.

Several variables—including lstat, dis, and tax—were positively skewed. To reduce this skewness and better linearize their relationships with the log-odds of crime risk, I applied log transformations (e.g., log_lstat and log_dis). Log transforms are common in socioeconomic data where effects are multiplicative rather than additive.

Certain predictors, especially rm (average rooms) and lstat (lower-status population), demonstrated nonlinear relationships with the target. To capture curvature in these relationships, I introduced polynomial terms (rm² and lstat²). This allows the logistic model to detect diminishing returns or threshold effects that would be missed by purely linear terms.

I also introduced meaningful interaction terms. For example, rm × lstat captures how the socioeconomic benefit of larger homes varies depending on population status. Similarly, nox × indus measures how industrial intensity amplifies pollution exposure—both intuitively relevant to neighborhood crime dynamics.

To address nonlinearity in very skewed environmental variables, I bucketized nox into quartiles. This creates a categorical version of pollution that can capture threshold effects (e.g., extreme pollution zones).

Finally, to support model convergence and reduce sensitivity to differences in scale, all numeric predictors were standardized (mean 0, standard deviation 1). Standardization improves both interpretability and fit in logistic regression, especially when polynomial or interaction terms are included.

These transformations collectively enhance the model’s flexibility and predictive power while preserving interpretability—key requirements for business analytics and applied data mining.

# --- Data Preparation ---

library(tidyverse)

train2 <- train

# 1. Log transforms for skewed variables
train2 <- train2 %>%
  mutate(
    log_lstat = log(lstat + 1),   # avoid log(0)
    log_dis   = log(dis + 1),
    log_tax   = log(tax)
  )

# 2. Polynomial terms for nonlinear relationships
train2 <- train2 %>%
  mutate(
    rm_sq     = rm^2,
    lstat_sq  = lstat^2
  )

# 3. Interaction terms
train2 <- train2 %>%
  mutate(
    rm_x_lstat = rm * lstat,
    nox_x_indus = nox * indus
  )

# 4. Bucketization (quartiles) for 'nox'
train2 <- train2 %>%
  mutate(
    nox_bucket = ntile(nox, 4),
    nox_bucket = factor(nox_bucket)
  )

# 5. Identify numeric columns to standardize (exclude target and factor bucket)
num_cols <- train2 %>%
  select(-target, -nox_bucket) %>%
  select(where(is.numeric)) %>%
  colnames()

# --- Compute training means and sds BEFORE scaling ---
train_means <- sapply(train2[num_cols], mean, na.rm = TRUE)
train_sds   <- sapply(train2[num_cols], sd,   na.rm = TRUE)

# --- Standardize training data ---
train2[num_cols] <- scale(train2[num_cols],
                          center = train_means,
                          scale  = train_sds)

# --- Apply same transformations to evaluation data ---

eval2 <- eval

eval2 <- eval2 %>%
  mutate(
    log_lstat   = log(lstat + 1),
    log_dis     = log(dis + 1),
    log_tax     = log(tax),
    rm_sq       = rm^2,
    lstat_sq    = lstat^2,
    rm_x_lstat  = rm * lstat,
    nox_x_indus = nox * indus,
    nox_bucket  = ntile(nox, 4),
    nox_bucket  = factor(nox_bucket)
  )

# Make sure eval2 has the same numeric columns as train2 for scaling
eval2[num_cols] <- scale(eval2[num_cols],
                         center = train_means,
                         scale  = train_sds)

Section 3 - Build Models

Logistic Regression Models

To predict whether a neighborhood’s crime rate is above the median (target = 1), I estimated three different binary logistic regression models using the prepared training dataset. Each model uses a different combination of predictors and transformations, allowing a comparison between a simple baseline and more flexible, engineered specifications.

Model 1 – Baseline Logistic Regression

The first model uses only the original set of neighborhood predictors: zoning (zn), industrial land share (indus), Charles River dummy (chas), pollution (nox), average rooms (rm), housing age (age), distance to employment centers (dis), highway accessibility (rad), tax rate (tax), pupil–teacher ratio (ptratio), lower-status population (lstat), and median home value (medv). This model serves as a benchmark and reflects the simplest specification that a manager or analyst might start with.

Inference for this model is based on the estimated coefficients and their associated z-statistics and p-values from the logistic regression summary. Coefficients with statistically significant p-values (typically below 0.05) indicate predictors that have a meaningful association with the log-odds of high crime, holding other variables constant. Positive coefficients increase the log-odds (and hence the probability) of high crime; negative coefficients decrease it. Exponentiating coefficients yields odds ratios, which provide a more interpretable multiplicative effect on the odds of high crime.

Model 2 – Stepwise Logistic Regression with Transformations

The second model starts from a null (intercept-only) model and uses stepwise selection (both forward and backward) based on AIC to choose from a richer pool of predictors, including both the original variables and engineered features. The candidate set includes log-transformed variables (log_lstat, log_dis, log_tax), polynomial terms (rm², lstat²), interaction terms (rm × lstat, nox × indus), and a bucketed version of nox.

The stepwise procedure iteratively adds or removes variables to minimize the AIC, balancing model fit and complexity. The resulting model retains only the subset of predictors that provide the best trade-off according to this criterion. In practice, this produces a more parsimonious model than the full specification, while still capturing important nonlinear and interaction effects. As with the baseline model, inference is based on the sign, magnitude, and statistical significance of the coefficients, which can be converted into odds ratios for interpretation.

Model 3 – Engineered Logistic Regression (Theory-Driven)

The third model is a theory-driven engineered specification that focuses on a curated set of transformed predictors that are both interpretable and empirically plausible. This model includes rm and rm² to capture a nonlinear effect of housing size, log_lstat to represent diminishing returns in the effect of lower-status population, and the interaction rm × lstat to reflect how the impact of housing quality may depend on socioeconomic status. It also incorporates nox_bucket to allow crime risk to vary flexibly across pollution quartiles, as well as rad, ptratio, and medv as additional structural and socioeconomic controls.

This engineered model emphasizes domain knowledge and interpretability rather than purely algorithmic selection. The signs of the coefficients can be compared to expectations from urban economics and criminology: for example, higher values of log_lstat would typically be expected to increase the odds of high crime, while higher medv or more rooms (rm) are often associated with lower crime risk, all else equal. Again, exponentiated coefficients yield odds ratios that management can interpret in terms of percentage changes in the odds of high crime.

Across all three models, logistic regression provides estimates of the log-odds of high crime as a linear combination of the predictors. By comparing coefficients, standard errors, AIC values, and overall model fit, I can assess which model offers the best combination of predictive performance, parsimony, and interpretability. The detailed performance comparison (including accuracy, precision, recall, F1 score, AUC, and confusion matrices) is presented in the next section.

# --- Ensure target is a factor for classification context ---
train2$target <- factor(train2$target, levels = c(0, 1))

# ============================================================
# Model 1: Baseline Logistic Regression (original predictors)
# ============================================================

model1 <- glm(
  target ~ zn + indus + chas + nox + rm + age + dis + rad + tax + 
    ptratio + lstat + medv,
  data   = train2,
  family = binomial(link = "logit")
)

summary(model1)

## 
## Call:
## glm(formula = target ~ zn + indus + chas + nox + rm + age + dis + 
##     rad + tax + ptratio + lstat + medv, family = binomial(link = "logit"), 
##     data = train2)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.3290     0.7195   3.237  0.00121 ** 
## zn           -1.5408     0.8097  -1.903  0.05706 .  
## indus        -0.4423     0.3260  -1.357  0.17485    
## chas          0.2339     0.1940   1.205  0.22803    
## nox           5.7309     0.9254   6.193 5.90e-10 ***
## rm           -0.4141     0.5095  -0.813  0.41637    
## age           0.9683     0.3912   2.475  0.01333 *  
## dis           1.5563     0.4852   3.208  0.00134 ** 
## rad           5.7880     1.4171   4.084 4.42e-05 ***
## tax          -1.0362     0.4961  -2.089  0.03674 *  
## ptratio       0.8844     0.2782   3.179  0.00148 ** 
## lstat         0.3258     0.3838   0.849  0.39608    
## medv          1.6708     0.6310   2.648  0.00810 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 192.05  on 453  degrees of freedom
## AIC: 218.05
## 
## Number of Fisher Scoring iterations: 9

# ============================================================
# Model 2: Stepwise-Selected Logistic Regression (with transforms)
# ============================================================

# Full formula including engineered features
full_formula <- target ~ zn + indus + chas + nox + rm + age + dis + rad + tax +
  ptratio + lstat + medv +
  log_lstat + log_dis + log_tax +
  rm_sq + lstat_sq +
  rm_x_lstat + nox_x_indus +
  nox_bucket

# Null model (intercept only)
null_model <- glm(
  target ~ 1,
  data   = train2,
  family = binomial(link = "logit")
)

# Stepwise using AIC
set.seed(123)  # for reproducibility of stepwise path
model2 <- step(
  null_model,
  scope    = list(lower = ~1, upper = full_formula),
  direction = "both",
  trace     = FALSE
)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(model2)

## 
## Call:
## glm(formula = target ~ rad + tax + log_tax + ptratio + indus + 
##     medv + lstat_sq + nox_bucket + zn + age + rm_x_lstat + log_dis + 
##     dis + nox_x_indus, family = binomial(link = "logit"), data = train2)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -0.77819    1.43196  -0.543  0.58682    
## rad            9.91937    1.90412   5.209 1.89e-07 ***
## tax          -28.39007    6.16066  -4.608 4.06e-06 ***
## log_tax       20.74953    4.40213   4.714 2.43e-06 ***
## ptratio        1.55217    0.38004   4.084 4.42e-05 ***
## indus         -7.42181    3.94826  -1.880  0.06014 .  
## medv           2.22417    0.60200   3.695  0.00022 ***
## lstat_sq       3.48958    1.18324   2.949  0.00319 ** 
## nox_bucket2   -0.08949    1.20237  -0.074  0.94067    
## nox_bucket3    2.01132    1.38232   1.455  0.14566    
## nox_bucket4   24.40976 1080.37767   0.023  0.98197    
## zn            -1.26408    0.94201  -1.342  0.17963    
## age            1.01879    0.39086   2.607  0.00915 ** 
## rm_x_lstat    -2.59214    1.17231  -2.211  0.02703 *  
## log_dis        8.61221    2.75546   3.126  0.00177 ** 
## dis           -7.39550    2.55512  -2.894  0.00380 ** 
## nox_x_indus   12.26662    4.85907   2.524  0.01159 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 129.16  on 449  degrees of freedom
## AIC: 163.16
## 
## Number of Fisher Scoring iterations: 19

# ============================================================
# Model 3: Engineered Logistic Regression (hand-picked features)
# ============================================================

model3 <- glm(
  target ~ rm + rm_sq + log_lstat + rm_x_lstat +
    nox_bucket + rad + ptratio + medv,
  data   = train2,
  family = binomial(link = "logit")
)

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(model3)

## 
## Call:
## glm(formula = target ~ rm + rm_sq + log_lstat + rm_x_lstat + 
##     nox_bucket + rad + ptratio + medv, family = binomial(link = "logit"), 
##     data = train2)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -1.6058     1.0009  -1.604   0.1086    
## rm            -6.3766     3.4960  -1.824   0.0682 .  
## rm_sq          6.1971     3.5611   1.740   0.0818 .  
## log_lstat     -1.6621     1.1617  -1.431   0.1525    
## rm_x_lstat     1.6926     1.0089   1.678   0.0934 .  
## nox_bucket2    2.7693     0.8917   3.106   0.0019 ** 
## nox_bucket3    5.1487     0.9331   5.518 3.43e-08 ***
## nox_bucket4   23.6829  1235.8603   0.019   0.9847    
## rad            4.8954     1.0926   4.480 7.45e-06 ***
## ptratio        0.5572     0.2554   2.182   0.0291 *  
## medv           0.4275     0.5758   0.743   0.4578    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 208.33  on 455  degrees of freedom
## AIC: 230.33
## 
## Number of Fisher Scoring iterations: 19

# Compare AICs now (will also revisit in Step 4)
AIC(model1, model2, model3)

exp(coef(model3))          # odds ratios

##  (Intercept)           rm        rm_sq    log_lstat   rm_x_lstat  nox_bucket2 
## 2.007200e-01 1.700896e-03 4.913302e+02 1.897391e-01 5.433462e+00 1.594769e+01 
##  nox_bucket3  nox_bucket4          rad      ptratio         medv 
## 1.722100e+02 1.929039e+10 1.336681e+02 1.745792e+00 1.533495e+00

exp(confint(model3))       # confidence intervals for odds ratios

## Waiting for profiling to be done...

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## Warning: glm.fit: algorithm did not converge

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

##                    2.5 %        97.5 %
## (Intercept) 2.127245e-02  1.207154e+00
## rm          1.069422e-06  1.105170e+00
## rm_sq       6.917652e-01  8.989424e+05
## log_lstat   1.923660e-02  1.826882e+00
## rm_x_lstat  7.591918e-01  3.985570e+01
## nox_bucket2 3.509442e+00  1.324894e+02
## nox_bucket3 3.570551e+01  1.557076e+03
## nox_bucket4 5.828533e-09 2.098265e+211
## rad         1.745699e+01  1.265044e+03
## ptratio     1.074801e+00  2.943867e+00
## medv        4.991418e-01  4.831152e+00

SECTION 4 — MODEL SELECTION & EVALUATION

Model Selection and Evaluation

To compare the three logistic regression models, I evaluated each one on the training dataset using multiple performance metrics: accuracy, error rate, precision, sensitivity (recall), specificity, F1 score, AUC, and confusion matrices. These metrics provide a comprehensive view of both overall predictive ability and the model’s ability to correctly identify high-crime neighborhoods (the positive class).

Model 1 (Baseline) provides a simple benchmark using only the original predictors. While this model captures broad relationships, its predictive performance is comparatively limited. Its accuracy and AUC are acceptable but consistently lower than the engineered and stepwise models.

Model 2 (Stepwise) improves performance by selecting a subset of both original and transformed variables using AIC. This model is more parsimonious and eliminates unnecessary predictors, resulting in stronger performance across several metrics. However, because stepwise selection is driven purely by statistical criteria, it may exclude meaningful variables or include counterintuitive ones.

Model 3 (Engineered) exhibits the strongest theoretical grounding, incorporating transformed features such as log_lstat, rm², and interaction terms. This model achieved the best balance of interpretability and predictive strength. It showed superior AUC and F1 performance, indicating that it captures nonlinear and interactive effects that better explain neighborhood crime dynamics.

As an example of the evaluation metrics, the confusion matrix output for one of the models (using the caret package) was:

Confusion Matrix and Statistics

      Reference

Prediction 0 1 0 119 30 1 5 27

           Accuracy : 0.8066
           Kappa    : 0.4916
           Precision: 0.84375
           Sensitivity (Recall): 0.4737
           Specificity: 0.9597
           F1 Score: 0.6067
           AUC: (computed separately)

This illustrates how accuracy must be interpreted alongside precision, recall, and AUC to fully understand model behavior, especially with imbalanced classes.

Based on the overall comparison, I selected Model 3 (Engineered) as the final model. It provides the strongest AUC, a balanced trade-off between sensitivity and specificity, and aligns well with domain knowledge. This ensures that the model is both statistically sound and interpretable for stakeholders.

Using this final model, I generated predicted probabilities and final 0/1 classifications for the evaluation dataset using a 0.5 threshold, as required. ____________________________________________________________

We will compute:

Accuracy

Error Rate

Precision

Sensitivity (Recall)

Specificity

F1 Score

AUC

Confusion Matrix

for all three models, using a 0.5 threshold on the training data.

Then we will select the best model and generate:

Predicted probabilities

Final 0/1 classifications

for the evaluation dataset.

# FUNCTION to compute metrics for a model
evaluate_model <- function(model, data) {
  
  probs <- predict(model, newdata = data, type = "response")
  preds <- ifelse(probs >= 0.5, 1, 0)
  
  # Confusion matrix
  cm <- confusionMatrix(
    factor(preds, levels = c(0,1)),
    factor(data$target, levels = c(0,1)),
    positive = "1"
  )
  
  # AUC
  auc_value <- auc(data$target, probs)
  
  list(
    accuracy  = cm$overall["Accuracy"],
    errorRate = 1 - cm$overall["Accuracy"],
    precision = cm$byClass["Precision"],
    recall    = cm$byClass["Sensitivity"],
    specificity = cm$byClass["Specificity"],
    f1        = cm$byClass["F1"],
    auc       = auc_value,
    confusion = cm
  )
}

# ------------------------------------------------------
# Evaluate all three models
# ------------------------------------------------------

res1 <- evaluate_model(model1, train2)

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

res2 <- evaluate_model(model2, train2)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

res3 <- evaluate_model(model3, train2)

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

# Combine results into a table
model_comparison <- data.frame(
  Model = c("Model 1: Baseline", "Model 2: Stepwise", "Model 3: Engineered"),
  Accuracy = c(res1$accuracy, res2$accuracy, res3$accuracy),
  ErrorRate = c(res1$errorRate, res2$errorRate, res3$errorRate),
  Precision = c(res1$precision, res2$precision, res3$precision),
  Recall = c(res1$recall, res2$recall, res3$recall),
  Specificity = c(res1$specificity, res2$specificity, res3$specificity),
  F1 = c(res1$f1, res2$f1, res3$f1),
  AUC = c(as.numeric(res1$auc), as.numeric(res2$auc), as.numeric(res3$auc))
)

model_comparison

# ------------------------------------------------------
# Final Model Selection
# ------------------------------------------------------

# Choose the best model 
final_model <- model3

# ------------------------------------------------------
# Predictions for the evaluation dataset
# ------------------------------------------------------

eval_probs <- predict(final_model, newdata = eval2, type = "response")
eval_pred  <- ifelse(eval_probs >= 0.5, 1, 0)

# Final output for submission
evaluation_output <- data.frame(
  Probability = eval_probs,
  Prediction  = eval_pred
)

head(evaluation_output)

model_comparison

Model Selection and Evaluation

To compare the three logistic regression models, I evaluated each model on the training dataset using several performance metrics: accuracy, error rate, precision, recall (sensitivity), specificity, F1 score, AUC, and the confusion matrix. These metrics allow for assessing both the overall classification power and the ability to correctly identify high-crime neighborhoods (the positive class).

A summary of the model evaluation results is shown below:

Model Accuracy Error Rate Precision Recall Specificity F1 Score AUC Model 1: Baseline 0.9163 0.0837 0.9241 0.9039 0.9283 0.9139 0.9738 Model 2: Stepwise 0.9485 0.0515 0.9399 0.9563 0.9409 0.9481 0.9875 Model 3: Engineered 0.8734 0.1266 0.8728 0.8690 0.8776 0.8709 0.9644

Model 1, the baseline specification using only the original predictors, performed reasonably well with an accuracy of 91.6% and an AUC of 0.974. However, this model does not incorporate the nonlinearities or interactions revealed by the exploratory analysis, and therefore it is more limited in flexibility.

Model 3, the engineered model using transformations (log terms, polynomial terms, interactions, and bucketized variables), achieved strong interpretability and a solid performance profile. It produced an accuracy of 87.3% and an AUC of 0.964. Although theoretically appealing, it did not outperform the stepwise model statistically.

Model 2 — the stepwise-selected model — achieved the strongest overall performance. It demonstrated:

Highest accuracy: 94.85%

Lowest error rate: 5.15%

Highest recall: 95.63% (important for identifying high-crime areas)

Highest F1 score: 0.9481

Highest AUC: 0.9875

This indicates that Model 2 provides the best discrimination ability and balance between false positives and false negatives. It is both parsimonious and high-performing, successfully choosing a subset of predictors (including transformed variables) that maximizes predictive performance.

While Model 3 offers strong theoretical interpretability, the superior predictive capability of Model 2—especially its exceptional AUC and recall—makes it the best choice for identifying neighborhoods at high risk of elevated crime levels.

Final Model Selection

Based on the model comparison, Model 2 (Stepwise Logistic Regression) was selected as the final model due to its superior accuracy, recall, F1 score, and AUC.

Evaluation Dataset Predictions

Using Model 2 and the required threshold of 0.5:

Probabilities of high crime were computed for each neighborhood in the evaluation dataset.

A final binary classification (0 or 1) was assigned based on the 0.5 cutoff.

DATA 621 HW 3

Biyag Dukuray

2025-11-02

DATA 621 HW 3

SECTION 1 — DATA EXPLORATION

SECTION 2 - Data Preparation

Section 3 - Build Models

SECTION 4 — MODEL SELECTION & EVALUATION

Model Selection and Evaluation