Introduction

The following RMD contains CUNY SPS DATA 621 Fall 2025 context for the Homework 03 assignment.

Data Exploration

Our crime dataset includes 13 features across Boston suburbs, capturing housing, environmental, and socioeconomic factors. The target variable, High_Crime_Area, indicates whether a suburb’s crime rate is above or below the median, allowing us to explore which neighborhood attributes are most associated with higher crime.

Load Libraries

library(tidyverse)
library(ggcorrplot)
library(GGally)
library(ROCR)
library(dplyr)

Download and describe the crime training data set

# Import the provided data
crime_raw <- read_csv("https://raw.githubusercontent.com/evanskaylie/DATA621/refs/heads/main/crime-training-data_modified.csv")

# Rename the features
crime_df <- crime_raw |>
  rename(
    Large_Lot_Zone_Pct = zn, # proportion of residential land zoned for large lots (over 25000 square feet) 
    Industrial_Land_Pct = indus, # proportion of non-retail business acres per suburb
    Borders_Charles_River = chas, # a dummy var. for whether the suburb borders the Charles River (1) or not (0) 
    NOx_Concentration = nox, # nitrogen oxides concentration (parts per 10 million) 
    Avg_Rooms_Per_Dwelling = rm, # average number of rooms per dwelling 
    Old_Homes_Pct = age, # proportion of owner-occupied units built prior to 1940
    Distance_to_Employment_Centers = dis, # weighted mean of distances to five Boston employment centers 
    Highway_Access_Index = rad, # index of accessibility to radial highways 
    Property_Tax_Rate = tax, # full-value property-tax rate per $10,000
    Pupil_Teacher_Ratio = ptratio, # pupil-teacher ratio by town
    Low_Status_Pop_Pct = lstat, # lower status of the population (percent) 
    Median_Home_Value = medv, # median value of owner-occupied homes in $1000s
    High_Crime_Area = target # whether the crime rate is above the median crime rate (1) or not (0)
  )

# Count missing vals and summarize the provided data
colSums(is.na(crime_df))
##             Large_Lot_Zone_Pct            Industrial_Land_Pct 
##                              0                              0 
##          Borders_Charles_River              NOx_Concentration 
##                              0                              0 
##         Avg_Rooms_Per_Dwelling                  Old_Homes_Pct 
##                              0                              0 
## Distance_to_Employment_Centers           Highway_Access_Index 
##                              0                              0 
##              Property_Tax_Rate            Pupil_Teacher_Ratio 
##                              0                              0 
##             Low_Status_Pop_Pct              Median_Home_Value 
##                              0                              0 
##                High_Crime_Area 
##                              0
summary(crime_df)
##  Large_Lot_Zone_Pct Industrial_Land_Pct Borders_Charles_River NOx_Concentration
##  Min.   :  0.00     Min.   : 0.460      Min.   :0.00000       Min.   :0.3890   
##  1st Qu.:  0.00     1st Qu.: 5.145      1st Qu.:0.00000       1st Qu.:0.4480   
##  Median :  0.00     Median : 9.690      Median :0.00000       Median :0.5380   
##  Mean   : 11.58     Mean   :11.105      Mean   :0.07082       Mean   :0.5543   
##  3rd Qu.: 16.25     3rd Qu.:18.100      3rd Qu.:0.00000       3rd Qu.:0.6240   
##  Max.   :100.00     Max.   :27.740      Max.   :1.00000       Max.   :0.8710   
##  Avg_Rooms_Per_Dwelling Old_Homes_Pct    Distance_to_Employment_Centers
##  Min.   :3.863          Min.   :  2.90   Min.   : 1.130                
##  1st Qu.:5.887          1st Qu.: 43.88   1st Qu.: 2.101                
##  Median :6.210          Median : 77.15   Median : 3.191                
##  Mean   :6.291          Mean   : 68.37   Mean   : 3.796                
##  3rd Qu.:6.630          3rd Qu.: 94.10   3rd Qu.: 5.215                
##  Max.   :8.780          Max.   :100.00   Max.   :12.127                
##  Highway_Access_Index Property_Tax_Rate Pupil_Teacher_Ratio Low_Status_Pop_Pct
##  Min.   : 1.00        Min.   :187.0     Min.   :12.6        Min.   : 1.730    
##  1st Qu.: 4.00        1st Qu.:281.0     1st Qu.:16.9        1st Qu.: 7.043    
##  Median : 5.00        Median :334.5     Median :18.9        Median :11.350    
##  Mean   : 9.53        Mean   :409.5     Mean   :18.4        Mean   :12.631    
##  3rd Qu.:24.00        3rd Qu.:666.0     3rd Qu.:20.2        3rd Qu.:16.930    
##  Max.   :24.00        Max.   :711.0     Max.   :22.0        Max.   :37.970    
##  Median_Home_Value High_Crime_Area 
##  Min.   : 5.00     Min.   :0.0000  
##  1st Qu.:17.02     1st Qu.:0.0000  
##  Median :21.20     Median :0.0000  
##  Mean   :22.59     Mean   :0.4914  
##  3rd Qu.:25.00     3rd Qu.:1.0000  
##  Max.   :50.00     Max.   :1.0000

There are no missing values found in the data set, so no imputation is necessary

Most of the data shows older homes (median 77%), moderate pollution levels (median NOx ~0.54), and varying highway access, all of which show strong potential links to high-crime areas. Key outliers include a few suburbs with very high property taxes, large lots, or extreme highway access, which may also influence crime patterns.

# Standardize numeric features
crime_standardized <- crime_df |>
  select(where(is.numeric)) |>
  mutate(across(everything(), scale)) |>
  pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value")

# Feature distribution violin plots
crime_standardized |>
  ggplot(aes(x = Feature, y = Value)) +
  geom_violin(fill = "steelblue", alpha = 0.6) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(
      angle = 45,
      hjust = 1,
      face = ifelse(levels(factor(crime_standardized$Feature)) == "High_Crime_Area", "bold", "plain")
    )
  ) +
  labs(
    title = "Data Snapshot: Distribution of Standardized Features",
    subtitle = "High_Crime_Area highlighted in bold",
    x = NULL,
    y = "Standardized Value"
  )

All numeric predictors standardized for comparison; binary target is excluded from violin plot. These plots show the distribution of values across each feature. Standardization is for visualization clarity; logistic regression uses raw values. These distributions show that some predictors are skewed, which may influence the model’s sensitivity to extreme values and help us understand where transformations or careful interpretation might be needed.

# Compute correlation matrix
corr <- crime_df |> 
  select(where(is.numeric)) |> 
  cor(use = "complete.obs")

# Compute correlations with target
cor_with_target <- corr["High_Crime_Area", ] |> 
  enframe(name = "Feature", value = "Correlation") |> 
  filter(Feature != "High_Crime_Area") |> 
  arrange(desc(Correlation))

# Reorder the correlation matrix by target correlation
feature_order <- cor_with_target$Feature
corr_ordered <- corr[c("High_Crime_Area", feature_order), c("High_Crime_Area", feature_order)]

# Correlation heat plot
ggcorrplot(corr_ordered, 
           method = "circle", 
           type = "lower", 
           lab = TRUE, 
           lab_size = 3, 
           colors = c("salmon", "white", "steelblue"), 
           title = "Correlation Matrix", 
           ggtheme = theme_minimal() +
             theme(
               axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
               axis.text.y = element_text(size = 8),
               panel.spacing = unit(0.1, "lines")  # tighter spacing between panels
             )) +
          coord_fixed(ratio = .8)

The most important part of this graph is the bottom row. This shows each predictor’s correlation with the High_Crime_Area values. By ranking features by absolute value of correlations with High_Crime_Area, we can see that NOx_Concentration, Old_Homes_Pct, and Highway_Access_Index are the top predictors, with Distance_to_Employment_Centers has a strong negative relationship

# Get the features with the most correlation
top_features <- c("NOx_Concentration",
                  "Old_Homes_Pct",
                  "Highway_Access_Index",
                  "Distance_to_Employment_Centers")

# Visualized jittered scatter plots
crime_df |>
  select(all_of(top_features), High_Crime_Area) |>
  pivot_longer(cols = -High_Crime_Area, names_to = "Feature", values_to = "Value") |>
  ggplot(aes(x = Value, y = factor(High_Crime_Area), color = factor(High_Crime_Area))) +
  geom_jitter(height = 0.1, alpha = 0.6) +
  facet_wrap(~ Feature, scales = "free_x") +
  theme_minimal() +
  labs(title = "Predictor vs High Crime Area",
       y = "High Crime Area (0 = No, 1 = Yes)",
       x = "Feature Value",
       color = "High Crime Area")

These scatter plots show the distribution of highly correlated features across areas considered lower crime (0 for High Crime Area), and areas considered higher crime (1 for High Crime Area). In all four of these plots, the distribution of values varies clearly between lower and higher crime areas.

The plots show clear differences between high and low crime suburbs:

High crime areas tend to:

  • be closer to employment centers

  • have higher highway access (with some extreme outliers)

  • have higher NOx concentrations

  • be a larger proportion of older homes.

Low-crime areas tend to:

  • be more evenly distributed across these features.

These patterns suggest that proximity to employment, transportation access, pollution levels, and housing age may be important indicators of high-crime areas.

Other features show weaker relationships with high crime areas and are not visualized here to keep the story clear.

Data Preparation

Data Transfomations + Feature Engineering

Renaming variables: All features were given clear names, like changing rm to Avg_Rooms_Per_Dwelling. This makes tables and plots easier to read and interpret.

Standardized numbers: All numeric features were scaled so they have the same range (mean 0, standard deviation 1) for the comparisons in the violin plots. This makes for fair comparisons of distributions, even though the original features are on different scales.

Feature selection: The features most strongly linked to high crime were chosen for the jitter correlation plots. This helps show clear patterns in the plots without showing all 12 features.

Build Models

In this section, three different binary logistic regression models are fit to predict whether a suburb is a high-crime area. Each model uses different features or transformations to highlight potential drivers of high crime.

Model 1: Using the four most related features: NOx_Concentration, Old_Homes_Pct, Highway_Access_Index, and Distance_to_Employment_Centers. This is to see how the most correlated features relate to crime.

Model 2: Stepwise selection added socioeconomic factors like Property_Tax_Rate, Pupil_Teacher_Ratio, and Median_Home_Value, which improved predictions and included community and environmental information.

Model 3: Combined some related features (Highway Access × Property Tax and Old Homes × NOx) to look at how features work together to affect crime.

Model 1/3: Top Correlation Predictors

# Predict on top correlation features
model1 <- glm(High_Crime_Area ~ NOx_Concentration +
                Old_Homes_Pct +
                Highway_Access_Index + 
                Distance_to_Employment_Centers,
                data = crime_df, family = binomial
              )

# View model details
summary(model1)
## 
## Call:
## glm(formula = High_Crime_Area ~ NOx_Concentration + Old_Homes_Pct + 
##     Highway_Access_Index + Distance_to_Employment_Centers, family = binomial, 
##     data = crime_df)
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -20.224999   3.199415  -6.321 2.59e-10 ***
## NOx_Concentration               28.296396   5.071309   5.580 2.41e-08 ***
## Old_Homes_Pct                    0.017349   0.009008   1.926   0.0541 .  
## Highway_Access_Index             0.521295   0.109765   4.749 2.04e-06 ***
## Distance_to_Employment_Centers   0.239127   0.147455   1.622   0.1049    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 234.17  on 461  degrees of freedom
## AIC: 244.17
## 
## Number of Fisher Scoring iterations: 8
# Add predicted probabilities for plot
model_1_crime <- crime_df |> 
  mutate(Model1_Pred = predict(model1, type = "response"))

# Plot predicted probability vs top features
model_1_crime |> 
  pivot_longer(cols = all_of(top_features), names_to = "Feature", values_to = "Value") |>
  ggplot(aes(x = Value, y = Model1_Pred)) +
  geom_point(aes(y = High_Crime_Area), alpha = 0.3, color = "steelblue") +
  geom_smooth(se = FALSE, color = "salmon", size = 1, method = "loess") +
  facet_wrap(~ Feature, scales = "free_x") +
  labs(title = "Model 1: Predicted Probability vs Top Correlated Features",
       y = "Predicted Probability of High Crime",
       x = "Feature Value") +
  theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

NOx_Concentration and Highway_Access_Index are strongly positive and highly significant predictors (p < 0.001). This means suburbs with higher pollution levels or better highway access are much more likely to be high-crime areas.

Old_Homes_Pct is somewhat significant (p ~ 0.054), suggesting older housing may slightly increase crime risk.

Distance_to_Employment_Centers is not statistically significant (p ~ 0.10), indicating it has a weaker or less clear effect on crime in this model.

Model 2/3: Stepwise Based on AIC for Predictor Selection

# Get full model
full_model <- glm(High_Crime_Area ~ . , data = crime_df, family = binomial)

# Do silent stepwise selection
model2 <- step(full_model, direction = "both", trace = 0) 

# View model details
summary(model2)
## 
## Call:
## glm(formula = High_Crime_Area ~ Large_Lot_Zone_Pct + NOx_Concentration + 
##     Old_Homes_Pct + Distance_to_Employment_Centers + Highway_Access_Index + 
##     Property_Tax_Rate + Pupil_Teacher_Ratio + Median_Home_Value, 
##     family = binomial, data = crime_df)
## 
## Coefficients:
##                                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    -37.415922   6.035013  -6.200 5.65e-10 ***
## Large_Lot_Zone_Pct              -0.068648   0.032019  -2.144  0.03203 *  
## NOx_Concentration               42.807768   6.678692   6.410 1.46e-10 ***
## Old_Homes_Pct                    0.032950   0.010951   3.009  0.00262 ** 
## Distance_to_Employment_Centers   0.654896   0.214050   3.060  0.00222 ** 
## Highway_Access_Index             0.725109   0.149788   4.841 1.29e-06 ***
## Property_Tax_Rate               -0.007756   0.002653  -2.924  0.00346 ** 
## Pupil_Teacher_Ratio              0.323628   0.111390   2.905  0.00367 ** 
## Median_Home_Value                0.110472   0.035445   3.117  0.00183 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 197.32  on 457  degrees of freedom
## AIC: 215.32
## 
## Number of Fisher Scoring iterations: 9
# Add predicted probabilities from Model 2
model_2_crime <- crime_df |> 
  mutate(Model2_Pred = predict(model2, type = "response"))

# Get the features used in model2
model2_features <- names(coef(model2))[-1]

model_2_crime |> 
  pivot_longer(cols = all_of(model2_features), names_to = "Feature", values_to = "Value") |>
  ggplot(aes(x = Value, y = Model2_Pred)) +
  geom_point(aes(y = High_Crime_Area), alpha = 0.3, color = "steelblue") +
  geom_smooth(se = FALSE, color = "salmon", size = 1, method = "loess") +
  facet_wrap(~ Feature, scales = "free_x") +
  labs(title = "Model 2: Predicted Probability vs Stepwise Selected Features",
       y = "Predicted Probability of High Crime",
       x = "Feature Value") +
  theme_minimal()

NOx_Concentration, Old_Homes_Pct, Highway_Access_Index, Distance_to_Employment_Centers, Pupil_Teacher_Ratio, and Median_Home_Value all have positive and significant effects, meaning suburbs with higher pollution, older homes, better highway access, closer proximity to employment centers, higher pupil-to-teacher ratios, and higher home values are more likely to be high crime areas.

Property_Tax_Rate has a small negative effect, suggesting that higher property taxes slightly reduce the likelihood of high crime.

Large_Lot_Zone_Pct also has a modest negative effect, indicating that suburbs with more large-lot residential areas tend to have slightly lower crime.

Select Models

After comparing all three models, Model 2 is the best choice. It balances strong predictive power with interpretability, including both environmental and socioeconomic factors that make sense in the real world. Model 1 is simpler but misses important community features, and Model 3 adds interactions that slightly complicate the story without significantly improving performance. Using Model 2, we can explain which factors are linked to high-crime areas while keeping the model easy to understand.

Metric Analysis

# Add predicted probabilities and labels
model_2_crime <- model_2_crime |>
  mutate(
    Pred_Label = ifelse(Model2_Pred > 0.5, "High Crime", "Low Crime"),
    Actual_Label = ifelse(High_Crime_Area == 1, "High Crime", "Low Crime")
  )

# Confusion matrix
cm_selected <- table(Predicted = model_2_crime$Pred_Label, Actual = model_2_crime$Actual_Label)

# Extract values
TP <- cm_selected["High Crime","High Crime"]
TN <- cm_selected["Low Crime","Low Crime"]
FP <- cm_selected["High Crime","Low Crime"]
FN <- cm_selected["Low Crime","High Crime"]

# Calculate key metrics
accuracy <- (TP + TN) / sum(cm_selected)
classification_error <- 1 - accuracy  # (b)
precision <- TP / (TP + FP)
sensitivity <- TP / (TP + FN)   # (d) recall
specificity <- TN / (TN + FP)   # (e)
f1 <- 2 * (precision * sensitivity) / (precision + sensitivity)
pred <- prediction(model_2_crime$Model2_Pred, model_2_crime$High_Crime_Area)
auc <- performance(pred, measure = "auc")@y.values[[1]]

# Display metrics
cat("(a) Accuracy:", round(accuracy,3), "\n")
## (a) Accuracy: 0.912
cat("(b) Classification Error Rate:", round(classification_error,3), "\n")
## (b) Classification Error Rate: 0.088
cat("(c) Precision:", round(precision,3), "\n")
## (c) Precision: 0.916
cat("(d) Sensitivity:", round(sensitivity,3), "\n")
## (d) Sensitivity: 0.904
cat("(e) Specificity:", round(specificity,3), "\n")
## (e) Specificity: 0.92
cat("(f) F1 Score:", round(f1,3), "\n")
## (f) F1 Score: 0.91
cat("(g) AUC:", round(auc,3), "\n")
## (g) AUC: 0.972
cat("(h) Confusion Matrix:\n")
## (h) Confusion Matrix:
print(cm_selected)
##             Actual
## Predicted    High Crime Low Crime
##   High Crime        207        19
##   Low Crime          22       218

Make predictions with evaluation data

# Import the test data
crime_test_raw <- read_csv("https://raw.githubusercontent.com/evanskaylie/DATA621/refs/heads/main/crime-evaluation-data_modified.csv")

# Rename the features the same way as the training set
crime_test <- crime_test_raw |>
  rename(
    Large_Lot_Zone_Pct = zn, # proportion of residential land zoned for large lots (over 25000 square feet) 
    Industrial_Land_Pct = indus, # proportion of non-retail business acres per suburb
    Borders_Charles_River = chas, # a dummy var. for whether the suburb borders the Charles River (1) or not (0) 
    NOx_Concentration = nox, # nitrogen oxides concentration (parts per 10 million) 
    Avg_Rooms_Per_Dwelling = rm, # average number of rooms per dwelling 
    Old_Homes_Pct = age, # proportion of owner-occupied units built prior to 1940
    Distance_to_Employment_Centers = dis, # weighted mean of distances to five Boston employment centers 
    Highway_Access_Index = rad, # index of accessibility to radial highways 
    Property_Tax_Rate = tax, # full-value property-tax rate per $10,000
    Pupil_Teacher_Ratio = ptratio, # pupil-teacher ratio by town
    Low_Status_Pop_Pct = lstat, # lower status of the population (percent) 
    Median_Home_Value = medv # median value of owner-occupied homes in $1000s
  )

# Predict probabilities for the test set
crime_test <- crime_test |>
  (\(df) mutate(df, Pred_Prob = predict(model2, newdata = df, type = "response")))()

# Assign predicted labels using 0.5 cutoff
crime_test <- crime_test |>
  mutate(Pred_Label = ifelse(Pred_Prob > 0.5, "High Crime", "Low Crime"))

# Density plot of predicted probabilities
ggplot(crime_test, aes(x = Pred_Prob, fill = Pred_Label)) +
  geom_density(alpha = 0.5) +
  scale_fill_manual(values = c("Low Crime" = "steelblue", "High Crime" = "salmon")) +
  theme_minimal() +
  labs(title = "Predicted Probability Distribution for High Crime Areas",
       x = "Predicted Probability of High Crime",
       y = "Density",
       fill = "Predicted Category")

Higher probabilities mean higher confidence the area is high crime. Lower probabilities mean higher confidence the area is low crime.

Most predictions cluster near 0 or 1, meaning the model is quite confident for most suburbs. They are assigned clearly as low or high crime.

The middle of the graph represents where the model is least certain about predictions. Any minor overlap on the plot is just the way the smoothing works and doesn’t mean the model is misclassifying suburbs.

Since we don’t know the true crime status in the test set, these predictions are for future evaluation only.