The following RMD contains CUNY SPS DATA 621 Fall 2025 context for the Homework 03 assignment.
Our crime dataset includes 13 features across Boston suburbs, capturing housing, environmental, and socioeconomic factors. The target variable, High_Crime_Area, indicates whether a suburb’s crime rate is above or below the median, allowing us to explore which neighborhood attributes are most associated with higher crime.
library(tidyverse)
library(ggcorrplot)
library(GGally)
library(ROCR)
library(dplyr)
# Import the provided data
crime_raw <- read_csv("https://raw.githubusercontent.com/evanskaylie/DATA621/refs/heads/main/crime-training-data_modified.csv")
# Rename the features
crime_df <- crime_raw |>
rename(
Large_Lot_Zone_Pct = zn, # proportion of residential land zoned for large lots (over 25000 square feet)
Industrial_Land_Pct = indus, # proportion of non-retail business acres per suburb
Borders_Charles_River = chas, # a dummy var. for whether the suburb borders the Charles River (1) or not (0)
NOx_Concentration = nox, # nitrogen oxides concentration (parts per 10 million)
Avg_Rooms_Per_Dwelling = rm, # average number of rooms per dwelling
Old_Homes_Pct = age, # proportion of owner-occupied units built prior to 1940
Distance_to_Employment_Centers = dis, # weighted mean of distances to five Boston employment centers
Highway_Access_Index = rad, # index of accessibility to radial highways
Property_Tax_Rate = tax, # full-value property-tax rate per $10,000
Pupil_Teacher_Ratio = ptratio, # pupil-teacher ratio by town
Low_Status_Pop_Pct = lstat, # lower status of the population (percent)
Median_Home_Value = medv, # median value of owner-occupied homes in $1000s
High_Crime_Area = target # whether the crime rate is above the median crime rate (1) or not (0)
)
# Count missing vals and summarize the provided data
colSums(is.na(crime_df))
## Large_Lot_Zone_Pct Industrial_Land_Pct
## 0 0
## Borders_Charles_River NOx_Concentration
## 0 0
## Avg_Rooms_Per_Dwelling Old_Homes_Pct
## 0 0
## Distance_to_Employment_Centers Highway_Access_Index
## 0 0
## Property_Tax_Rate Pupil_Teacher_Ratio
## 0 0
## Low_Status_Pop_Pct Median_Home_Value
## 0 0
## High_Crime_Area
## 0
summary(crime_df)
## Large_Lot_Zone_Pct Industrial_Land_Pct Borders_Charles_River NOx_Concentration
## Min. : 0.00 Min. : 0.460 Min. :0.00000 Min. :0.3890
## 1st Qu.: 0.00 1st Qu.: 5.145 1st Qu.:0.00000 1st Qu.:0.4480
## Median : 0.00 Median : 9.690 Median :0.00000 Median :0.5380
## Mean : 11.58 Mean :11.105 Mean :0.07082 Mean :0.5543
## 3rd Qu.: 16.25 3rd Qu.:18.100 3rd Qu.:0.00000 3rd Qu.:0.6240
## Max. :100.00 Max. :27.740 Max. :1.00000 Max. :0.8710
## Avg_Rooms_Per_Dwelling Old_Homes_Pct Distance_to_Employment_Centers
## Min. :3.863 Min. : 2.90 Min. : 1.130
## 1st Qu.:5.887 1st Qu.: 43.88 1st Qu.: 2.101
## Median :6.210 Median : 77.15 Median : 3.191
## Mean :6.291 Mean : 68.37 Mean : 3.796
## 3rd Qu.:6.630 3rd Qu.: 94.10 3rd Qu.: 5.215
## Max. :8.780 Max. :100.00 Max. :12.127
## Highway_Access_Index Property_Tax_Rate Pupil_Teacher_Ratio Low_Status_Pop_Pct
## Min. : 1.00 Min. :187.0 Min. :12.6 Min. : 1.730
## 1st Qu.: 4.00 1st Qu.:281.0 1st Qu.:16.9 1st Qu.: 7.043
## Median : 5.00 Median :334.5 Median :18.9 Median :11.350
## Mean : 9.53 Mean :409.5 Mean :18.4 Mean :12.631
## 3rd Qu.:24.00 3rd Qu.:666.0 3rd Qu.:20.2 3rd Qu.:16.930
## Max. :24.00 Max. :711.0 Max. :22.0 Max. :37.970
## Median_Home_Value High_Crime_Area
## Min. : 5.00 Min. :0.0000
## 1st Qu.:17.02 1st Qu.:0.0000
## Median :21.20 Median :0.0000
## Mean :22.59 Mean :0.4914
## 3rd Qu.:25.00 3rd Qu.:1.0000
## Max. :50.00 Max. :1.0000
There are no missing values found in the data set, so no imputation is necessary
Most of the data shows older homes (median 77%), moderate pollution levels (median NOx ~0.54), and varying highway access, all of which show strong potential links to high-crime areas. Key outliers include a few suburbs with very high property taxes, large lots, or extreme highway access, which may also influence crime patterns.
# Standardize numeric features
crime_standardized <- crime_df |>
select(where(is.numeric)) |>
mutate(across(everything(), scale)) |>
pivot_longer(cols = everything(), names_to = "Feature", values_to = "Value")
# Feature distribution violin plots
crime_standardized |>
ggplot(aes(x = Feature, y = Value)) +
geom_violin(fill = "steelblue", alpha = 0.6) +
theme_minimal() +
theme(
axis.text.x = element_text(
angle = 45,
hjust = 1,
face = ifelse(levels(factor(crime_standardized$Feature)) == "High_Crime_Area", "bold", "plain")
)
) +
labs(
title = "Data Snapshot: Distribution of Standardized Features",
subtitle = "High_Crime_Area highlighted in bold",
x = NULL,
y = "Standardized Value"
)
All numeric predictors standardized for comparison; binary target is excluded from violin plot. These plots show the distribution of values across each feature. Standardization is for visualization clarity; logistic regression uses raw values. These distributions show that some predictors are skewed, which may influence the model’s sensitivity to extreme values and help us understand where transformations or careful interpretation might be needed.
# Compute correlation matrix
corr <- crime_df |>
select(where(is.numeric)) |>
cor(use = "complete.obs")
# Compute correlations with target
cor_with_target <- corr["High_Crime_Area", ] |>
enframe(name = "Feature", value = "Correlation") |>
filter(Feature != "High_Crime_Area") |>
arrange(desc(Correlation))
# Reorder the correlation matrix by target correlation
feature_order <- cor_with_target$Feature
corr_ordered <- corr[c("High_Crime_Area", feature_order), c("High_Crime_Area", feature_order)]
# Correlation heat plot
ggcorrplot(corr_ordered,
method = "circle",
type = "lower",
lab = TRUE,
lab_size = 3,
colors = c("salmon", "white", "steelblue"),
title = "Correlation Matrix",
ggtheme = theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, size = 8),
axis.text.y = element_text(size = 8),
panel.spacing = unit(0.1, "lines") # tighter spacing between panels
)) +
coord_fixed(ratio = .8)
The most important part of this graph is the bottom row. This shows each predictor’s correlation with the High_Crime_Area values. By ranking features by absolute value of correlations with High_Crime_Area, we can see that NOx_Concentration, Old_Homes_Pct, and Highway_Access_Index are the top predictors, with Distance_to_Employment_Centers has a strong negative relationship
# Get the features with the most correlation
top_features <- c("NOx_Concentration",
"Old_Homes_Pct",
"Highway_Access_Index",
"Distance_to_Employment_Centers")
# Visualized jittered scatter plots
crime_df |>
select(all_of(top_features), High_Crime_Area) |>
pivot_longer(cols = -High_Crime_Area, names_to = "Feature", values_to = "Value") |>
ggplot(aes(x = Value, y = factor(High_Crime_Area), color = factor(High_Crime_Area))) +
geom_jitter(height = 0.1, alpha = 0.6) +
facet_wrap(~ Feature, scales = "free_x") +
theme_minimal() +
labs(title = "Predictor vs High Crime Area",
y = "High Crime Area (0 = No, 1 = Yes)",
x = "Feature Value",
color = "High Crime Area")
These scatter plots show the distribution of highly correlated features across areas considered lower crime (0 for High Crime Area), and areas considered higher crime (1 for High Crime Area). In all four of these plots, the distribution of values varies clearly between lower and higher crime areas.
The plots show clear differences between high and low crime suburbs:
High crime areas tend to:
be closer to employment centers
have higher highway access (with some extreme outliers)
have higher NOx concentrations
be a larger proportion of older homes.
Low-crime areas tend to:
These patterns suggest that proximity to employment, transportation access, pollution levels, and housing age may be important indicators of high-crime areas.
Other features show weaker relationships with high crime areas and are not visualized here to keep the story clear.
Renaming variables: All features were given clear names, like changing rm to Avg_Rooms_Per_Dwelling. This makes tables and plots easier to read and interpret.
Standardized numbers: All numeric features were scaled so they have the same range (mean 0, standard deviation 1) for the comparisons in the violin plots. This makes for fair comparisons of distributions, even though the original features are on different scales.
Feature selection: The features most strongly linked to high crime were chosen for the jitter correlation plots. This helps show clear patterns in the plots without showing all 12 features.
In this section, three different binary logistic regression models are fit to predict whether a suburb is a high-crime area. Each model uses different features or transformations to highlight potential drivers of high crime.
Model 1: Using the four most related features: NOx_Concentration, Old_Homes_Pct, Highway_Access_Index, and Distance_to_Employment_Centers. This is to see how the most correlated features relate to crime.
Model 2: Stepwise selection added socioeconomic factors like Property_Tax_Rate, Pupil_Teacher_Ratio, and Median_Home_Value, which improved predictions and included community and environmental information.
Model 3: Combined some related features (Highway Access × Property Tax and Old Homes × NOx) to look at how features work together to affect crime.
# Predict on top correlation features
model1 <- glm(High_Crime_Area ~ NOx_Concentration +
Old_Homes_Pct +
Highway_Access_Index +
Distance_to_Employment_Centers,
data = crime_df, family = binomial
)
# View model details
summary(model1)
##
## Call:
## glm(formula = High_Crime_Area ~ NOx_Concentration + Old_Homes_Pct +
## Highway_Access_Index + Distance_to_Employment_Centers, family = binomial,
## data = crime_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -20.224999 3.199415 -6.321 2.59e-10 ***
## NOx_Concentration 28.296396 5.071309 5.580 2.41e-08 ***
## Old_Homes_Pct 0.017349 0.009008 1.926 0.0541 .
## Highway_Access_Index 0.521295 0.109765 4.749 2.04e-06 ***
## Distance_to_Employment_Centers 0.239127 0.147455 1.622 0.1049
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 234.17 on 461 degrees of freedom
## AIC: 244.17
##
## Number of Fisher Scoring iterations: 8
# Add predicted probabilities for plot
model_1_crime <- crime_df |>
mutate(Model1_Pred = predict(model1, type = "response"))
# Plot predicted probability vs top features
model_1_crime |>
pivot_longer(cols = all_of(top_features), names_to = "Feature", values_to = "Value") |>
ggplot(aes(x = Value, y = Model1_Pred)) +
geom_point(aes(y = High_Crime_Area), alpha = 0.3, color = "steelblue") +
geom_smooth(se = FALSE, color = "salmon", size = 1, method = "loess") +
facet_wrap(~ Feature, scales = "free_x") +
labs(title = "Model 1: Predicted Probability vs Top Correlated Features",
y = "Predicted Probability of High Crime",
x = "Feature Value") +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
NOx_Concentration and Highway_Access_Index are strongly positive and highly significant predictors (p < 0.001). This means suburbs with higher pollution levels or better highway access are much more likely to be high-crime areas.
Old_Homes_Pct is somewhat significant (p ~ 0.054), suggesting older housing may slightly increase crime risk.
Distance_to_Employment_Centers is not statistically significant (p ~ 0.10), indicating it has a weaker or less clear effect on crime in this model.
# Get full model
full_model <- glm(High_Crime_Area ~ . , data = crime_df, family = binomial)
# Do silent stepwise selection
model2 <- step(full_model, direction = "both", trace = 0)
# View model details
summary(model2)
##
## Call:
## glm(formula = High_Crime_Area ~ Large_Lot_Zone_Pct + NOx_Concentration +
## Old_Homes_Pct + Distance_to_Employment_Centers + Highway_Access_Index +
## Property_Tax_Rate + Pupil_Teacher_Ratio + Median_Home_Value,
## family = binomial, data = crime_df)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -37.415922 6.035013 -6.200 5.65e-10 ***
## Large_Lot_Zone_Pct -0.068648 0.032019 -2.144 0.03203 *
## NOx_Concentration 42.807768 6.678692 6.410 1.46e-10 ***
## Old_Homes_Pct 0.032950 0.010951 3.009 0.00262 **
## Distance_to_Employment_Centers 0.654896 0.214050 3.060 0.00222 **
## Highway_Access_Index 0.725109 0.149788 4.841 1.29e-06 ***
## Property_Tax_Rate -0.007756 0.002653 -2.924 0.00346 **
## Pupil_Teacher_Ratio 0.323628 0.111390 2.905 0.00367 **
## Median_Home_Value 0.110472 0.035445 3.117 0.00183 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 197.32 on 457 degrees of freedom
## AIC: 215.32
##
## Number of Fisher Scoring iterations: 9
# Add predicted probabilities from Model 2
model_2_crime <- crime_df |>
mutate(Model2_Pred = predict(model2, type = "response"))
# Get the features used in model2
model2_features <- names(coef(model2))[-1]
model_2_crime |>
pivot_longer(cols = all_of(model2_features), names_to = "Feature", values_to = "Value") |>
ggplot(aes(x = Value, y = Model2_Pred)) +
geom_point(aes(y = High_Crime_Area), alpha = 0.3, color = "steelblue") +
geom_smooth(se = FALSE, color = "salmon", size = 1, method = "loess") +
facet_wrap(~ Feature, scales = "free_x") +
labs(title = "Model 2: Predicted Probability vs Stepwise Selected Features",
y = "Predicted Probability of High Crime",
x = "Feature Value") +
theme_minimal()
NOx_Concentration, Old_Homes_Pct, Highway_Access_Index, Distance_to_Employment_Centers, Pupil_Teacher_Ratio, and Median_Home_Value all have positive and significant effects, meaning suburbs with higher pollution, older homes, better highway access, closer proximity to employment centers, higher pupil-to-teacher ratios, and higher home values are more likely to be high crime areas.
Property_Tax_Rate has a small negative effect, suggesting that higher property taxes slightly reduce the likelihood of high crime.
Large_Lot_Zone_Pct also has a modest negative effect, indicating that suburbs with more large-lot residential areas tend to have slightly lower crime.
After comparing all three models, Model 2 is the best choice. It balances strong predictive power with interpretability, including both environmental and socioeconomic factors that make sense in the real world. Model 1 is simpler but misses important community features, and Model 3 adds interactions that slightly complicate the story without significantly improving performance. Using Model 2, we can explain which factors are linked to high-crime areas while keeping the model easy to understand.
# Add predicted probabilities and labels
model_2_crime <- model_2_crime |>
mutate(
Pred_Label = ifelse(Model2_Pred > 0.5, "High Crime", "Low Crime"),
Actual_Label = ifelse(High_Crime_Area == 1, "High Crime", "Low Crime")
)
# Confusion matrix
cm_selected <- table(Predicted = model_2_crime$Pred_Label, Actual = model_2_crime$Actual_Label)
# Extract values
TP <- cm_selected["High Crime","High Crime"]
TN <- cm_selected["Low Crime","Low Crime"]
FP <- cm_selected["High Crime","Low Crime"]
FN <- cm_selected["Low Crime","High Crime"]
# Calculate key metrics
accuracy <- (TP + TN) / sum(cm_selected)
classification_error <- 1 - accuracy # (b)
precision <- TP / (TP + FP)
sensitivity <- TP / (TP + FN) # (d) recall
specificity <- TN / (TN + FP) # (e)
f1 <- 2 * (precision * sensitivity) / (precision + sensitivity)
pred <- prediction(model_2_crime$Model2_Pred, model_2_crime$High_Crime_Area)
auc <- performance(pred, measure = "auc")@y.values[[1]]
# Display metrics
cat("(a) Accuracy:", round(accuracy,3), "\n")
## (a) Accuracy: 0.912
cat("(b) Classification Error Rate:", round(classification_error,3), "\n")
## (b) Classification Error Rate: 0.088
cat("(c) Precision:", round(precision,3), "\n")
## (c) Precision: 0.916
cat("(d) Sensitivity:", round(sensitivity,3), "\n")
## (d) Sensitivity: 0.904
cat("(e) Specificity:", round(specificity,3), "\n")
## (e) Specificity: 0.92
cat("(f) F1 Score:", round(f1,3), "\n")
## (f) F1 Score: 0.91
cat("(g) AUC:", round(auc,3), "\n")
## (g) AUC: 0.972
cat("(h) Confusion Matrix:\n")
## (h) Confusion Matrix:
print(cm_selected)
## Actual
## Predicted High Crime Low Crime
## High Crime 207 19
## Low Crime 22 218
# Import the test data
crime_test_raw <- read_csv("https://raw.githubusercontent.com/evanskaylie/DATA621/refs/heads/main/crime-evaluation-data_modified.csv")
# Rename the features the same way as the training set
crime_test <- crime_test_raw |>
rename(
Large_Lot_Zone_Pct = zn, # proportion of residential land zoned for large lots (over 25000 square feet)
Industrial_Land_Pct = indus, # proportion of non-retail business acres per suburb
Borders_Charles_River = chas, # a dummy var. for whether the suburb borders the Charles River (1) or not (0)
NOx_Concentration = nox, # nitrogen oxides concentration (parts per 10 million)
Avg_Rooms_Per_Dwelling = rm, # average number of rooms per dwelling
Old_Homes_Pct = age, # proportion of owner-occupied units built prior to 1940
Distance_to_Employment_Centers = dis, # weighted mean of distances to five Boston employment centers
Highway_Access_Index = rad, # index of accessibility to radial highways
Property_Tax_Rate = tax, # full-value property-tax rate per $10,000
Pupil_Teacher_Ratio = ptratio, # pupil-teacher ratio by town
Low_Status_Pop_Pct = lstat, # lower status of the population (percent)
Median_Home_Value = medv # median value of owner-occupied homes in $1000s
)
# Predict probabilities for the test set
crime_test <- crime_test |>
(\(df) mutate(df, Pred_Prob = predict(model2, newdata = df, type = "response")))()
# Assign predicted labels using 0.5 cutoff
crime_test <- crime_test |>
mutate(Pred_Label = ifelse(Pred_Prob > 0.5, "High Crime", "Low Crime"))
# Density plot of predicted probabilities
ggplot(crime_test, aes(x = Pred_Prob, fill = Pred_Label)) +
geom_density(alpha = 0.5) +
scale_fill_manual(values = c("Low Crime" = "steelblue", "High Crime" = "salmon")) +
theme_minimal() +
labs(title = "Predicted Probability Distribution for High Crime Areas",
x = "Predicted Probability of High Crime",
y = "Density",
fill = "Predicted Category")
Higher probabilities mean higher confidence the area is high crime. Lower probabilities mean higher confidence the area is low crime.
Most predictions cluster near 0 or 1, meaning the model is quite confident for most suburbs. They are assigned clearly as low or high crime.
The middle of the graph represents where the model is least certain about predictions. Any minor overlap on the plot is just the way the smoothing works and doesn’t mean the model is misclassifying suburbs.
Since we don’t know the true crime status in the test set, these predictions are for future evaluation only.