The data set used in this modeling is comprised of variables that potentially could be used in order to predict whether or not a neighborhood has a crime rate above the national average in an American metropolitan area (likely Boston since a variable refers to the Charles River). These variables are chosen to provide insight into the character of a neighborhood and its residents, including scientific measurements, such as nitrogen oxide concentration levels as well as statistics to help describe the neighborhood’s contents, such as proportion of non-retail businesses or average rooms in a dwelling. One positive aspect of this data set is that there were no missing data, so questions of imputing never arose. Most of the variables in this data set were numeric, with only the target variable and char (adjacency to the Charles River) being binary coded categorical variables.
The table below provides the five-number summary for all variables.
## zn indus chas nox
## Min. : 0.00 Min. : 0.460 Min. :0.00000 Min. :0.3890
## 1st Qu.: 0.00 1st Qu.: 5.145 1st Qu.:0.00000 1st Qu.:0.4480
## Median : 0.00 Median : 9.690 Median :0.00000 Median :0.5380
## Mean : 11.58 Mean :11.105 Mean :0.07082 Mean :0.5543
## 3rd Qu.: 16.25 3rd Qu.:18.100 3rd Qu.:0.00000 3rd Qu.:0.6240
## Max. :100.00 Max. :27.740 Max. :1.00000 Max. :0.8710
## rm age dis rad
## Min. :3.863 Min. : 2.90 Min. : 1.130 Min. : 1.00
## 1st Qu.:5.887 1st Qu.: 43.88 1st Qu.: 2.101 1st Qu.: 4.00
## Median :6.210 Median : 77.15 Median : 3.191 Median : 5.00
## Mean :6.291 Mean : 68.37 Mean : 3.796 Mean : 9.53
## 3rd Qu.:6.630 3rd Qu.: 94.10 3rd Qu.: 5.215 3rd Qu.:24.00
## Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.00
## tax ptratio lstat medv
## Min. :187.0 Min. :12.6 Min. : 1.730 Min. : 5.00
## 1st Qu.:281.0 1st Qu.:16.9 1st Qu.: 7.043 1st Qu.:17.02
## Median :334.5 Median :18.9 Median :11.350 Median :21.20
## Mean :409.5 Mean :18.4 Mean :12.631 Mean :22.59
## 3rd Qu.:666.0 3rd Qu.:20.2 3rd Qu.:16.930 3rd Qu.:25.00
## Max. :711.0 Max. :22.0 Max. :37.970 Max. :50.00
## target
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4914
## 3rd Qu.:1.0000
## Max. :1.0000
After further investigation of the distributions of the variables with histograms, several potential issues were highlighted. Most variables are clearly not normally distributed, since we see left skew in the variable age, and right skew dis, nox, and lstat. Potentially more significant, however, are the variables were certain values are represented much more than others, and I am not referring to chas, since there are only two possible categories, but rather zn, tax, ptratio, and indus, so I had to look more closely at the distributions within these variables.
Looking at the correlation matrix of all of the numerical variables, many strong correlations were revealed. Dis in particular seems to have strong correlations, positive and negative, with many of the other variables. Focusing on correlations with the target, nox, age, rad, tax, and indus all have a moderately strong positive correlation (r > .6), and dis has a moderately strong negative correlation with the target (r < -.6). Looking at the matrix, there are combinations of these variables that are highly correlated with each other, so avoiding redundancy will be something to consider when finalizing the model.
These variables with the strongrest correlations to the target have their distributions visualized in these box plots. Visually, there does seem to be some distinction between the two levels for the first four variables, as expected with the correlation values. However for the last two, which had distributions with one value appearing more frequently than the rest.
## target nox age rad tax indus
## 1.00000000 0.72610622 0.63010625 0.62810492 0.61111331 0.60485074
## lstat ptratio chas rm medv zn
## 0.46912702 0.25084892 0.08004187 -0.15255334 -0.27055071 -0.43168176
## dis
## -0.61867312
## Data Preparation Drilling down on each of these two variables was
particularly insightful. Starting with tax, there is gap between around
450 to 650, followed by a cluster of all positive target values just
above 650 and a smaller cluster of negative target values above 700. As
for rad, this sort of gap is more pronounced, with no data appearing
between 8 and 24 and a large cluster of all positive target values at
24. This shows that neither of these variables should be treated as
continuous variables because there is clustered differences present make
the variable act more like a categorical one. This led me to investigate
the other variables whose distributions I had noted earlier. Of these
variables, indus was another that seemed like it should be treated as a
categorical variable with a cut off around 18 since there is another
cluster of all target variables appearing around that value, with a
smaller cluster of non-target ones beyond it. The teacher pupil ratio
distribution seems like there are two clusters of positive target
values, one just under 15 and one just above 20, but with the presence
of some values between them, I am less confident in binning this
variable. The final variable investiaged here, zn, showed the strong
presence of zero values in both positive and negative target areas.
These would be neighborhoods with no large residential lots, so perhaps
this will not be a significant contributor to a model .
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : pseudoinverse used at -0.5
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : neighborhood radius 13
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : reciprocal condition number 0
## Warning in simpleLoess(y, x, w, span, degree = degree, parametric = parametric,
## : There are other near singularities as well. 156.25
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : pseudoinverse used at
## -0.5
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : neighborhood radius 13
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : reciprocal condition
## number 0
## Warning in predLoess(object$y, object$x, newx = if (is.null(newdata)) object$x
## else if (is.data.frame(newdata))
## as.matrix(model.frame(delete.response(terms(object)), : There are other near
## singularities as well. 156.25
The next group of variables to assess are the ones that seemed to be skewed from the histograms to see if transformations would be appropriate. Zn was a good place to start since I had just confirmed the left skew in the variable, and applied a log transformation (or rather a log transformation on zn + 1 since the variable had zero as a value).
When it comes to age as a variable, I wanted to keep multiple options. The data show a concentration of high-crime neighborhoods at the oldest end of the age proportion spectrum. This suggests that binning age could be useful, as it allows the model to capture stepwise differences in crime risk that a continuous approach might miss. The other option I wanted to consider was age as a quadratic term, since there might be a nonlinear relationship between the variables. However, if I’m going to consider age as a quadratic term, I need to center the variable to reduce collinearity with the linear term within the model.
## `geom_smooth()` using formula = 'y ~ x'
In the last model, age hardly contributed to the model, so I wanted to try a different way of approaching the variable. I used a bucketed version in the previous model due to that prevalent block of positive target cases with high proportions, but this time I am approaching the potentially non-linear relationship with a quadratic term, as well as a centered version of the original term to avoid redundancy between the two variables. I had debated leaving out the other variables that were not significant in the previous model, but I figured if they were not significant, they would not make it through the stepwise selection process, so essentially all variables in their most robust form were included.
##
## ### Model 2 (stepAIC) chosen formula ###
## target ~ nox + medv + dis + zn_log + chas + rad_group + age_c +
## age_c_sq + ptratio_high
##
## Model 2 summary:
##
## Call:
## glm(formula = target ~ nox + medv + dis + zn_log + chas + rad_group +
## age_c + age_c_sq + ptratio_high, family = binomial, data = crime_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.766e+01 4.051e+00 -6.828 8.63e-12 ***
## nox 3.790e+01 5.812e+00 6.521 7.00e-11 ***
## medv 1.333e-01 2.958e-02 4.507 6.58e-06 ***
## dis 7.948e-01 2.071e-01 3.838 0.000124 ***
## zn_log -6.204e-01 2.064e-01 -3.006 0.002647 **
## chas 1.382e+00 6.940e-01 1.991 0.046516 *
## rad_groupHigh 1.756e+00 3.871e-01 4.537 5.71e-06 ***
## age_c 3.083e-02 1.043e-02 2.957 0.003102 **
## age_c_sq 4.651e-04 3.044e-04 1.528 0.126491
## ptratio_high 1.386e+00 4.229e-01 3.278 0.001046 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 227.27 on 456 degrees of freedom
## AIC: 247.27
##
## Number of Fisher Scoring iterations: 7
##
## VIFs for Model 2:
## nox medv dis zn_log chas rad_group
## 3.342858 1.969718 4.477564 1.978132 1.104132 1.109812
## age_c age_c_sq ptratio_high
## 1.958000 1.303163 1.501017
##
## --- Model 2 Confusion Matrix (cutoff = 0.5 )---
## Actual
## Predicted 0 1
## 0 211 23
## 1 26 206
## Sensitivity Specificity Pos Pred Value
## 0.8995633 0.8902954 0.8879310
## Neg Pred Value Precision Recall
## 0.9017094 0.8879310 0.8995633
## F1 Prevalence Detection Rate
## 0.8937093 0.4914163 0.4420601
## Detection Prevalence Balanced Accuracy
## 0.4978541 0.8949293
## Accuracy: 0.8948498
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## AUC: 0.9625965
## Brier score: 0.0753
## # A tibble: 10 × 4
## dec mean_prob observed n
## <int> <dbl> <dbl> <int>
## 1 1 0.00269 0.0213 47
## 2 2 0.0122 0 47
## 3 3 0.0420 0.0638 47
## 4 4 0.115 0.106 47
## 5 5 0.334 0.298 47
## 6 6 0.631 0.596 47
## 7 7 0.848 0.870 46
## 8 8 0.972 1 46
## 9 9 0.997 1 46
## 10 10 1.000 1 46
The stepwise model refines the predictors and confirms several key relationships: pollution and highway access as seen prior, but adds inome and educational strain are major predictors of high crime neighborhoods, while areas with more single-family housing zoning having the strongest negative impact on the prediction. As for the building age, the centered version was significant, but not the squared term.
This model takes all of the significant variables from the previous model and adds three interaction terms to be test for additional explanatory power through a forward selection process. I looked at some of the strongest predictors in the previous two models, and thought what other variable might influence it. For instance, rad_group, or highway access might affect neighborhoods of different levels of wealth (medv) or industry (indus_high) in terms of accessibility and crime. The other example is the possible interaction between pollution (nox) and distance to job centers (dis), since there may be different sources of pollution in urban vs suburban environments.
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Start: AIC=647.88
## target ~ 1
##
## Df Deviance AIC
## + nox 1 292.01 296.01
## + dis 1 409.50 413.50
## + age_c 1 424.75 428.75
## + ptratio_high 1 490.31 494.31
## + zn_log 1 529.12 533.12
## + rad_group 1 543.52 547.52
## + medv 1 609.62 613.62
## + chas 1 642.86 646.86
## <none> 645.88 647.88
##
## Step: AIC=296.01
## target ~ nox
##
## Df Deviance AIC
## + rad_group 1 273.42 279.42
## + medv 1 285.86 291.86
## + ptratio_high 1 286.94 292.94
## + chas 1 288.47 294.47
## + zn_log 1 289.05 295.05
## <none> 292.01 296.01
## + age_c 1 290.62 296.62
## + dis 1 290.91 296.91
##
## Step: AIC=279.42
## target ~ nox + rad_group
##
## Df Deviance AIC
## + rad_group:indus_high 2 257.74 267.74
## + ptratio_high 1 267.13 275.13
## + medv 1 269.27 277.27
## + zn_log 1 270.46 278.46
## + chas 1 270.51 278.51
## + dis 1 271.21 279.21
## + age_c 1 271.30 279.30
## <none> 273.42 279.42
##
## Step: AIC=267.74
## target ~ nox + rad_group + rad_group:indus_high
##
## Df Deviance AIC
## + medv 1 250.26 262.26
## + chas 1 253.58 265.58
## + age_c 1 254.25 266.25
## + zn_log 1 255.43 267.43
## <none> 257.74 267.74
## + dis 1 256.20 268.20
## + ptratio_high 1 256.49 268.49
##
## Step: AIC=262.26
## target ~ nox + rad_group + medv + rad_group:indus_high
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## + medv:rad_group 1 214.94 228.94
## + ptratio_high 1 244.63 258.63
## + age_c 1 245.00 259.00
## + zn_log 1 246.07 260.07
## + dis 1 246.33 260.33
## + chas 1 247.17 261.17
## <none> 250.26 262.26
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=228.94
## target ~ nox + rad_group + medv + rad_group:indus_high + rad_group:medv
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## + zn_log 1 208.89 224.89
## + chas 1 210.02 226.02
## + age_c 1 210.47 226.47
## + dis 1 212.77 228.77
## <none> 214.94 228.94
## + ptratio_high 1 214.58 230.58
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=224.89
## target ~ nox + rad_group + medv + zn_log + rad_group:indus_high +
## rad_group:medv
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## + dis 1 201.57 219.57
## + age_c 1 205.25 223.25
## + chas 1 205.26 223.26
## <none> 208.89 224.89
## + ptratio_high 1 208.84 226.84
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=219.57
## target ~ nox + rad_group + medv + zn_log + dis + rad_group:indus_high +
## rad_group:medv
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## + age_c 1 193.11 213.11
## + chas 1 197.67 217.67
## <none> 201.57 219.57
## + nox:dis 1 201.27 221.27
## + ptratio_high 1 201.57 221.57
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=213.11
## target ~ nox + rad_group + medv + zn_log + dis + age_c + rad_group:indus_high +
## rad_group:medv
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## + chas 1 191.00 213.00
## <none> 193.11 213.11
## + ptratio_high 1 193.10 215.10
## + nox:dis 1 193.11 215.11
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Step: AIC=213
## target ~ nox + rad_group + medv + zn_log + dis + age_c + chas +
## rad_group:indus_high + rad_group:medv
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Df Deviance AIC
## <none> 191.00 213.00
## + ptratio_high 1 190.86 214.86
## + nox:dis 1 191.00 215.00
##
## Call:
## glm(formula = target ~ nox + rad_group + medv + zn_log + dis +
## age_c + chas + rad_group:indus_high + rad_group:medv, family = binomial,
## data = crime_train)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -30.02206 4.72225 -6.358 2.05e-10 ***
## nox 48.13096 7.39066 6.512 7.40e-11 ***
## rad_groupHigh -10.65965 2.64954 -4.023 5.74e-05 ***
## medv 0.05783 0.03158 1.831 0.067075 .
## zn_log -0.76526 0.25437 -3.008 0.002626 **
## dis 0.74902 0.22631 3.310 0.000934 ***
## age_c 0.02844 0.01142 2.489 0.012794 *
## chas 1.07228 0.74577 1.438 0.150486
## rad_groupLow_Mid:indus_high -1.95995 0.68802 -2.849 0.004390 **
## rad_groupHigh:indus_high 21.87484 1208.93050 0.018 0.985564
## rad_groupHigh:medv 0.46455 0.10955 4.240 2.23e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 191.00 on 455 degrees of freedom
## AIC: 213
##
## Number of Fisher Scoring iterations: 19
##
## --- Model 3 Confusion Matrix (cutoff = 0.5 )---
## Actual
## Predicted 0 1
## 0 215 19
## 1 22 210
## Sensitivity Specificity Pos Pred Value
## 0.9170306 0.9071730 0.9051724
## Neg Pred Value Precision Recall
## 0.9188034 0.9051724 0.9170306
## F1 Prevalence Detection Rate
## 0.9110629 0.4914163 0.4506438
## Detection Prevalence Balanced Accuracy
## 0.4978541 0.9121018
## Accuracy: 0.9120172
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## AUC: 0.9744256
## Brier score: 0.0614
## # A tibble: 10 × 4
## dec mean_prob observed n
## <int> <dbl> <dbl> <int>
## 1 1 0.000465 0 47
## 2 2 0.00257 0.0213 47
## 3 3 0.0214 0.0426 47
## 4 4 0.113 0.0426 47
## 5 5 0.312 0.298 47
## 6 6 0.620 0.638 47
## 7 7 0.885 0.913 46
## 8 8 1.000 1 46
## 9 9 1.000 1 46
## 10 10 1.000 1 46
The model shows that pollution (nox) remains the strongest predictor of high-crime areas, while large-lot zoning (zn_log) continues to have a protective effect. The inclusion of interaction terms reveals that the effect of industry (indus_high) depends on highway access. Low or moderate-access industrial zones are less likely to be high-crime areas. Additionally, the relationship between housing value and crime is amplified in areas with high accessibility. Since it has been significant in each model, it is worth mentioning that this value is not actually meaningful given the scales of the variables.
## Model AIC AUC Brier
## 1 Model 1 277.9152 0.9499659 0.09085994
## 2 Model 2 247.2653 0.9625965 0.07526229
## 3 Model 3 212.9962 0.9744256 0.06138835
##
## Hosmer-Lemeshow p-values (larger p > 0.05 indicates no evidence of poor fit):
## [1] "Model1:"
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: crime_train$target, fitted(m1)
## X-squared = 16.235, df = 8, p-value = 0.03914
## [1] "Model2:"
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: crime_train$target, fitted(m2_step)
## X-squared = 9.723, df = 8, p-value = 0.285
## [1] "Model3:"
##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: crime_train$target, fitted(m_full)
## X-squared = 8.393, df = 8, p-value = 0.3961
To determine the best model, I wanted to consider both predictive performance metrics and model interpretability. Looking at these results, model three is the best performing in every metric. It has the lowest AIC value, indicating the best balance between fit and complexity, as well as the lowest Brier score, indicating the best accuracy. It also has the highest AUC score, highlighting its discriminating ability, as well as the highest probability on the Hosmer-Lemeshow test, indicating the predicted probabilities align well with the observed data. Looking across the confusion matrices provided for each model, it is clear that model 3 contained the fewest errors. As for interpretability, model three has its complexity, particularly with three interaction terms among its eleven coefficients, two of which are categories of the same variable, one significant and one not. While this is not a simple model, each variable was included due to statistical significance or theorized relevance. The interaction terms added capture differences missed in the simpler models, evidenced by this model’s superior accuracy.
| target |
|---|
| 0 |
| 1 |
| 1 |
| 0 |
| 0 |
| 0 |
| 0 |
| 0 |
| 0 |
| 0 |
library(tidyverse)
library(corrplot)
library(knitr)
library(car)
library(pROC)
library(MASS)
# Data Exploration
# Load data
crime_train <- read.csv("https://raw.githubusercontent.com/ddebonis47/classwork/refs/heads/main/crime-training-data_modified.csv")
crime_eval <- read.csv("https://raw.githubusercontent.com/ddebonis47/classwork/refs/heads/main/crime-evaluation-data_modified.csv")
numeric_vars <- crime_train |>
select(where(is.numeric))
numeric_long <- numeric_vars |>
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
# Plot all distributions in a grid
ggplot(numeric_long, aes(x = Value)) +
geom_histogram(bins = 30, fill = "#4E79A7", color = "white", alpha = 0.7) +
facet_wrap(~Variable, scales = "free", ncol =4) +
labs(title = "Distribution of Numeric Variables",
x = "Value", y = "Count") +
theme_minimal()
cor_matrix <- cor(numeric_vars, use = "pairwise.complete.obs")
corrplot(cor_matrix, method = "color", tl.col = "black", type = "lower")
corr_threshold <- 0.6
target_corr <- cor(crime_train %>% select(where(is.numeric)), use = "pairwise.complete.obs")["target", ]
print(sort(target_corr, decreasing = TRUE))
boxplot_vars <- names(target_corr[abs(target_corr) > corr_threshold & names(target_corr) != "target"])
crime_train_long <- crime_train |>
select(all_of(boxplot_vars), target) |>
pivot_longer(-target, names_to = "Variable", values_to = "Value")
ggplot(crime_train_long, aes(x = factor(target), y = Value, fill = factor(target))) +
geom_boxplot(outlier.shape = NA, alpha = 0.7) +
facet_wrap(~Variable, scales = "free", ncol = 2) +
scale_fill_manual(values = c("#4E79A7", "#E15759"),
labels = c("Low Crime", "High Crime")) +
labs(x = "Crime Risk (Target)", y = "Value",
title = "Distribution of Key Predictors by Crime Risk") +
theme_minimal()
ggplot(crime_train, aes(x = tax, y = target)) +
geom_jitter(height = 0.05, alpha = 0.4) +
geom_smooth(method = "loess", color = "red") +
labs(title = "Relationship between Property Tax and Crime Risk")
ggplot(crime_train, aes(x = rad, y = target)) +
geom_jitter(height = 0.05, alpha = 0.4) +
geom_smooth(method = "loess", color = "blue") +
labs(title = "Relationship between Highway Access and Crime Risk")
ggplot(crime_train, aes(x = indus, y = target)) +
geom_jitter(height = 0.05, alpha = 0.4) +
geom_smooth(method = "loess", color = "green") +
labs(title = "Relationship between Industry Presence and Crime Risk")
ggplot(crime_train, aes(x = ptratio, y = target)) +
geom_jitter(height = 0.05, alpha = 0.4) +
geom_smooth(method = "loess", color = "brown") +
labs(title = "Relationship between Pupil:Teacher Ratio and Crime Risk")
ggplot(crime_train, aes(x = zn, y = target)) +
geom_jitter(height = 0.05, alpha = 0.4) +
geom_smooth(method = "loess", color = "purple") +
labs(title = "Relationship between Large Lot Zoning and Crime Risk")
# Data Transformation
# zn
crime_train <- crime_train |>
mutate(
zn_log = log1p(zn) # compress large values, handle zeros
)
# rad
crime_train$rad_group <- cut(crime_train$rad,
breaks = c(-Inf, 5, 24, Inf),
labels = c("Low_Mid", "High", "Extreme"))
crime_train$rad_group <- factor(crime_train$rad_group)
# tax
crime_train <- crime_train |>
mutate(tax_high = ifelse(tax > 600, 1, 0))
# indus
crime_train <- crime_train |>
mutate(indus_high = ifelse(indus > 18, 1, 0))
# ptratio
crime_train <- crime_train |>
mutate(ptratio_high = ifelse(ptratio > 20, 1, 0))
# rooms_per_age: rooms per age ratio
crime_train <- crime_train |>
mutate(rooms_per_age = rm / (age + 1))
#age
ggplot(crime_train, aes(x = age, y = target)) +
geom_jitter(height = 0.05, alpha = 0.4) +
geom_smooth(method = "loess", color = "purple") +
labs(title = "Relationship between Housing Age and Crime Risk")
# Options for crime_train$age
# Apply log transformation
crime_train$age_log <- log1p(crime_train$age)
# Or apply square root
crime_train$age_sqrt <- sqrt(crime_train$age)
# Binned
crime_train$age_bin <- cut(crime_train$age,
breaks = c(0, 30, 60, 100),
labels = c("Low", "Medium", "High"))
# Quadratic term
crime_train <- crime_train |>
mutate(age_sq = age^2)
# Cenetered
crime_train$age_c <- crime_train$age - mean(crime_train$age)
crime_train$age_c_sq <- crime_train$age_c^2
# Build Models
#evaluation
eval_model <- function(model, data, cutoff = 0.5, label = "Model") {
# Predictions and probabilities
probs <- predict(model, newdata = data, type = "response")
preds <- ifelse(probs >= cutoff, 1, 0)
# Confusion matrix
cm <- table(Predicted = preds, Actual = data$target)
cat("\n---", label, "Confusion Matrix (cutoff =", cutoff, ")---\n")
print(cm)
# Metrics
cmobj <- confusionMatrix(as.factor(preds), as.factor(data$target), positive = "1")
print(cmobj$byClass) # sensitivity, specificity, etc
cat("Accuracy:", cmobj$overall["Accuracy"], "\n")
# AUC
rocobj <- roc(data$target, probs)
cat("AUC:", auc(rocobj), "\n")
# Brier score
brier <- mean((probs - data$target)^2)
cat("Brier score:", round(brier,4), "\n")
# Calibration by decile
data_cal <- data.frame(prob = probs, actual = data$target) %>%
mutate(dec = ntile(prob, 10)) %>%
group_by(dec) %>%
summarise(mean_prob = mean(prob), observed = mean(actual), n = n())
print(data_cal)
# return list
list(confusion = cm, caret = cmobj, roc = rocobj, brier = brier, cal_table = data_cal)
}
# Odds ratio helper
print_or <- function(model) {
coefs <- coef(summary(model))
OR <- exp(coef(model))
ci <- exp(confint(model))
or_table <- data.frame(Estimate = coef(model), OR = OR, CI_low = ci[,1], CI_high = ci[,2], p_value = coefs[,4])
return(or_table)
}
## model 1
m1_formula <- as.formula("target ~ nox + tax_high + indus + dis + rad_group + age_bin")
m1 <- glm(m1_formula, data = crime_train, family = binomial)
cat("\n### Model 1 summary ###\n")
print(summary(m1))
cat("\nVIFs for Model 1:\n")
print(vif(m1))
m1_eval <- eval_model(m1, crime_train, cutoff = 0.5, label = "Model 1")
## model 2
m2_full_formula <- as.formula(
"target ~ nox + lstat + rm + medv + dis + zn_log + chas + rad_group + age_c + age_c_sq + tax_high + ptratio_high + indus_high + rooms_per_age"
)
m2_full <- glm(m2_full_formula, data = crime_train, family = binomial)
# Use stepAIC both directions to select a model by AIC
m2_step <- stepAIC(m2_full, direction = "both", trace = FALSE)
cat("\n### Model 2 (stepAIC) chosen formula ###\n")
print(formula(m2_step))
cat("\nModel 2 summary:\n")
print(summary(m2_step))
cat("\nVIFs for Model 2:\n")
print(vif(m2_step))
m2_eval <- eval_model(m2_step, crime_train, cutoff = 0.5, label = "Model 2")
## model 3
# Start with an empty model (intercept only)
m0 <- glm(target ~ 1, data = crime_train, family = binomial)
# Full model with all candidate predictors
m_full <- glm(target ~ medv + nox + dis + zn_log + chas + rad_group +
age_c + ptratio_high + rad_group:indus_high + nox:dis + rad_group:medv,
data = crime_train, family = binomial)
# Forward selection based on AIC
m_forward <- stepAIC(m0,
scope = list(lower = m0, upper = m_full),
direction = "forward",
trace = TRUE)
summary(m_forward)
m3_eval <- eval_model(m_forward, crime_train, cutoff = 0.5, label = "Model 3")
# Select Model
comparison <- data.frame(
Model = c("Model 1", "Model 2", "Model 3"),
AIC = c(AIC(m1), AIC(m2_step), AIC(m_forward)),
AUC = c(auc(m1_eval$roc), auc(m2_eval$roc), auc(m3_eval$roc)),
Brier = c(m1_eval$brier, m2_eval$brier, m3_eval$brier)
)
print(comparison)
cat("\nHosmer-Lemeshow p-values (larger p > 0.05 indicates no evidence of poor fit):\n")
print("Model1:")
print(hoslem.test(crime_train$target, fitted(m1), g=10))
print("Model2:")
print(hoslem.test(crime_train$target, fitted(m2_step), g=10))
print("Model3:")
print(hoslem.test(crime_train$target, fitted(m_full), g=10))
## Adding derived variables to crime_eval
# zn
crime_eval <- crime_eval |>
mutate(
zn_log = log1p(zn) # compress large values, handle zeros
)
# rad
crime_eval$rad_group <- cut(crime_eval$rad,
breaks = c(-Inf, 5, 24, Inf),
labels = c("Low_Mid", "High", "Extreme"))
crime_eval$rad_group <- factor(crime_eval$rad_group)
# tax
crime_eval <- crime_eval |>
mutate(tax_high = ifelse(tax > 600, 1, 0))
# indus
crime_eval <- crime_eval |>
mutate(indus_high = ifelse(indus > 18, 1, 0))
# ptratio
crime_eval <- crime_eval |>
mutate(ptratio_high = ifelse(ptratio > 20, 1, 0))
# age: centered
crime_eval$age_c <- crime_eval$age - mean(crime_eval$age)
pred_probs <- predict(m_full, newdata = crime_eval, type = "response")
pred_class <- ifelse(pred_probs > 0.5, 1, 0)
crime_eval$target <- pred_class
kable(head(crime_eval["target"], 10),
caption = "Predicted classes for the evaluation dataset")
write.csv(crime_eval, "crime_eval_predictions.csv", row.names = FALSE)