Assignment 3 Report

Daniel DeBonis

Data Preparation

The data set used in this modeling is composed of variables that could potentially be used to predict whether a neighborhood in an American metropolitan area (likely Boston, since one variable refers to the Charles River) has a crime rate above the median. These variables were chosen to provide insight into the character of a neighborhood and its residents, including scientific measurements, such as nitrogen oxide concentration, as well as descriptive statistics, such as the proportion of non-retail businesses or the average number of rooms per dwelling. One positive aspect of this data set is that there are no missing data, so questions of imputation never arose. Most of the variables are numeric, with only the target variable and chas (adjacency to the Charles River) being binary coded categorical variables.
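As a quick sanity check (a minimal sketch, assuming the training data are loaded as crime_train, as in the appendix), the absence of missing values and the 0/1 coding can be confirmed directly:

# Count missing values per column; all zeros means no imputation is needed
colSums(is.na(crime_train))

# Cross-tabulate the two binary variables to verify their 0/1 coding
table(crime_train$chas, crime_train$target)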

The table below provides the summary statistics (five-number summary plus the mean) for all variables.

##        zn             indus             chas              nox        
##  Min.   :  0.00   Min.   : 0.460   Min.   :0.00000   Min.   :0.3890  
##  1st Qu.:  0.00   1st Qu.: 5.145   1st Qu.:0.00000   1st Qu.:0.4480  
##  Median :  0.00   Median : 9.690   Median :0.00000   Median :0.5380  
##  Mean   : 11.58   Mean   :11.105   Mean   :0.07082   Mean   :0.5543  
##  3rd Qu.: 16.25   3rd Qu.:18.100   3rd Qu.:0.00000   3rd Qu.:0.6240  
##  Max.   :100.00   Max.   :27.740   Max.   :1.00000   Max.   :0.8710  
##        rm             age              dis              rad       
##  Min.   :3.863   Min.   :  2.90   Min.   : 1.130   Min.   : 1.00  
##  1st Qu.:5.887   1st Qu.: 43.88   1st Qu.: 2.101   1st Qu.: 4.00  
##  Median :6.210   Median : 77.15   Median : 3.191   Median : 5.00  
##  Mean   :6.291   Mean   : 68.37   Mean   : 3.796   Mean   : 9.53  
##  3rd Qu.:6.630   3rd Qu.: 94.10   3rd Qu.: 5.215   3rd Qu.:24.00  
##  Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :24.00  
##       tax           ptratio         lstat             medv      
##  Min.   :187.0   Min.   :12.6   Min.   : 1.730   Min.   : 5.00  
##  1st Qu.:281.0   1st Qu.:16.9   1st Qu.: 7.043   1st Qu.:17.02  
##  Median :334.5   Median :18.9   Median :11.350   Median :21.20  
##  Mean   :409.5   Mean   :18.4   Mean   :12.631   Mean   :22.59  
##  3rd Qu.:666.0   3rd Qu.:20.2   3rd Qu.:16.930   3rd Qu.:25.00  
##  Max.   :711.0   Max.   :22.0   Max.   :37.970   Max.   :50.00  
##      target      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4914  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

Further investigation of the variables' distributions with histograms highlighted several potential issues. Most variables are clearly not normally distributed: age shows left skew, while dis, nox, and lstat show right skew. Potentially more significant, however, are the variables where certain values are represented far more often than others. This is not a reference to chas, which has only two possible categories, but to zn, tax, ptratio, and indus, whose distributions warranted a closer look.

The correlation matrix of the numerical variables reveals many strong correlations. dis in particular correlates strongly, both positively and negatively, with many of the other variables. Focusing on correlations with the target, nox, age, rad, tax, and indus all have a moderately strong positive correlation (r > .6), and dis has a moderately strong negative correlation with the target (r < -.6). Because several of these predictors are also highly correlated with each other, avoiding redundancy will be something to consider when finalizing the model.
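The correlations reported below were computed along these lines (mirroring the appendix code):

# Pairwise correlations among the numeric columns, then the target row sorted
cor_matrix <- cor(crime_train |> select(where(is.numeric)), use = "pairwise.complete.obs")
sort(cor_matrix["target", ], decreasing = TRUE)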

The variables with the strongest correlations to the target have their distributions visualized in box plots. Visually, there does seem to be some separation between the two target levels for the first four variables, as the correlation values would suggest. The last two, however, had distributions dominated by a single frequent value, which makes their box plots harder to read.

##      target         nox         age         rad         tax       indus 
##  1.00000000  0.72610622  0.63010625  0.62810492  0.61111331  0.60485074 
##       lstat     ptratio        chas          rm        medv          zn 
##  0.46912702  0.25084892  0.08004187 -0.15255334 -0.27055071 -0.43168176 
##         dis 
## -0.61867312

Drilling down on each of these two variables was particularly insightful. Starting with tax, there is a gap from around 450 to 650, followed by a cluster of all positive target values just above 650 and a smaller cluster of negative target values above 700. For rad, this sort of gap is even more pronounced, with no data between 8 and 24 and a large cluster of all positive target values at 24. Neither of these variables should be treated as continuous, because the clustered differences make each behave more like a categorical variable. This led me to investigate the other variables whose distributions I had noted earlier. Of these, indus also seemed like it should be treated as categorical, with a cutoff around 18, since there is another cluster of all positive target values around that value, with a smaller cluster of negative ones beyond it. The pupil-teacher ratio distribution appears to have two clusters of positive target values, one just under 15 and one just above 20, but with some values between them, I am less confident in binning this variable. The final variable investigated here, zn, showed a strong concentration of zero values in both positive and negative target areas. These would be neighborhoods with no large residential lots, so perhaps this variable will not be a significant contributor to a model.
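These observations translate directly into the binned versions of the variables used in the models (as implemented in the appendix):

# Dichotomize tax at the gap near 600 and indus at the cluster near 18
crime_train <- crime_train |>
  mutate(tax_high     = ifelse(tax > 600, 1, 0),
         indus_high   = ifelse(indus > 18, 1, 0),
         ptratio_high = ifelse(ptratio > 20, 1, 0))

# Bin rad around the gap between 8 and 24
crime_train$rad_group <- cut(crime_train$rad,
                             breaks = c(-Inf, 5, 24, Inf),
                             labels = c("Low_Mid", "High", "Extreme"))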


The next group of variables to assess were those whose histograms appeared skewed, to see whether transformations would be appropriate. zn was a good place to start, since I had just confirmed its heavy concentration at zero and long right tail, and I applied a log transformation (or rather a log transformation of zn + 1, since the variable takes zero as a value).
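The transformation itself is a single call; log1p() computes log(zn + 1), which is defined at zero:

# Compress the long right tail while handling the many zero values
crime_train <- crime_train |>
  mutate(zn_log = log1p(zn))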

When it comes to age as a variable, I wanted to keep multiple options. The data show a concentration of high-crime neighborhoods at the oldest end of the age proportion spectrum. This suggests that binning age could be useful, as it allows the model to capture stepwise differences in crime risk that a continuous approach might miss. The other option I wanted to consider was age as a quadratic term, since there might be a nonlinear relationship between the variables. However, if I’m going to consider age as a quadratic term, I need to center the variable to reduce collinearity with the linear term within the model.
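Both options are constructed in the appendix; the centered quadratic and the binned alternative look like this:

# Center age, then square the centered term to limit collinearity with it
crime_train$age_c    <- crime_train$age - mean(crime_train$age)
crime_train$age_c_sq <- crime_train$age_c^2

# Binned alternative used in Model 1
crime_train$age_bin <- cut(crime_train$age,
                           breaks = c(0, 30, 60, 100),
                           labels = c("Low", "Medium", "High"))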


Building Models

Model One - Using Highly Correlated Variables

For the first binary logistic regression model, I wanted to use the variables most strongly correlated with the target to see how they contribute to a model. This was intended to serve as a baseline for further decision making. Correlation-wise, these were the most likely candidates to explain the variance, so I needed to know whether any had particular predictive potential.
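The fit itself is a standard binomial glm() (the evaluation helper is in the appendix):

# Model 1: predictors most strongly correlated with the target
m1 <- glm(target ~ nox + tax_high + indus + dis + rad_group + age_bin,
          data = crime_train, family = binomial)
summary(m1)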

## 
## ### Model 1 summary ###
## 
## Call:
## glm(formula = m1_formula, family = binomial, data = crime_train)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -21.55880    3.33276  -6.469 9.88e-11 ***
## nox            38.79708    6.06676   6.395 1.61e-10 ***
## tax_high        1.01094    0.68798   1.469  0.14172    
## indus          -0.10314    0.04244  -2.430  0.01510 *  
## dis             0.25844    0.14813   1.745  0.08104 .  
## rad_groupHigh   1.20282    0.39096   3.077  0.00209 ** 
## age_binMedium  -0.30299    0.73921  -0.410  0.68189    
## age_binHigh     0.47721    0.71952   0.663  0.50718    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 261.92  on 458  degrees of freedom
## AIC: 277.92
## 
## Number of Fisher Scoring iterations: 7
## 
## VIFs for Model 1:
##               GVIF Df GVIF^(1/(2*Df))
## nox       4.192677  1        2.047603
## tax_high  1.689289  1        1.299727
## indus     3.247811  1        1.802168
## dis       2.555310  1        1.598534
## rad_group 1.312791  1        1.145771
## age_bin   1.510851  2        1.108678
## 
## --- Model 1 Confusion Matrix (cutoff = 0.5 )---
##          Actual
## Predicted   0   1
##         0 206  37
##         1  31 192
##          Sensitivity          Specificity       Pos Pred Value 
##            0.8384279            0.8691983            0.8609865 
##       Neg Pred Value            Precision               Recall 
##            0.8477366            0.8609865            0.8384279 
##                   F1           Prevalence       Detection Rate 
##            0.8495575            0.4914163            0.4120172 
## Detection Prevalence    Balanced Accuracy 
##            0.4785408            0.8538131 
## Accuracy: 0.8540773
## AUC: 0.9499659 
## Brier score: 0.0909 
## # A tibble: 10 × 4
##      dec mean_prob observed     n
##    <int>     <dbl>    <dbl> <int>
##  1     1   0.00934   0         47
##  2     2   0.0253    0.0213    47
##  3     3   0.0580    0.0426    47
##  4     4   0.168     0.149     47
##  5     5   0.368     0.426     47
##  6     6   0.570     0.723     47
##  7     7   0.790     0.609     46
##  8     8   0.966     0.978     46
##  9     9   0.998     1         46
## 10    10   1.000     1         46

Overall, this first model serves as a respectable point of comparison because it performed moderately well. However, not all of the variables contribute significantly. The three statistically significant contributors were nox, rad_group, and indus; in this model, pollution and accessibility emerge as the strongest predictors of high-crime areas. The model has some predictive power going by the accuracy statistics, but hopefully the introduction of interaction or non-linear terms will improve it. More importantly, I want to compare it against a more systematic variable selection process.

Model 2 - Stepwise Variable Selection

In the last model, age hardly contributed, so I wanted to try a different approach to the variable. I used a bucketed version previously due to the prevalent block of positive target cases at high proportions; this time I approach the potentially non-linear relationship with a quadratic term, along with a centered version of the original term to reduce collinearity between the two. I had debated leaving out the other variables that were not significant in the previous model, but reasoned that if they were not significant, they would not survive the stepwise selection process, so essentially all variables, in their most robust forms, were included.
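The search starts from a full model and lets stepAIC() add and drop terms in both directions (as in the appendix):

# Full candidate model, then bidirectional stepwise selection by AIC
m2_full <- glm(target ~ nox + lstat + rm + medv + dis + zn_log + chas +
                 rad_group + age_c + age_c_sq + tax_high + ptratio_high +
                 indus_high + rooms_per_age,
               data = crime_train, family = binomial)
m2_step <- stepAIC(m2_full, direction = "both", trace = FALSE)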

## 
## ### Model 2 (stepAIC) chosen formula ###
## target ~ nox + medv + dis + zn_log + chas + rad_group + age_c + 
##     age_c_sq + ptratio_high
## 
## Model 2 summary:
## 
## Call:
## glm(formula = target ~ nox + medv + dis + zn_log + chas + rad_group + 
##     age_c + age_c_sq + ptratio_high, family = binomial, data = crime_train)
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -2.766e+01  4.051e+00  -6.828 8.63e-12 ***
## nox            3.790e+01  5.812e+00   6.521 7.00e-11 ***
## medv           1.333e-01  2.958e-02   4.507 6.58e-06 ***
## dis            7.948e-01  2.071e-01   3.838 0.000124 ***
## zn_log        -6.204e-01  2.064e-01  -3.006 0.002647 ** 
## chas           1.382e+00  6.940e-01   1.991 0.046516 *  
## rad_groupHigh  1.756e+00  3.871e-01   4.537 5.71e-06 ***
## age_c          3.083e-02  1.043e-02   2.957 0.003102 ** 
## age_c_sq       4.651e-04  3.044e-04   1.528 0.126491    
## ptratio_high   1.386e+00  4.229e-01   3.278 0.001046 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 227.27  on 456  degrees of freedom
## AIC: 247.27
## 
## Number of Fisher Scoring iterations: 7
## 
## VIFs for Model 2:
##          nox         medv          dis       zn_log         chas    rad_group 
##     3.342858     1.969718     4.477564     1.978132     1.104132     1.109812 
##        age_c     age_c_sq ptratio_high 
##     1.958000     1.303163     1.501017
## 
## --- Model 2 Confusion Matrix (cutoff = 0.5 )---
##          Actual
## Predicted   0   1
##         0 211  23
##         1  26 206
##          Sensitivity          Specificity       Pos Pred Value 
##            0.8995633            0.8902954            0.8879310 
##       Neg Pred Value            Precision               Recall 
##            0.9017094            0.8879310            0.8995633 
##                   F1           Prevalence       Detection Rate 
##            0.8937093            0.4914163            0.4420601 
## Detection Prevalence    Balanced Accuracy 
##            0.4978541            0.8949293 
## Accuracy: 0.8948498
## AUC: 0.9625965 
## Brier score: 0.0753 
## # A tibble: 10 × 4
##      dec mean_prob observed     n
##    <int>     <dbl>    <dbl> <int>
##  1     1   0.00269   0.0213    47
##  2     2   0.0122    0         47
##  3     3   0.0420    0.0638    47
##  4     4   0.115     0.106     47
##  5     5   0.334     0.298     47
##  6     6   0.631     0.596     47
##  7     7   0.848     0.870     46
##  8     8   0.972     1         46
##  9     9   0.997     1         46
## 10    10   1.000     1         46

The stepwise model refines the predictors and confirms several key relationships: pollution and highway access remain, as seen before, while housing wealth (medv) and educational strain (ptratio_high) emerge as major predictors of high-crime neighborhoods, and areas with more large-lot residential zoning have the strongest protective effect. As for building age, the centered version was significant but the squared term was not.

Model 3 - Testing Hypothesized Interactions in Forward Selection

This model takes all of the significant variables from the previous model and adds three interaction terms, to be tested for additional explanatory power through a forward selection process. I looked at the strongest predictors in the previous two models and considered which other variables might moderate them. For instance, highway access (rad_group) might affect neighborhoods differently depending on their wealth (medv) or industrial presence (indus_high). The other candidate is an interaction between pollution (nox) and distance to employment centers (dis), since the sources of pollution may differ between urban and suburban environments.
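Forward selection starts from the intercept-only model, with the main effects and hypothesized interactions defining the upper scope (as in the appendix):

# Intercept-only start and full scope including the three interactions
m0 <- glm(target ~ 1, data = crime_train, family = binomial)
m_full <- glm(target ~ medv + nox + dis + zn_log + chas + rad_group + age_c +
                ptratio_high + rad_group:indus_high + nox:dis + rad_group:medv,
              data = crime_train, family = binomial)
m_forward <- stepAIC(m0, scope = list(lower = m0, upper = m_full),
                     direction = "forward", trace = TRUE)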

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
## Start:  AIC=647.88
## target ~ 1
## 
##                Df Deviance    AIC
## + nox           1   292.01 296.01
## + dis           1   409.50 413.50
## + age_c         1   424.75 428.75
## + ptratio_high  1   490.31 494.31
## + zn_log        1   529.12 533.12
## + rad_group     1   543.52 547.52
## + medv          1   609.62 613.62
## + chas          1   642.86 646.86
## <none>              645.88 647.88
## 
## Step:  AIC=296.01
## target ~ nox
## 
##                Df Deviance    AIC
## + rad_group     1   273.42 279.42
## + medv          1   285.86 291.86
## + ptratio_high  1   286.94 292.94
## + chas          1   288.47 294.47
## + zn_log        1   289.05 295.05
## <none>              292.01 296.01
## + age_c         1   290.62 296.62
## + dis           1   290.91 296.91
## 
## Step:  AIC=279.42
## target ~ nox + rad_group
## 
##                        Df Deviance    AIC
## + rad_group:indus_high  2   257.74 267.74
## + ptratio_high          1   267.13 275.13
## + medv                  1   269.27 277.27
## + zn_log                1   270.46 278.46
## + chas                  1   270.51 278.51
## + dis                   1   271.21 279.21
## + age_c                 1   271.30 279.30
## <none>                      273.42 279.42
## 
## Step:  AIC=267.74
## target ~ nox + rad_group + rad_group:indus_high
## 
##                Df Deviance    AIC
## + medv          1   250.26 262.26
## + chas          1   253.58 265.58
## + age_c         1   254.25 266.25
## + zn_log        1   255.43 267.43
## <none>              257.74 267.74
## + dis           1   256.20 268.20
## + ptratio_high  1   256.49 268.49
## 
## Step:  AIC=262.26
## target ~ nox + rad_group + medv + rad_group:indus_high
##                  Df Deviance    AIC
## + medv:rad_group  1   214.94 228.94
## + ptratio_high    1   244.63 258.63
## + age_c           1   245.00 259.00
## + zn_log          1   246.07 260.07
## + dis             1   246.33 260.33
## + chas            1   247.17 261.17
## <none>                250.26 262.26
## 
## Step:  AIC=228.94
## target ~ nox + rad_group + medv + rad_group:indus_high + rad_group:medv
##                Df Deviance    AIC
## + zn_log        1   208.89 224.89
## + chas          1   210.02 226.02
## + age_c         1   210.47 226.47
## + dis           1   212.77 228.77
## <none>              214.94 228.94
## + ptratio_high  1   214.58 230.58
## 
## Step:  AIC=224.89
## target ~ nox + rad_group + medv + zn_log + rad_group:indus_high + 
##     rad_group:medv
##                Df Deviance    AIC
## + dis           1   201.57 219.57
## + age_c         1   205.25 223.25
## + chas          1   205.26 223.26
## <none>              208.89 224.89
## + ptratio_high  1   208.84 226.84
## 
## Step:  AIC=219.57
## target ~ nox + rad_group + medv + zn_log + dis + rad_group:indus_high + 
##     rad_group:medv
##                Df Deviance    AIC
## + age_c         1   193.11 213.11
## + chas          1   197.67 217.67
## <none>              201.57 219.57
## + nox:dis       1   201.27 221.27
## + ptratio_high  1   201.57 221.57
## 
## Step:  AIC=213.11
## target ~ nox + rad_group + medv + zn_log + dis + age_c + rad_group:indus_high + 
##     rad_group:medv
##                Df Deviance    AIC
## + chas          1   191.00 213.00
## <none>              193.11 213.11
## + ptratio_high  1   193.10 215.10
## + nox:dis       1   193.11 215.11
## 
## Step:  AIC=213
## target ~ nox + rad_group + medv + zn_log + dis + age_c + chas + 
##     rad_group:indus_high + rad_group:medv
##                Df Deviance    AIC
## <none>              191.00 213.00
## + ptratio_high  1   190.86 214.86
## + nox:dis       1   191.00 215.00
## 
## Call:
## glm(formula = target ~ nox + rad_group + medv + zn_log + dis + 
##     age_c + chas + rad_group:indus_high + rad_group:medv, family = binomial, 
##     data = crime_train)
## 
## Coefficients:
##                               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                  -30.02206    4.72225  -6.358 2.05e-10 ***
## nox                           48.13096    7.39066   6.512 7.40e-11 ***
## rad_groupHigh                -10.65965    2.64954  -4.023 5.74e-05 ***
## medv                           0.05783    0.03158   1.831 0.067075 .  
## zn_log                        -0.76526    0.25437  -3.008 0.002626 ** 
## dis                            0.74902    0.22631   3.310 0.000934 ***
## age_c                          0.02844    0.01142   2.489 0.012794 *  
## chas                           1.07228    0.74577   1.438 0.150486    
## rad_groupLow_Mid:indus_high   -1.95995    0.68802  -2.849 0.004390 ** 
## rad_groupHigh:indus_high      21.87484 1208.93050   0.018 0.985564    
## rad_groupHigh:medv             0.46455    0.10955   4.240 2.23e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 191.00  on 455  degrees of freedom
## AIC: 213
## 
## Number of Fisher Scoring iterations: 19
## 
## --- Model 3 Confusion Matrix (cutoff = 0.5 )---
##          Actual
## Predicted   0   1
##         0 215  19
##         1  22 210
##          Sensitivity          Specificity       Pos Pred Value 
##            0.9170306            0.9071730            0.9051724 
##       Neg Pred Value            Precision               Recall 
##            0.9188034            0.9051724            0.9170306 
##                   F1           Prevalence       Detection Rate 
##            0.9110629            0.4914163            0.4506438 
## Detection Prevalence    Balanced Accuracy 
##            0.4978541            0.9121018 
## Accuracy: 0.9120172
## AUC: 0.9744256 
## Brier score: 0.0614 
## # A tibble: 10 × 4
##      dec mean_prob observed     n
##    <int>     <dbl>    <dbl> <int>
##  1     1  0.000465   0         47
##  2     2  0.00257    0.0213    47
##  3     3  0.0214     0.0426    47
##  4     4  0.113      0.0426    47
##  5     5  0.312      0.298     47
##  6     6  0.620      0.638     47
##  7     7  0.885      0.913     46
##  8     8  1.000      1         46
##  9     9  1.000      1         46
## 10    10  1.000      1         46

The model shows that pollution (nox) remains the strongest predictor of high-crime areas, while large-lot zoning (zn_log) continues to have a protective effect. The interaction terms reveal that the effect of industry (indus_high) depends on highway access: low- to moderate-access industrial zones are less likely to be high-crime areas. Additionally, the relationship between housing value and crime is amplified in areas with high accessibility. Since nox has been significant in every model, it is worth noting that its very large coefficient simply reflects the narrow scale of the variable (roughly 0.39 to 0.87) rather than an outsized effect. The enormous standard error on rad_groupHigh:indus_high, together with the glm.fit warning during fitting, indicates quasi-complete separation for that combination, so that coefficient should not be interpreted.
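For example, since nox spans only about 0.39 to 0.87, its coefficient is better read per small increment; a quick sketch of the implied odds ratio:

# Odds ratio for a 0.01-unit increase in nox under Model 3 (~1.6x)
exp(coef(m_forward)["nox"] * 0.01)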

Model Selection

##     Model      AIC       AUC      Brier
## 1 Model 1 277.9152 0.9499659 0.09085994
## 2 Model 2 247.2653 0.9625965 0.07526229
## 3 Model 3 212.9962 0.9744256 0.06138835
## 
## Hosmer-Lemeshow p-values (larger p > 0.05 indicates no evidence of poor fit):
## [1] "Model1:"
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  crime_train$target, fitted(m1)
## X-squared = 16.235, df = 8, p-value = 0.03914
## [1] "Model2:"
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  crime_train$target, fitted(m2_step)
## X-squared = 9.723, df = 8, p-value = 0.285
## [1] "Model3:"
## 
##  Hosmer and Lemeshow goodness of fit (GOF) test
## 
## data:  crime_train$target, fitted(m_full)
## X-squared = 8.393, df = 8, p-value = 0.3961

To determine the best model, I wanted to consider both predictive performance and interpretability. Looking at these results, model three performs best on every metric. It has the lowest AIC, indicating the best balance between fit and complexity, and the lowest Brier score, indicating the best-calibrated probabilities. It also has the highest AUC, highlighting its discriminating ability, and the largest p-value on the Hosmer-Lemeshow test, indicating that the predicted probabilities align well with the observed data. Across the confusion matrices, model three makes the fewest errors. As for interpretability, model three has its complexity, particularly the three interaction terms among its eleven coefficients, two of which are categories of the same variable, one significant and one not. While this is not a simple model, each variable was included because of statistical significance or theorized relevance, and the interaction terms capture differences the simpler models missed, as evidenced by this model's superior accuracy.
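For reference, the Brier scores above are simply the mean squared gap between fitted probability and observed outcome; for the chosen model:

# Brier score for Model 3 on the training data (matches the table above)
mean((fitted(m_forward) - crime_train$target)^2)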

Evaluation Set

Predicted classes for the evaluation dataset (first ten rows):

target: 0, 1, 1, 0, 0, 0, 0, 0, 0, 0

Code Appendix

library(MASS)               # load before tidyverse so dplyr::select() is not masked
library(tidyverse)
library(corrplot)
library(knitr)
library(car)
library(pROC)
library(caret)              # confusionMatrix()
library(ResourceSelection)  # hoslem.test()

# Data Exploration
# Load data
crime_train <- read.csv("https://raw.githubusercontent.com/ddebonis47/classwork/refs/heads/main/crime-training-data_modified.csv")
crime_eval  <- read.csv("https://raw.githubusercontent.com/ddebonis47/classwork/refs/heads/main/crime-evaluation-data_modified.csv")
numeric_vars <- crime_train |> 
  select(where(is.numeric))
numeric_long <- numeric_vars |>
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

# Plot all distributions in a grid
ggplot(numeric_long, aes(x = Value)) +
  geom_histogram(bins = 30, fill = "#4E79A7", color = "white", alpha = 0.7) +
  facet_wrap(~Variable, scales = "free", ncol =4) +
  labs(title = "Distribution of Numeric Variables",
       x = "Value", y = "Count") +
  theme_minimal()
cor_matrix <- cor(numeric_vars, use = "pairwise.complete.obs")
corrplot(cor_matrix, method = "color", tl.col = "black", type = "lower")
corr_threshold <- 0.6
target_corr <- cor(crime_train %>% select(where(is.numeric)), use = "pairwise.complete.obs")["target", ]
print(sort(target_corr, decreasing = TRUE))
boxplot_vars <- names(target_corr[abs(target_corr) > corr_threshold & names(target_corr) != "target"])
crime_train_long <- crime_train |>
  select(all_of(boxplot_vars), target) |>
  pivot_longer(-target, names_to = "Variable", values_to = "Value")

ggplot(crime_train_long, aes(x = factor(target), y = Value, fill = factor(target))) +
  geom_boxplot(outlier.shape = NA, alpha = 0.7) +
  facet_wrap(~Variable, scales = "free", ncol = 2) +
  scale_fill_manual(values = c("#4E79A7", "#E15759"),
                    labels = c("Low Crime", "High Crime")) +
  labs(x = "Crime Risk (Target)", y = "Value",
       title = "Distribution of Key Predictors by Crime Risk") +
  theme_minimal()
ggplot(crime_train, aes(x = tax, y = target)) +
  geom_jitter(height = 0.05, alpha = 0.4) +
  geom_smooth(method = "loess", color = "red") +
  labs(title = "Relationship between Property Tax and Crime Risk")

ggplot(crime_train, aes(x = rad, y = target)) +
  geom_jitter(height = 0.05, alpha = 0.4) +
  geom_smooth(method = "loess", color = "blue") +
  labs(title = "Relationship between Highway Access and Crime Risk")

ggplot(crime_train, aes(x = indus, y = target)) +
  geom_jitter(height = 0.05, alpha = 0.4) +
  geom_smooth(method = "loess", color = "green") +
  labs(title = "Relationship between Industry Presence and Crime Risk")

ggplot(crime_train, aes(x = ptratio, y = target)) +
  geom_jitter(height = 0.05, alpha = 0.4) +
  geom_smooth(method = "loess", color = "brown") +
  labs(title = "Relationship between Pupil:Teacher Ratio and Crime Risk")

ggplot(crime_train, aes(x = zn, y = target)) +
  geom_jitter(height = 0.05, alpha = 0.4) +
  geom_smooth(method = "loess", color = "purple") +
  labs(title = "Relationship between Large Lot Zoning and Crime Risk")

# Data Transformation
# zn
crime_train <- crime_train |>
  mutate(
    zn_log = log1p(zn)  # compress large values, handle zeros
  )

# rad
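# Note: no training rows exceed rad = 24, so the "Extreme" level is empty here
# and serves only as a guard for possible unseen values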
crime_train$rad_group <- cut(crime_train$rad,
                             breaks = c(-Inf, 5, 24, Inf),
                             labels = c("Low_Mid", "High", "Extreme"))

crime_train$rad_group <- factor(crime_train$rad_group)


# tax
crime_train <- crime_train |>
  mutate(tax_high = ifelse(tax > 600, 1, 0))

# indus
crime_train <- crime_train |>
  mutate(indus_high = ifelse(indus > 18, 1, 0))

# ptratio
crime_train <- crime_train |>
  mutate(ptratio_high = ifelse(ptratio > 20, 1, 0))

# rooms_per_age: rooms per age ratio
crime_train <- crime_train |>
  mutate(rooms_per_age = rm / (age + 1))

#age
ggplot(crime_train, aes(x = age, y = target)) +
  geom_jitter(height = 0.05, alpha = 0.4) +
  geom_smooth(method = "loess", color = "purple") +
  labs(title = "Relationship between Housing Age and Crime Risk")
# Options for crime_train$age

# Apply log transformation
crime_train$age_log <- log1p(crime_train$age)

# Or apply square root
crime_train$age_sqrt <- sqrt(crime_train$age)

# Binned
crime_train$age_bin <- cut(crime_train$age,
                           breaks = c(0, 30, 60, 100),
                           labels = c("Low", "Medium", "High"))

# Quadratic term
crime_train <- crime_train |>
  mutate(age_sq = age^2)

# Centered
crime_train$age_c <- crime_train$age - mean(crime_train$age)
crime_train$age_c_sq <- crime_train$age_c^2

# Build Models
#evaluation
eval_model <- function(model, data, cutoff = 0.5, label = "Model") {
  # Predictions and probabilities
  probs <- predict(model, newdata = data, type = "response")
  preds <- ifelse(probs >= cutoff, 1, 0)
  
  # Confusion matrix
  cm <- table(Predicted = preds, Actual = data$target)
  cat("\n---", label, "Confusion Matrix (cutoff =", cutoff, ")---\n")
  print(cm)
  
  # Metrics
  cmobj <- confusionMatrix(as.factor(preds), as.factor(data$target), positive = "1")
  print(cmobj$byClass)  # sensitivity, specificity, etc
  cat("Accuracy:", cmobj$overall["Accuracy"], "\n")
  
  # AUC
  rocobj <- roc(data$target, probs)
  cat("AUC:", auc(rocobj), "\n")
  
  # Brier score
  brier <- mean((probs - data$target)^2)
  cat("Brier score:", round(brier,4), "\n")
  
  # Calibration by decile
  data_cal <- data.frame(prob = probs, actual = data$target) %>%
    mutate(dec = ntile(prob, 10)) %>%
    group_by(dec) %>%
    summarise(mean_prob = mean(prob), observed = mean(actual), n = n())
  print(data_cal)
  
  # return list
  list(confusion = cm, caret = cmobj, roc = rocobj, brier = brier, cal_table = data_cal)
}

# Odds ratio helper
print_or <- function(model) {
  coefs <- coef(summary(model))
  OR <- exp(coef(model))
  ci <- exp(confint(model))
  or_table <- data.frame(Estimate = coef(model), OR = OR, CI_low = ci[,1], CI_high = ci[,2], p_value = coefs[,4])
  return(or_table)
}
## model 1
m1_formula <- as.formula("target ~ nox + tax_high + indus + dis + rad_group + age_bin")
m1 <- glm(m1_formula, data = crime_train, family = binomial)
cat("\n### Model 1 summary ###\n")
print(summary(m1))

cat("\nVIFs for Model 1:\n")
print(vif(m1))

m1_eval <- eval_model(m1, crime_train, cutoff = 0.5, label = "Model 1")

## model 2
m2_full_formula <- as.formula(
  "target ~ nox + lstat + rm + medv + dis + zn_log + chas + rad_group + age_c + age_c_sq + tax_high + ptratio_high + indus_high + rooms_per_age"
)
m2_full <- glm(m2_full_formula, data = crime_train, family = binomial)

# Use stepAIC both directions to select a model by AIC
m2_step <- stepAIC(m2_full, direction = "both", trace = FALSE)
cat("\n### Model 2 (stepAIC) chosen formula ###\n")
print(formula(m2_step))
cat("\nModel 2 summary:\n")
print(summary(m2_step))
cat("\nVIFs for Model 2:\n")
print(vif(m2_step))

m2_eval <- eval_model(m2_step, crime_train, cutoff = 0.5, label = "Model 2")

## model 3
# Start with an empty model (intercept only)
m0 <- glm(target ~ 1, data = crime_train, family = binomial)

# Full model with all candidate predictors
m_full <- glm(target ~ medv + nox + dis + zn_log + chas + rad_group + 
                age_c + ptratio_high + rad_group:indus_high + nox:dis + rad_group:medv,
              data = crime_train, family = binomial)

# Forward selection based on AIC
m_forward <- stepAIC(m0, 
                     scope = list(lower = m0, upper = m_full), 
                     direction = "forward", 
                     trace = TRUE)

summary(m_forward)
m3_eval <- eval_model(m_forward, crime_train, cutoff = 0.5, label = "Model 3")

# Select Model
comparison <- data.frame(
  Model = c("Model 1", "Model 2", "Model 3"),
  AIC = c(AIC(m1), AIC(m2_step), AIC(m_forward)),
  AUC = c(auc(m1_eval$roc), auc(m2_eval$roc), auc(m3_eval$roc)),
  Brier = c(m1_eval$brier, m2_eval$brier, m3_eval$brier)
)
print(comparison)

cat("\nHosmer-Lemeshow p-values (larger p > 0.05 indicates no evidence of poor fit):\n")
print("Model1:")
print(hoslem.test(crime_train$target, fitted(m1), g=10))
print("Model2:")
print(hoslem.test(crime_train$target, fitted(m2_step), g=10))
print("Model3:")
print(hoslem.test(crime_train$target, fitted(m_forward), g=10))

## Adding derived variables to crime_eval
# zn
crime_eval <- crime_eval |>
  mutate(
    zn_log = log1p(zn)  # compress large values, handle zeros
  )

# rad
crime_eval$rad_group <- cut(crime_eval$rad,
                             breaks = c(-Inf, 5, 24, Inf),
                             labels = c("Low_Mid", "High", "Extreme"))

crime_eval$rad_group <- factor(crime_eval$rad_group)

# tax
crime_eval <- crime_eval |>
  mutate(tax_high = ifelse(tax > 600, 1, 0))

# indus
crime_eval <- crime_eval |>
  mutate(indus_high = ifelse(indus > 18, 1, 0))

# ptratio
crime_eval <- crime_eval |>
  mutate(ptratio_high = ifelse(ptratio > 20, 1, 0))

# age: centered using the training-set mean so the predictor matches the model's scale
crime_eval$age_c <- crime_eval$age - mean(crime_train$age)

# Predict with the forward-selected Model 3
pred_probs <- predict(m_forward, newdata = crime_eval, type = "response")

pred_class <- ifelse(pred_probs > 0.5, 1, 0)

crime_eval$target <- pred_class
kable(head(crime_eval["target"], 10),
      caption = "Predicted classes for the evaluation dataset")

write.csv(crime_eval, "crime_eval_predictions.csv", row.names = FALSE)