There are 466 rows and 13 columns found in the crime training data set. Each column represents a variable, and there are 12 predictor variables and 1 response variable.
Response Variable
target: whether the crime rate is above the median crime rate (1) or not (0) (response variable)
Predictor Variables
zn: proportion of residential land zoned for large lots (over 25000 square feet)
indus: proportion of non-retail business acres per suburb
chas: a dummy variable for whether the suburb borders the Charles River (1) or not (0)
nox: nitrogen oxides concentration (parts per 10 million)
rm: average number of rooms per dwelling
age: proportion of owner-occupied units built prior to 1940
dis: weighted mean of distances to five Boston employment centers
rad: index of accessibility to radial highways
tax: full-value property-tax rate per $10,000
ptratio: pupil-teacher ratio by town
lstat: lower status of the population (percent)
medv: median value of owner-occupied homes in $1000s
Is there are missing data in training data set?
##    Variable miss_data_Values
## 1        zn                0
## 2     indus                0
## 3      chas                0
## 4       nox                0
## 5        rm                0
## 6       age                0
## 7       dis                0
## 8       rad                0
## 9       tax                0
## 10  ptratio                0
## 11    lstat                0
## 12     medv                0
## 13   target                0
There are no missing data values found in any of the predictor variables and the response variable.
| vars | n | mean | sd | median | trimmed | |
|---|---|---|---|---|---|---|
| zn | 1 | 466 | 11.5772532 | 23.3646511 | 0.00000 | 5.3542781 | 
| indus | 2 | 466 | 11.1050215 | 6.8458549 | 9.69000 | 10.9082353 | 
| chas | 3 | 466 | 0.0708155 | 0.2567920 | 0.00000 | 0.0000000 | 
| nox | 4 | 466 | 0.5543105 | 0.1166667 | 0.53800 | 0.5442684 | 
| rm | 5 | 466 | 6.2906738 | 0.7048513 | 6.21000 | 6.2570615 | 
| age | 6 | 466 | 68.3675966 | 28.3213784 | 77.15000 | 70.9553476 | 
| dis | 7 | 466 | 3.7956929 | 2.1069496 | 3.19095 | 3.5443647 | 
| rad | 8 | 466 | 9.5300429 | 8.6859272 | 5.00000 | 8.6978610 | 
| tax | 9 | 466 | 409.5021459 | 167.9000887 | 334.50000 | 401.5080214 | 
| ptratio | 10 | 466 | 18.3984979 | 2.1968447 | 18.90000 | 18.5970588 | 
| lstat | 11 | 466 | 12.6314592 | 7.1018907 | 11.35000 | 11.8809626 | 
| medv | 12 | 466 | 22.5892704 | 9.2396814 | 21.20000 | 21.6304813 | 
| target | 13 | 466 | 0.4914163 | 0.5004636 | 0.00000 | 0.4893048 | 
| mad | min | max | range | skew | kurtosis | se | |
|---|---|---|---|---|---|---|---|
| zn | 0.0000000 | 0.0000 | 100.0000 | 100.0000 | 2.1768152 | 3.8135765 | 1.0823466 | 
| indus | 9.3403800 | 0.4600 | 27.7400 | 27.2800 | 0.2885450 | -1.2432132 | 0.3171281 | 
| chas | 0.0000000 | 0.0000 | 1.0000 | 1.0000 | 3.3354899 | 9.1451313 | 0.0118957 | 
| nox | 0.1334340 | 0.3890 | 0.8710 | 0.4820 | 0.7463281 | -0.0357736 | 0.0054045 | 
| rm | 0.5166861 | 3.8630 | 8.7800 | 4.9170 | 0.4793202 | 1.5424378 | 0.0326516 | 
| age | 30.0226500 | 2.9000 | 100.0000 | 97.1000 | -0.5777075 | -1.0098814 | 1.3119625 | 
| dis | 1.9144814 | 1.1296 | 12.1265 | 10.9969 | 0.9988926 | 0.4719679 | 0.0976026 | 
| rad | 1.4826000 | 1.0000 | 24.0000 | 23.0000 | 1.0102788 | -0.8619110 | 0.4023678 | 
| tax | 104.5233000 | 187.0000 | 711.0000 | 524.0000 | 0.6593136 | -1.1480456 | 7.7778214 | 
| ptratio | 1.9273800 | 12.6000 | 22.0000 | 9.4000 | -0.7542681 | -0.4003627 | 0.1017669 | 
| lstat | 7.0720020 | 1.7300 | 37.9700 | 36.2400 | 0.9055864 | 0.5033688 | 0.3289887 | 
| medv | 6.0045300 | 5.0000 | 50.0000 | 45.0000 | 1.0766920 | 1.3737825 | 0.4280200 | 
| target | 0.0000000 | 0.0000 | 1.0000 | 1.0000 | 0.0342293 | -2.0031131 | 0.0231835 | 
From the box plot of the training data, we can see that the many of the variables are skewed in their distribution. Variables such as medv, rm and zn have an enormous amount of outliers. However, let’s group the variables by the target variable to test if it plays a factor.
Through grouping the predictors in our training data set with the response variable, we can see that the variables are correlated to the target. This means that while before we only saw that there were a ton of outliers in some of our variables, they are related to the data in a significant way and should be included in our modeling.
To check if the data is correlated to the target variable or to any other variable, we can begin by creating a correlation matrix.
Positive Correlations with Target: Variables zn and chas show positive correlations with crime rate. This indicates that areas with larger residential lots and proximity to the Charles River tend to have higher crime rates.
Negative Correlations with Target: Variables indus, nox, age, dis, rad, tax, ptratio, lstat, and medv show negative correlations with crime rate. This suggests that areas with higher industrial presence, pollution levels, older housing stock, greater distance from employment centers, better highway access, higher tax rates, higher pupil-teacher ratios, lower socio-economic status, and higher median home values tend to have lower crime rates.
Notably the correlation matrix reveals strong correlations between tax and rad at 0.91 and many other variables have significant correlation at 0.70, so these variable relationships should be taken into consideration building predictive models to avoid potential multicollinearity issues.
To identify these high correlations between variables, we devised a list of variable relationships to consider:
| var1 | var2 | correlation | 
|---|---|---|
| tax | rad | 0.906 | 
| rad | tax | 0.906 | 
| dis | nox | -0.769 | 
| nox | dis | -0.769 | 
| nox | indus | 0.760 | 
| indus | nox | 0.760 | 
| dis | age | -0.751 | 
| age | dis | -0.751 | 
| medv | lstat | -0.736 | 
| lstat | medv | -0.736 | 
| age | nox | 0.735 | 
| nox | age | 0.735 | 
| tax | indus | 0.732 | 
| indus | tax | 0.732 | 
| target | nox | 0.726 | 
| nox | target | 0.726 | 
| medv | rm | 0.705 | 
| rm | medv | 0.705 | 
| dis | indus | -0.704 | 
| indus | dis | -0.704 | 
The goal of this step was to clean the data and prepare it for accurate modeling. We checked for missing values, improved variable quality through transformations and new features, and reduced overlap between predictors to avoid multicollinearity.
Missing Data: We checked each column in the training dataset for missing values. All results were zero, meaning the dataset was complete and required no imputation.
Transformations: Two variables, NOX (air pollution) and LSTAT (lower-status population), were right-skewed. Log transformations were applied to both to reduce skew and make relationships more linear.
Feature Engineering: Two new variables were created:
- rm_sq: RM squared (RM²) to capture a curved relationship
between housing size and crime - rm_lstat_i: RM × LSTAT to
test whether the impact of housing size changes across different income
levels
Multicollinearity Reduction: Several variables were highly correlated. To improve model stability, we: - Kept RAD, dropped TAX (both measured road access, correlation = 0.91) - Kept NOX, dropped INDUS (both described industrial activity) - Kept LSTAT, dropped MEDV (both reflected neighborhood wealth)
# Create transformed training data
crime_train <- crime_train_data %>%
  mutate(
    target = factor(target, levels = c(0, 1)),
    nox_log = log(nox),
    lstat_log = log1p(lstat)
  ) %>%
  select(-tax, -indus, -medv, -nox, -lstat) %>%
  mutate(
    rm_sq = rm^2,
    rm_lstat_i = rm * lstat_log
  )
# Create transformed evaluation data
crime_eval <- crime_eval_data %>%
  mutate(
    nox_log = log(nox),
    lstat_log = log1p(lstat)
  ) %>%
  select(-tax, -indus, -medv, -nox, -lstat) %>%
  mutate(
    rm_sq = rm^2,
    rm_lstat_i = rm * lstat_log
  )
# Check VIF on core variables (excluding interaction terms)
library(car)
vif_values <- vif(glm(target ~ zn + chas + rm + age + dis + rad + ptratio + 
                      nox_log + lstat_log, 
                      data = crime_train, family = binomial()))
knitr::kable(data.frame(Variable = names(vif_values), VIF = vif_values),
             caption = "Variance Inflation Factors After Preparation",
             digits = 2, row.names = FALSE)
| Variable | VIF | 
|---|---|
| zn | 1.74 | 
| chas | 1.14 | 
| rm | 2.90 | 
| age | 2.04 | 
| dis | 3.20 | 
| rad | 1.23 | 
| ptratio | 1.60 | 
| nox_log | 3.12 | 
| lstat_log | 3.44 | 
After these adjustments, all Variance Inflation Factor (VIF) values are below 5, confirming that multicollinearity has been resolved.
The prepared dataset includes the following predictors:
Core variables: zn, chas, rm, age, dis, rad, ptratio
Transformed variables: nox_log, lstat_log
Engineered features (for Model 3): rm_sq, rm_lstat_i
The response variable is target (1 = high crime, 0 = low crime).
Models 1 and 2 use the core and transformed variables, while Model 3 additionally incorporates the engineered interaction and polynomial terms to test non-linear relationships.
This model includes all core transformed variables
# Full model with all prepared predictors
model1 <- glm(target ~ zn + chas + rm + age + dis + rad + ptratio + 
              nox_log + lstat_log, 
              data = crime_train, 
              family = binomial(link = "logit"))
summary(model1)
## 
## Call:
## glm(formula = target ~ zn + chas + rm + age + dis + rad + ptratio + 
##     nox_log + lstat_log, family = binomial(link = "logit"), data = crime_train)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.40666    4.53605  -0.310  0.75648    
## zn          -0.04957    0.02689  -1.843  0.06527 .  
## chas         1.22174    0.74450   1.641  0.10079    
## rm           0.84225    0.46020   1.830  0.06722 .  
## age          0.01738    0.01098   1.582  0.11356    
## dis          0.58463    0.18870   3.098  0.00195 ** 
## rad          0.48961    0.12264   3.992 6.55e-05 ***
## ptratio      0.16122    0.09508   1.696  0.08996 .  
## nox_log     19.22228    3.17966   6.045 1.49e-09 ***
## lstat_log   -0.34144    0.71117  -0.480  0.63115    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 214.38  on 456  degrees of freedom
## AIC: 234.38
## 
## Number of Fisher Scoring iterations: 8
# Check model diagnostics
cat("\nAIC:", AIC(model1))
## 
## AIC: 234.3816
cat("\nBIC:", BIC(model1))
## 
## BIC: 275.8234
Rationale: This baseline model uses all available predictors to establish a benchmark. It helps us understand which variables are significant before attempting reduction.
This model uses automated variable selection to find the optimal subset of predictors.
# Stepwise selection (both directions)
model2 <- step(model1, direction = "both", trace = 0)
summary(model2)
## 
## Call:
## glm(formula = target ~ zn + chas + rm + age + dis + rad + ptratio + 
##     nox_log, family = binomial(link = "logit"), data = crime_train)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.968582   3.153009  -0.942  0.34644    
## zn          -0.051578   0.026991  -1.911  0.05601 .  
## chas         1.172696   0.730163   1.606  0.10826    
## rm           1.002436   0.320050   3.132  0.00174 ** 
## age          0.014665   0.009396   1.561  0.11857    
## dis          0.579245   0.187908   3.083  0.00205 ** 
## rad          0.484114   0.122096   3.965 7.34e-05 ***
## ptratio      0.153904   0.093670   1.643  0.10037    
## nox_log     19.059526   3.150708   6.049 1.45e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 214.61  on 457  degrees of freedom
## AIC: 232.61
## 
## Number of Fisher Scoring iterations: 8
cat("\nAIC:", AIC(model2))
## 
## AIC: 232.6136
cat("\nBIC:", BIC(model2))
## 
## BIC: 269.9113
# Show which variables were retained
cat("\nVariables retained in Model 2:\n")
## 
## Variables retained in Model 2:
names(coef(model2))
## [1] "(Intercept)" "zn"          "chas"        "rm"          "age"        
## [6] "dis"         "rad"         "ptratio"     "nox_log"
Rationale: Stepwise selection balances model complexity with performance by automatically removing non-significant predictors based on AIC. This often produces a more parsimonious model.
This model is based on domain knowledge and includes interaction terms that make theoretical sense.
# Model with key predictors and interactions
model3 <- glm(target ~ rad + nox_log + lstat_log + rm + rm_sq + 
              rm_lstat_i + dis + age, 
              data = crime_train, 
              family = binomial(link = "logit"))
summary(model3)
## 
## Call:
## glm(formula = target ~ rad + nox_log + lstat_log + rm + rm_sq + 
##     rm_lstat_i + dis + age, family = binomial(link = "logit"), 
##     data = crime_train)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  95.95709   41.82573   2.294  0.02178 *  
## rad           0.46541    0.11477   4.055 5.01e-05 ***
## nox_log      16.84314    2.95326   5.703 1.18e-08 ***
## lstat_log    -8.19368    6.75289  -1.213  0.22499    
## rm          -25.29874   10.75599  -2.352  0.01867 *  
## rm_sq         1.74221    0.66427   2.623  0.00872 ** 
## rm_lstat_i    1.18709    1.06221   1.118  0.26375    
## dis           0.40312    0.16074   2.508  0.01214 *  
## age           0.02871    0.01160   2.475  0.01333 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 214.72  on 457  degrees of freedom
## AIC: 232.72
## 
## Number of Fisher Scoring iterations: 8
cat("\nAIC:", AIC(model3))
## 
## AIC: 232.7167
cat("\nBIC:", BIC(model3))
## 
## BIC: 270.0144
Rationale:
rad: Highway accessibility is strongly correlated with crime
nox_log: Air pollution indicates industrial/urban areas
lstat_log: Socioeconomic status is a key crime predictor
rm + rm_sq: Captures non-linear relationship between housing size and crime
rm_lstat_i: Tests if housing size effect varies by neighborhood wealth
dis: Distance from employment centers
age: Older housing stock may indicate different crime patterns
# Create a comparison table of coefficients
library(broom)
coef_comparison <- bind_rows(
  tidy(model1) %>% mutate(Model = "Model 1: Full"),
  tidy(model2) %>% mutate(Model = "Model 2: Stepwise"),
  tidy(model3) %>% mutate(Model = "Model 3: Interactions")
) %>%
  select(Model, term, estimate, std.error, p.value) %>%
  arrange(term, Model)
knitr::kable(coef_comparison, digits = 3, 
             caption = "Coefficient Comparison Across Models")
| Model | term | estimate | std.error | p.value | 
|---|---|---|---|---|
| Model 1: Full | (Intercept) | -1.407 | 4.536 | 0.756 | 
| Model 2: Stepwise | (Intercept) | -2.969 | 3.153 | 0.346 | 
| Model 3: Interactions | (Intercept) | 95.957 | 41.826 | 0.022 | 
| Model 1: Full | age | 0.017 | 0.011 | 0.114 | 
| Model 2: Stepwise | age | 0.015 | 0.009 | 0.119 | 
| Model 3: Interactions | age | 0.029 | 0.012 | 0.013 | 
| Model 1: Full | chas | 1.222 | 0.745 | 0.101 | 
| Model 2: Stepwise | chas | 1.173 | 0.730 | 0.108 | 
| Model 1: Full | dis | 0.585 | 0.189 | 0.002 | 
| Model 2: Stepwise | dis | 0.579 | 0.188 | 0.002 | 
| Model 3: Interactions | dis | 0.403 | 0.161 | 0.012 | 
| Model 1: Full | lstat_log | -0.341 | 0.711 | 0.631 | 
| Model 3: Interactions | lstat_log | -8.194 | 6.753 | 0.225 | 
| Model 1: Full | nox_log | 19.222 | 3.180 | 0.000 | 
| Model 2: Stepwise | nox_log | 19.060 | 3.151 | 0.000 | 
| Model 3: Interactions | nox_log | 16.843 | 2.953 | 0.000 | 
| Model 1: Full | ptratio | 0.161 | 0.095 | 0.090 | 
| Model 2: Stepwise | ptratio | 0.154 | 0.094 | 0.100 | 
| Model 1: Full | rad | 0.490 | 0.123 | 0.000 | 
| Model 2: Stepwise | rad | 0.484 | 0.122 | 0.000 | 
| Model 3: Interactions | rad | 0.465 | 0.115 | 0.000 | 
| Model 1: Full | rm | 0.842 | 0.460 | 0.067 | 
| Model 2: Stepwise | rm | 1.002 | 0.320 | 0.002 | 
| Model 3: Interactions | rm | -25.299 | 10.756 | 0.019 | 
| Model 3: Interactions | rm_lstat_i | 1.187 | 1.062 | 0.264 | 
| Model 3: Interactions | rm_sq | 1.742 | 0.664 | 0.009 | 
| Model 1: Full | zn | -0.050 | 0.027 | 0.065 | 
| Model 2: Stepwise | zn | -0.052 | 0.027 | 0.056 | 
# Exponentiate coefficients to get odds ratios
cat("\n### Odds Ratios for Model 2 (Stepwise):\n")
## 
## ### Odds Ratios for Model 2 (Stepwise):
odds_ratios <- exp(coef(model2))
print(round(odds_ratios, 3))
##  (Intercept)           zn         chas           rm          age          dis 
## 5.100000e-02 9.500000e-01 3.231000e+00 2.725000e+00 1.015000e+00 1.785000e+00 
##          rad      ptratio      nox_log 
## 1.623000e+00 1.166000e+00 1.894292e+08
# Extract signs and significance for Model 2 (most parsimonious)
model2_coefs <- tidy(model2) %>%
  filter(term != "(Intercept)") %>%
  mutate(
    sign = ifelse(estimate > 0, "Positive", "Negative"),
    significant = ifelse(p.value < 0.05, "Yes", "No")
  ) %>%
  select(term, estimate, sign, significant, p.value)
knitr::kable(model2_coefs, digits = 3,
             caption = "Model 2 Coefficient Analysis")
| term | estimate | sign | significant | p.value | 
|---|---|---|---|---|
| zn | -0.052 | Negative | No | 0.056 | 
| chas | 1.173 | Positive | No | 0.108 | 
| rm | 1.002 | Positive | Yes | 0.002 | 
| age | 0.015 | Positive | No | 0.119 | 
| dis | 0.579 | Positive | Yes | 0.002 | 
| rad | 0.484 | Positive | Yes | 0.000 | 
| ptratio | 0.154 | Positive | No | 0.100 | 
| nox_log | 19.060 | Positive | Yes | 0.000 | 
Expected Relationships:
rad (highway access) - POSITIVE coefficient expected
Better highway access → more crime
Makes sense: easier escape routes, more transient population
nox_log (pollution) - POSITIVE coefficient expected
Higher pollution → industrial areas → more crime
Makes sense: industrial areas often have higher crime
lstat_log (lower status %) - POSITIVE coefficient expected
Lower socioeconomic status → more crime
Makes sense: economic hardship correlates with crime
rm (rooms) - NEGATIVE coefficient expected
Larger homes → wealthier areas → less crime
Makes sense: housing size indicates affluence
dis (distance to employment) - NEGATIVE coefficient expected
Further from city centers → suburban areas → less crime
Makes sense: urban cores typically have higher crime
age (old housing %) - POSITIVE coefficient expected
Older housing stock → deteriorated neighborhoods → more crime
Check if this holds in your models
Counter-Intuitive Results to Watch For:
If any of these appear opposite to expectations, consider:
Is the variable capturing something else due to multicollinearity?
Could there be a non-linear relationship we’re missing?
Does it make sense in the context of Boston neighborhoods?
# Compare model performance metrics
model_comparison <- data.frame(
  Model = c("Model 1: Full", "Model 2: Stepwise", "Model 3: Interactions"),
  Variables = c(
    length(coef(model1)) - 1,
    length(coef(model2)) - 1,
    length(coef(model3)) - 1
  ),
  AIC = c(AIC(model1), AIC(model2), AIC(model3)),
  BIC = c(BIC(model1), BIC(model2), BIC(model3)),
  Deviance = c(deviance(model1), deviance(model2), deviance(model3))
)
knitr::kable(model_comparison, digits = 2,
             caption = "Model Comparison Statistics")
| Model | Variables | AIC | BIC | Deviance | 
|---|---|---|---|---|
| Model 1: Full | 9 | 234.38 | 275.82 | 214.38 | 
| Model 2: Stepwise | 8 | 232.61 | 269.91 | 214.61 | 
| Model 3: Interactions | 8 | 232.72 | 270.01 | 214.72 | 
Why Keep Counter-Intuitive Variables?
If a variable has a counter-intuitive coefficient but:
Is statistically significant (p < 0.05)
Improves model fit (lower AIC/BIC)
Makes sense when controlling for other variables
Then YES, keep it. Explain to the boss that:
The simple bivariate relationship may be misleading
After controlling for confounders, the true effect emerges
The data tells a more complex story than intuition suggests
# Generate predictions for all three models on training data
pred_prob1 <- predict(model1, crime_train, type = "response")
pred_prob2 <- predict(model2, crime_train, type = "response")
pred_prob3 <- predict(model3, crime_train, type = "response")
# Generate predictions for evaluation data
eval_pred_prob1 <- predict(model1, crime_eval, type = "response")
eval_pred_prob2 <- predict(model2, crime_eval, type = "response")
eval_pred_prob3 <- predict(model3, crime_eval, type = "response")
# Apply 0.5 threshold for classifications
pred_class1 <- ifelse(pred_prob1 > 0.5, 1, 0)
pred_class2 <- ifelse(pred_prob2 > 0.5, 1, 0)
pred_class3 <- ifelse(pred_prob3 > 0.5, 1, 0)
eval_pred_class1 <- ifelse(eval_pred_prob1 > 0.5, 1, 0)
eval_pred_class2 <- ifelse(eval_pred_prob2 > 0.5, 1, 0)
eval_pred_class3 <- ifelse(eval_pred_prob3 > 0.5, 1, 0)
cat("Predictions prepared for Part 4 model evaluation\n")
## Predictions prepared for Part 4 model evaluation
Model Selection Our selected model is Model 2.
Predictive Performance: Model 3 was outperformed in every confusion matrix related metric such as precision, specificity, accuracy, sensitivity and F1 score. Model 1 performed the best, but was usually only slightly ahead of Model 2. The ROC curve for Model 3 had an irregular shape implying it predicts especially poorly sometimes.
Variables Align with Theoretical Effect: Model 3 has the most significant variables with 6/8. Model 1 only has 3/9 (although a few variables were close to the 0.05 threshold). Model 2 has 4/8 significant variables with one sitting at 0.056 and barely missing the cut.
Statistical Performance: Model 2 had the best AIC and BIC of the models compared while Model 1 was the worst. This is extra important because Model 2 is a reduced version of Model 1. Being the better fit indicates that unless Model 2 performed notably poorly elsewhere, it should be ahead of Model 1 as a choice. Model 3 performed respectably in AIC and BIC, but was still behind Model 2.
Model 2, by means of being top 2 or the best in every comparison, emerged as the most consistently good model we had. It was tied with Model 3 as our most parsimonious model with strong predictive ability. While the significant variable metrics boost Model 3, we need to consider purpose. We can view our decision as predictive vs. exploratory analysis. Model 3 might have been chosen if we wanted to study the relationship between our factors and high crime rates. Instead, we want to predict whether or not a neighborhood will have high crime rates. This means that Model 2 would be our simplest model that allows us to most accurately predict a neighborhood’s crime rate.
For binary logistic regression, we have the following criteria for selecting a model by goodness of fit:
For numeric criteria, if the values are similar, we will consider other factors as well to sway our decision. Similar values imply that the models may have similar goodness of fit levels. It is typical to examine these values while creating models, but we will review them here:
knitr::kable(model_comparison, digits = 2,
             caption = "Model Comparison Statistics")
| Model | Variables | AIC | BIC | Deviance | 
|---|---|---|---|---|
| Model 1: Full | 9 | 234.38 | 275.82 | 214.38 | 
| Model 2: Stepwise | 8 | 232.61 | 269.91 | 214.61 | 
| Model 3: Interactions | 8 | 232.72 | 270.01 | 214.72 | 
Lower is better for AIC and BIC. Accordingly, Model 2 is the best with the lowest values in each, but not by a huge margin over Model 3.
A ROC curve, along with the AUC, can be used to test how well the models make predictions. Let’s see how each model looks.
roc_curve <- function(model, title) {
  probabilities <- predict(model, type = "response")
  roc_obj <- roc(response = crime_train$target, predictor = probabilities)
  return (plot(roc_obj, main = title,col = "blue", lwd = 2, print.auc = TRUE))
}
roc1 <- roc_curve(model1, "ROC Curve for Model 1")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc2 <- roc_curve(model2, "ROC Curve for Model 2")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
roc3 <- roc_curve(model3, "ROC Curve for Model 3")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
The AUC is almost identical for all 3 models of over 0.96. They all have outstanding performance with a huge area under the curve, but Model 3 appears to have a high amount of false positives according to the depressed curve in the middle of the general curve. It’s still not a bad model, but something to consider later.
Additional metrics we can check are:
conf_matrix_summary <- function(predicted, title) {
  actual <- crime_train$target
  conf_matrix_result <- confusionMatrix(predicted, actual)
  accuracy_value <- conf_matrix_result$overall['Accuracy']
  precision_value <- conf_matrix_result$byClass['Pos Pred Value']
  sensitivity_value <- sensitivity(predicted, actual)
  specificity_value <- specificity(predicted, actual)
  f1_value <- conf_matrix_result$byClass['F1']
  cat(
    "\n\n", title,
    "\nAccuracy: ", accuracy_value, 
    "\nClassification error rate: ", 1 - accuracy_value,
    "\nPrecision: ", precision_value,
    "\nSensitivity: ", sensitivity_value,
    "\nSpecificity: ", specificity_value,
    "\nF1 score: ", f1_value
  )
}
conf_matrix_summary(as.factor(pred_class1), "Model 1 Summary")
## 
## 
##  Model 1 Summary 
## Accuracy:  0.8905579 
## Classification error rate:  0.1094421 
## Precision:  0.8811475 
## Sensitivity:  0.907173 
## Specificity:  0.8733624 
## F1 score:  0.8939709
conf_matrix_summary(as.factor(pred_class2), "Model 2 Summary")
## 
## 
##  Model 2 Summary 
## Accuracy:  0.8862661 
## Classification error rate:  0.1137339 
## Precision:  0.8739837 
## Sensitivity:  0.907173 
## Specificity:  0.8646288 
## F1 score:  0.8902692
conf_matrix_summary(as.factor(pred_class3), "Model 3 Summary")
## 
## 
##  Model 3 Summary 
## Accuracy:  0.8669528 
## Classification error rate:  0.1330472 
## Precision:  0.8486056 
## Sensitivity:  0.8987342 
## Specificity:  0.8340611 
## F1 score:  0.8729508
Model 1 performed the best in almost every metric. Model 1 was fit with the most variables, so that is not a surprise. Model 3 performed the worst in every metric we evaluated so far. Model 2 has the lowest AIC and BIC and otherwise has performed similarly to Model 1. It is our worst fit while the other two models perform more evenly.
We will now use McFadden’s Pseudo-R-squared to compare the models. This uses log-likelihood and is especially useful when you have nested models which is true of Models 1 and 2.
PseudoR2(model1, which = "McFadden")
##  McFadden 
## 0.6680762
PseudoR2(model2, which = "McFadden")
##  McFadden 
## 0.6677169
Model 1 has a higher value of 0.668 than the 0.667 of Model 2 indicating that it is the better fit.
We have established that in terms of fit, the order goes Model 1 > Model 2 > Model 3. We will now determine whether or not Model 1 is truly our best model.
anova_m1_2 <- anova(model2, model1, test = "Chisq")
anova_m1_2
## Analysis of Deviance Table
## 
## Model 1: target ~ zn + chas + rm + age + dis + rad + ptratio + nox_log
## Model 2: target ~ zn + chas + rm + age + dis + rad + ptratio + nox_log + 
##     lstat_log
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       457     214.61                     
## 2       456     214.38  1  0.23207     0.63
An ANOVA test can be used to see how significant the difference between the models is when one model is a subset of the other. Our deviance value of 0.232 is fairly low and suggests that Model 1 does not gain much from having the extra variable.
Interpretation:
The different variable, lstat_log, is now the point of contention. This is the log transformed lstat variable which indicates the percentage of the population that is of lower status. Theoretically, the higher this percentage is, the more crime is expected since there is more incentive to commit crime. Model 1 treats this as a negative coefficient. This provides some reason to doubt that Model 1 is actually our best choice. The improvement in predictive ability of Model 1 (confusion matrix) is relatively small over Model 2 with goodness of fit (AIC/BIC) actually getting worse. Model 2 would be the parsimonious pick even if we ignore the odd implications of the lstat_log predictor.
#final predictions from chosen model
submission <- crime_eval_data
submission$TARGET_CRIME <- eval_pred_prob2
submission$TARGET_CLASS <- eval_pred_class2
#saving the predictions to a new file
write.csv(submission, "crime-predictions.csv", row.names = FALSE)
cat("Final predictions saved.\n")
## Final predictions saved.