Part 1 - Data Exploration

The crime training data set contains 466 rows and 13 columns. Each column represents a variable: 12 predictor variables and 1 response variable.
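
A quick structural check confirms these dimensions (a minimal sketch, assuming the training CSV has already been read into crime_train_data, the data frame name used in the preparation code later in this report):

# Sketch: confirm dimensions and inspect variable types
dim(crime_train_data)   # expect 466 rows, 13 columns
str(crime_train_data)   # one line per variable: type plus sample values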

Response Variable

target: whether the crime rate is above the median crime rate (1) or not (0)

Predictor Variables

zn: proportion of residential land zoned for large lots (over 25,000 square feet)

indus: proportion of non-retail business acres per suburb

chas: a dummy variable for whether the suburb borders the Charles River (1) or not (0)

nox: nitrogen oxides concentration (parts per 10 million)

rm: average number of rooms per dwelling

age: proportion of owner-occupied units built prior to 1940

dis: weighted mean of distances to five Boston employment centers

rad: index of accessibility to radial highways

tax: full-value property-tax rate per $10,000

ptratio: pupil-teacher ratio by town

lstat: lower status of the population (percent)

medv: median value of owner-occupied homes in $1000s

Missing Data

Is there any missing data in the training data set?
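
One way to produce the summary below (a minimal sketch; crime_train_data is the data frame name used in the preparation code later in this report):

# Sketch: count missing (NA) values in each column of the training data
miss_summary <- data.frame(
  Variable = names(crime_train_data),
  miss_data_Values = colSums(is.na(crime_train_data)),
  row.names = NULL
)
miss_summary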

##    Variable miss_data_Values
## 1        zn                0
## 2     indus                0
## 3      chas                0
## 4       nox                0
## 5        rm                0
## 6       age                0
## 7       dis                0
## 8       rad                0
## 9       tax                0
## 10  ptratio                0
## 11    lstat                0
## 12     medv                0
## 13   target                0

No missing values were found in any of the predictor variables or in the response variable.

Descriptive Statistics - Part 1
vars n mean sd median trimmed
zn 1 466 11.5772532 23.3646511 0.00000 5.3542781
indus 2 466 11.1050215 6.8458549 9.69000 10.9082353
chas 3 466 0.0708155 0.2567920 0.00000 0.0000000
nox 4 466 0.5543105 0.1166667 0.53800 0.5442684
rm 5 466 6.2906738 0.7048513 6.21000 6.2570615
age 6 466 68.3675966 28.3213784 77.15000 70.9553476
dis 7 466 3.7956929 2.1069496 3.19095 3.5443647
rad 8 466 9.5300429 8.6859272 5.00000 8.6978610
tax 9 466 409.5021459 167.9000887 334.50000 401.5080214
ptratio 10 466 18.3984979 2.1968447 18.90000 18.5970588
lstat 11 466 12.6314592 7.1018907 11.35000 11.8809626
medv 12 466 22.5892704 9.2396814 21.20000 21.6304813
target 13 466 0.4914163 0.5004636 0.00000 0.4893048
Descriptive Statistics - Part 2
mad min max range skew kurtosis se
zn 0.0000000 0.0000 100.0000 100.0000 2.1768152 3.8135765 1.0823466
indus 9.3403800 0.4600 27.7400 27.2800 0.2885450 -1.2432132 0.3171281
chas 0.0000000 0.0000 1.0000 1.0000 3.3354899 9.1451313 0.0118957
nox 0.1334340 0.3890 0.8710 0.4820 0.7463281 -0.0357736 0.0054045
rm 0.5166861 3.8630 8.7800 4.9170 0.4793202 1.5424378 0.0326516
age 30.0226500 2.9000 100.0000 97.1000 -0.5777075 -1.0098814 1.3119625
dis 1.9144814 1.1296 12.1265 10.9969 0.9988926 0.4719679 0.0976026
rad 1.4826000 1.0000 24.0000 23.0000 1.0102788 -0.8619110 0.4023678
tax 104.5233000 187.0000 711.0000 524.0000 0.6593136 -1.1480456 7.7778214
ptratio 1.9273800 12.6000 22.0000 9.4000 -0.7542681 -0.4003627 0.1017669
lstat 7.0720020 1.7300 37.9700 36.2400 0.9055864 0.5033688 0.3289887
medv 6.0045300 5.0000 50.0000 45.0000 1.0766920 1.3737825 0.4280200
target 0.0000000 0.0000 1.0000 1.0000 0.0342293 -2.0031131 0.0231835
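
The two tables above follow the layout of psych::describe(); assuming that is how they were generated, a minimal sketch:

# Sketch: descriptive statistics in the format shown above
# (assumes the psych package produced these tables)
library(psych)
describe(crime_train_data)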

From the box plot of the training data, we can see that many of the variables have skewed distributions. Variables such as medv, rm, and zn have a large number of outliers. However, let’s group the variables by the target variable to test whether it plays a factor.

Grouping the predictors in our training data set by the response variable shows that the variables are related to the target. In other words, although some variables contain many outliers, those observations carry meaningful information about the target and should be retained for modeling.
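
A sketch of the grouped box plots described above (the tidyverse packages and the plotting choices here are illustrative assumptions):

# Sketch: box plots of each predictor, split by the target variable
library(dplyr); library(tidyr); library(ggplot2)
crime_train_data %>%
  pivot_longer(-target, names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = factor(target), y = value)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_y") +
  labs(x = "target")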

Correlation

To check whether the predictors are correlated with the target variable or with one another, we can begin by creating a correlation matrix.
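
A minimal sketch of that step, assuming all columns are still numeric at this point:

# Sketch: correlation matrix of all variables, rounded for readability
cor_matrix <- cor(crime_train_data)
round(cor_matrix, 2)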

Positive and Negative Correlations with Data

Positive Correlations with Target: Variables indus, nox, age, rad, tax, ptratio, and lstat show positive correlations with the crime indicator (for example, target and nox correlate at 0.726 in the pair list below); chas is only weakly positive. This indicates that areas with more industrial land, higher pollution, older housing stock, better highway access, higher tax rates, higher pupil-teacher ratios, and a larger lower-status population tend to have higher crime rates.

Negative Correlations with Target: Variables zn, rm, dis, and medv show negative correlations with the crime indicator. This suggests that areas zoned for large residential lots, with larger homes, farther from employment centers, and with higher median home values tend to have lower crime rates.

Notably, the correlation matrix reveals a strong correlation between tax and rad (0.91), and several other variable pairs exceed 0.70 in absolute value, so these relationships should be taken into account when building predictive models to avoid potential multicollinearity issues.

To identify these high correlations between variables, we devised a list of variable pairs to consider.
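
Such a list can be generated from the correlation matrix along these lines (a minimal sketch using cor_matrix from the sketch above; the |r| > 0.7 cutoff is inferred from the pairs reported below):

# Sketch: variable pairs with |correlation| above 0.7, each pair listed once
idx <- which(abs(cor_matrix) > 0.7 & upper.tri(cor_matrix), arr.ind = TRUE)
data.frame(
  var1 = rownames(cor_matrix)[idx[, 1]],
  var2 = colnames(cor_matrix)[idx[, 2]],
  correlation = round(cor_matrix[idx], 3)
)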

var1 var2 correlation
tax rad 0.906
dis nox -0.769
nox indus 0.760
dis age -0.751
medv lstat -0.736
age nox 0.735
tax indus 0.732
target nox 0.726
medv rm 0.705
dis indus -0.704

Part 2 - Data Preparation

The goal of this step was to clean the data and prepare it for accurate modeling. We checked for missing values, improved variable quality through transformations and new features, and reduced overlap between predictors to avoid multicollinearity.

Missing Data: We checked each column in the training dataset for missing values. All results were zero, meaning the dataset was complete and required no imputation.

Transformations: Two variables, nox (air pollution) and lstat (lower-status population), were right-skewed. Log transformations (log for nox, log1p for lstat) were applied to reduce skew and make their relationships with the response more linear.

Feature Engineering: Two new variables were created:

  • rm_sq: rm squared (rm²) to capture a curved relationship between housing size and crime

  • rm_lstat_i: rm × lstat to test whether the impact of housing size changes across different income levels

Multicollinearity Reduction: Several variables were highly correlated. To improve model stability, we:

  • Kept rad, dropped tax (the two are correlated at 0.91)

  • Kept nox, dropped indus (both describe industrial activity)

  • Kept lstat, dropped medv (both reflect neighborhood wealth)

# Create transformed training data
library(dplyr)  # for %>%, mutate(), and select()

crime_train <- crime_train_data %>%
  mutate(
    target = factor(target, levels = c(0, 1)),
    nox_log = log(nox),
    lstat_log = log1p(lstat)
  ) %>%
  select(-tax, -indus, -medv, -nox, -lstat) %>%
  mutate(
    rm_sq = rm^2,
    rm_lstat_i = rm * lstat_log
  )

# Create transformed evaluation data
crime_eval <- crime_eval_data %>%
  mutate(
    nox_log = log(nox),
    lstat_log = log1p(lstat)
  ) %>%
  select(-tax, -indus, -medv, -nox, -lstat) %>%
  mutate(
    rm_sq = rm^2,
    rm_lstat_i = rm * lstat_log
  )

# Check VIF on core variables (excluding interaction terms)
library(car)
vif_values <- vif(glm(target ~ zn + chas + rm + age + dis + rad + ptratio + 
                      nox_log + lstat_log, 
                      data = crime_train, family = binomial()))

knitr::kable(data.frame(Variable = names(vif_values), VIF = vif_values),
             caption = "Variance Inflation Factors After Preparation",
             digits = 2, row.names = FALSE)
Variance Inflation Factors After Preparation
Variable VIF
zn 1.74
chas 1.14
rm 2.90
age 2.04
dis 3.20
rad 1.23
ptratio 1.60
nox_log 3.12
lstat_log 3.44

After these adjustments, all variance inflation factor (VIF) values are below 5, indicating that severe multicollinearity is no longer a concern.

Final Dataset

The prepared dataset includes the following predictors:

Core variables: zn, chas, rm, age, dis, rad, ptratio

Transformed variables: nox_log, lstat_log

Engineered features (for Model 3): rm_sq, rm_lstat_i

The response variable is target (1 = high crime, 0 = low crime).

Models 1 and 2 use the core and transformed variables, while Model 3 additionally incorporates the engineered interaction and polynomial terms to test non-linear relationships.

Part 3 - Build Models

Model 1: Full Model with All Prepared Variables

This model includes all core and transformed variables.

# Full model with all prepared predictors
model1 <- glm(target ~ zn + chas + rm + age + dis + rad + ptratio +
                nox_log + lstat_log,
              data = crime_train,
              family = binomial(link = "logit"))

summary(model1)
## 
## Call:
## glm(formula = target ~ zn + chas + rm + age + dis + rad + ptratio + 
##     nox_log + lstat_log, family = binomial(link = "logit"), data = crime_train)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.40666    4.53605  -0.310  0.75648    
## zn          -0.04957    0.02689  -1.843  0.06527 .  
## chas         1.22174    0.74450   1.641  0.10079    
## rm           0.84225    0.46020   1.830  0.06722 .  
## age          0.01738    0.01098   1.582  0.11356    
## dis          0.58463    0.18870   3.098  0.00195 ** 
## rad          0.48961    0.12264   3.992 6.55e-05 ***
## ptratio      0.16122    0.09508   1.696  0.08996 .  
## nox_log     19.22228    3.17966   6.045 1.49e-09 ***
## lstat_log   -0.34144    0.71117  -0.480  0.63115    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 214.38  on 456  degrees of freedom
## AIC: 234.38
## 
## Number of Fisher Scoring iterations: 8
# Check model diagnostics
cat("\nAIC:", AIC(model1))
## 
## AIC: 234.3816
cat("\nBIC:", BIC(model1))
## 
## BIC: 275.8234

Rationale: This baseline model uses all available predictors to establish a benchmark. It helps us understand which variables are significant before attempting reduction.


Model 2: Stepwise Selection Model

This model uses automated variable selection to find the optimal subset of predictors.

# Stepwise selection (both directions)
model2 <- step(model1, direction = "both", trace = 0)

summary(model2)
## 
## Call:
## glm(formula = target ~ zn + chas + rm + age + dis + rad + ptratio + 
##     nox_log, family = binomial(link = "logit"), data = crime_train)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.968582   3.153009  -0.942  0.34644    
## zn          -0.051578   0.026991  -1.911  0.05601 .  
## chas         1.172696   0.730163   1.606  0.10826    
## rm           1.002436   0.320050   3.132  0.00174 ** 
## age          0.014665   0.009396   1.561  0.11857    
## dis          0.579245   0.187908   3.083  0.00205 ** 
## rad          0.484114   0.122096   3.965 7.34e-05 ***
## ptratio      0.153904   0.093670   1.643  0.10037    
## nox_log     19.059526   3.150708   6.049 1.45e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 214.61  on 457  degrees of freedom
## AIC: 232.61
## 
## Number of Fisher Scoring iterations: 8
cat("\nAIC:", AIC(model2))
## 
## AIC: 232.6136
cat("\nBIC:", BIC(model2))
## 
## BIC: 269.9113
# Show which variables were retained
cat("\nVariables retained in Model 2:\n")
## 
## Variables retained in Model 2:
names(coef(model2))
## [1] "(Intercept)" "zn"          "chas"        "rm"          "age"        
## [6] "dis"         "rad"         "ptratio"     "nox_log"

Rationale: Stepwise selection balances model complexity with performance by automatically removing non-significant predictors based on AIC. This often produces a more parsimonious model.


Model 3: Theory-Driven Model with Interactions

This model is based on domain knowledge and includes interaction terms that make theoretical sense.

# Model with key predictors and interactions
model3 <- glm(target ~ rad + nox_log + lstat_log + rm + rm_sq +
                rm_lstat_i + dis + age,
              data = crime_train,
              family = binomial(link = "logit"))

summary(model3)
## 
## Call:
## glm(formula = target ~ rad + nox_log + lstat_log + rm + rm_sq + 
##     rm_lstat_i + dis + age, family = binomial(link = "logit"), 
##     data = crime_train)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  95.95709   41.82573   2.294  0.02178 *  
## rad           0.46541    0.11477   4.055 5.01e-05 ***
## nox_log      16.84314    2.95326   5.703 1.18e-08 ***
## lstat_log    -8.19368    6.75289  -1.213  0.22499    
## rm          -25.29874   10.75599  -2.352  0.01867 *  
## rm_sq         1.74221    0.66427   2.623  0.00872 ** 
## rm_lstat_i    1.18709    1.06221   1.118  0.26375    
## dis           0.40312    0.16074   2.508  0.01214 *  
## age           0.02871    0.01160   2.475  0.01333 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 214.72  on 457  degrees of freedom
## AIC: 232.72
## 
## Number of Fisher Scoring iterations: 8
cat("\nAIC:", AIC(model3))
## 
## AIC: 232.7167
cat("\nBIC:", BIC(model3))
## 
## BIC: 270.0144

Rationale:

  • rad: Highway accessibility is strongly correlated with crime

  • nox_log: Air pollution indicates industrial/urban areas

  • lstat_log: Socioeconomic status is a key crime predictor

  • rm + rm_sq: Captures non-linear relationship between housing size and crime

  • rm_lstat_i: Tests if housing size effect varies by neighborhood wealth

  • dis: Distance from employment centers

  • age: Older housing stock may indicate different crime patterns


Coefficient Interpretation and Model Sense-Making

# Create a comparison table of coefficients
library(broom)

coef_comparison <- bind_rows(
  tidy(model1) %>% mutate(Model = "Model 1: Full"),
  tidy(model2) %>% mutate(Model = "Model 2: Stepwise"),
  tidy(model3) %>% mutate(Model = "Model 3: Interactions")
) %>%
  select(Model, term, estimate, std.error, p.value) %>%
  arrange(term, Model)

knitr::kable(coef_comparison, digits = 3,
             caption = "Coefficient Comparison Across Models")
Coefficient Comparison Across Models
Model term estimate std.error p.value
Model 1: Full (Intercept) -1.407 4.536 0.756
Model 2: Stepwise (Intercept) -2.969 3.153 0.346
Model 3: Interactions (Intercept) 95.957 41.826 0.022
Model 1: Full age 0.017 0.011 0.114
Model 2: Stepwise age 0.015 0.009 0.119
Model 3: Interactions age 0.029 0.012 0.013
Model 1: Full chas 1.222 0.745 0.101
Model 2: Stepwise chas 1.173 0.730 0.108
Model 1: Full dis 0.585 0.189 0.002
Model 2: Stepwise dis 0.579 0.188 0.002
Model 3: Interactions dis 0.403 0.161 0.012
Model 1: Full lstat_log -0.341 0.711 0.631
Model 3: Interactions lstat_log -8.194 6.753 0.225
Model 1: Full nox_log 19.222 3.180 0.000
Model 2: Stepwise nox_log 19.060 3.151 0.000
Model 3: Interactions nox_log 16.843 2.953 0.000
Model 1: Full ptratio 0.161 0.095 0.090
Model 2: Stepwise ptratio 0.154 0.094 0.100
Model 1: Full rad 0.490 0.123 0.000
Model 2: Stepwise rad 0.484 0.122 0.000
Model 3: Interactions rad 0.465 0.115 0.000
Model 1: Full rm 0.842 0.460 0.067
Model 2: Stepwise rm 1.002 0.320 0.002
Model 3: Interactions rm -25.299 10.756 0.019
Model 3: Interactions rm_lstat_i 1.187 1.062 0.264
Model 3: Interactions rm_sq 1.742 0.664 0.009
Model 1: Full zn -0.050 0.027 0.065
Model 2: Stepwise zn -0.052 0.027 0.056
# Exponentiate coefficients to get odds ratios
cat("\n### Odds Ratios for Model 2 (Stepwise):\n")
## 
## ### Odds Ratios for Model 2 (Stepwise):
odds_ratios <- exp(coef(model2))
print(round(odds_ratios, 3))
##  (Intercept)           zn         chas           rm          age          dis 
## 5.100000e-02 9.500000e-01 3.231000e+00 2.725000e+00 1.015000e+00 1.785000e+00 
##          rad      ptratio      nox_log 
## 1.623000e+00 1.166000e+00 1.894292e+08
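
The nox_log odds ratio looks extreme because it corresponds to a full one-unit change in log(nox), which is enormous on the original scale. A more interpretable quantity (a sketch; the 10% increment is an illustrative choice, not part of the original analysis):

# Sketch: odds multiplier for a 10% increase in nox under Model 2
exp(coef(model2)["nox_log"] * log(1.10))
# about 6.2, using the nox_log coefficient of roughly 19.06 shown above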

Actual Coefficient Signs in Our Models

# Extract signs and significance for Model 2 (most parsimonious)
model2_coefs <- tidy(model2) %>%
  filter(term != "(Intercept)") %>%
  mutate(
    sign = ifelse(estimate > 0, "Positive", "Negative"),
    significant = ifelse(p.value < 0.05, "Yes", "No")
  ) %>%
  select(term, estimate, sign, significant, p.value)

knitr::kable(model2_coefs, digits = 3,
             caption = "Model 2 Coefficient Analysis")
Model 2 Coefficient Analysis
term estimate sign significant p.value
zn -0.052 Negative No 0.056
chas 1.173 Positive No 0.108
rm 1.002 Positive Yes 0.002
age 0.015 Positive No 0.119
dis 0.579 Positive Yes 0.002
rad 0.484 Positive Yes 0.000
ptratio 0.154 Positive No 0.100
nox_log 19.060 Positive Yes 0.000

Do the Coefficients Make Sense?

Expected Relationships:

  1. rad (highway access) - POSITIVE coefficient expected

    • Better highway access → more crime

    • Makes sense: easier escape routes, more transient population

  2. nox_log (pollution) - POSITIVE coefficient expected

    • Higher pollution → industrial areas → more crime

    • Makes sense: industrial areas often have higher crime

  3. lstat_log (lower status %) - POSITIVE coefficient expected

    • Lower socioeconomic status → more crime

    • Makes sense: economic hardship correlates with crime

  4. rm (rooms) - NEGATIVE coefficient expected

    • Larger homes → wealthier areas → less crime

    • Makes sense: housing size indicates affluence

  5. dis (distance to employment) - NEGATIVE coefficient expected

    • Further from city centers → suburban areas → less crime

    • Makes sense: urban cores typically have higher crime

  6. age (old housing %) - POSITIVE coefficient expected

    • Older housing stock → deteriorated neighborhoods → more crime

    • This holds in our models: age has a positive coefficient in all three, though it is significant only in Model 3

Counter-Intuitive Results to Watch For:

If any of these appear opposite to expectations, consider the following (a quick diagnostic sketch follows this list):

  • Is the variable capturing something else due to multicollinearity?

  • Could there be a non-linear relationship we’re missing?

  • Does it make sense in the context of Boston neighborhoods?
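
For the first point, a quick diagnostic is to compare each predictor’s sign in a univariate fit against its sign in the multivariable model; a flipped sign hints at confounding. A minimal sketch for Model 2’s predictors:

# Sketch: does each predictor keep its univariate sign in Model 2?
# FALSE entries flag coefficients whose sign may be driven by other predictors.
predictors <- setdiff(names(coef(model2)), "(Intercept)")
sapply(predictors, function(v) {
  uni <- glm(reformulate(v, response = "target"),
             data = crime_train, family = binomial())
  sign(coef(uni)[v]) == sign(coef(model2)[v])
})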


Model Comparison Summary

# Compare model performance metrics
model_comparison <- data.frame(
  Model = c("Model 1: Full", "Model 2: Stepwise", "Model 3: Interactions"),
  Variables = c(
    length(coef(model1)) - 1,
    length(coef(model2)) - 1,
    length(coef(model3)) - 1
  ),
  AIC = c(AIC(model1), AIC(model2), AIC(model3)),
  BIC = c(BIC(model1), BIC(model2), BIC(model3)),
  Deviance = c(deviance(model1), deviance(model2), deviance(model3))
)

knitr::kable(model_comparison, digits = 2,
             caption = "Model Comparison Statistics")
Model Comparison Statistics
Model Variables AIC BIC Deviance
Model 1: Full 9 234.38 275.82 214.38
Model 2: Stepwise 8 232.61 269.91 214.61
Model 3: Interactions 8 232.72 270.01 214.72

Why Keep Counter-Intuitive Variables?

If a variable has a counter-intuitive coefficient but:

  • Is statistically significant (p < 0.05)

  • Improves model fit (lower AIC/BIC)

  • Makes sense when controlling for other variables

Then yes, keep it, and explain that:

  1. The simple bivariate relationship may be misleading

  2. After controlling for confounders, the true effect emerges

  3. The data tells a more complex story than intuition suggests


Preparing Predictions for Model Evaluation (Part 4)

# Generate predictions for all three models on training data
pred_prob1 <- predict(model1, crime_train, type = "response")
pred_prob2 <- predict(model2, crime_train, type = "response")
pred_prob3 <- predict(model3, crime_train, type = "response")

# Generate predictions for evaluation data
eval_pred_prob1 <- predict(model1, crime_eval, type = "response")
eval_pred_prob2 <- predict(model2, crime_eval, type = "response")
eval_pred_prob3 <- predict(model3, crime_eval, type = "response")

# Apply 0.5 threshold for classifications
pred_class1 <- ifelse(pred_prob1 > 0.5, 1, 0)
pred_class2 <- ifelse(pred_prob2 > 0.5, 1, 0)
pred_class3 <- ifelse(pred_prob3 > 0.5, 1, 0)

eval_pred_class1 <- ifelse(eval_pred_prob1 > 0.5, 1, 0)
eval_pred_class2 <- ifelse(eval_pred_prob2 > 0.5, 1, 0)
eval_pred_class3 <- ifelse(eval_pred_prob3 > 0.5, 1, 0)

cat("Predictions prepared for Part 4 model evaluation\n")
## Predictions prepared for Part 4 model evaluation

Part 4 - Model Selection

Our selected model is Model 2, for the following reasons:

  1. Predictive Performance: Model 3 was outperformed in every confusion-matrix metric: accuracy, precision, sensitivity, specificity, and F1 score. Model 1 performed best, but usually only slightly ahead of Model 2. The ROC curve for Model 3 also had an irregular shape, implying that its predictions are unreliable over part of its range.

  2. Variables Align with Theoretical Effect: Model 3 has the most significant variables, 6 of 8. Model 1 has only 3 of 9 (although a few variables were close to the 0.05 threshold). Model 2 has 4 of 8 significant variables, with one sitting at 0.056 and barely missing the cut.

  3. Statistical Performance: Model 2 had the best (lowest) AIC and BIC of the three models, while Model 1 had the worst. This carries extra weight because Model 2 is a reduced version of Model 1: if the reduced model fits better, then unless it performs notably poorly elsewhere, it should be preferred over the full model. Model 3 performed respectably on AIC and BIC but still trailed Model 2.

By ranking first or second in every comparison, Model 2 emerged as our most consistently strong model. It tied with Model 3 as our most parsimonious model with strong predictive ability. While the significant-variable counts favor Model 3, we also need to consider purpose: this is a choice between predictive and exploratory analysis. Model 3 might have been chosen if we wanted to study the relationship between our factors and high crime rates. Instead, we want to predict whether or not a neighborhood will have a high crime rate, and Model 2 is the simplest model that predicts that most accurately.

For binary logistic regression, we have the following criteria for selecting a model by goodness of fit:

  • Akaike Information Criterion (AIC)
  • Bayesian Information Criterion (BIC)

For the numeric criteria, if the values are similar, we will consider other factors as well to sway our decision, since similar values imply similar goodness of fit. These values are typically examined while the models are being built, but we review them here:

knitr::kable(model_comparison, digits = 2,
             caption = "Model Comparison Statistics")
Model Comparison Statistics
Model Variables AIC BIC Deviance
Model 1: Full 9 234.38 275.82 214.38
Model 2: Stepwise 8 232.61 269.91 214.61
Model 3: Interactions 8 232.72 270.01 214.72

Lower is better for AIC and BIC. Accordingly, Model 2 is the best with the lowest values in each, but not by a huge margin over Model 3.
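
The gaps are easier to see as differences from the best model (a small sketch using the comparison table built earlier):

# Sketch: AIC/BIC differences relative to the best (lowest) model
delta <- transform(model_comparison,
                   dAIC = AIC - min(AIC),
                   dBIC = BIC - min(BIC))
delta[, c("Model", "dAIC", "dBIC")]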

A ROC curve, along with the AUC, can be used to test how well the models make predictions. Let’s see how each model looks.

library(pROC)  # provides roc() and its plot method (the source of the messages below)

roc_curve <- function(model, title) {
  probabilities <- predict(model, type = "response")
  roc_obj <- roc(response = crime_train$target, predictor = probabilities)
  return(plot(roc_obj, main = title, col = "blue", lwd = 2, print.auc = TRUE))
}

roc1 <- roc_curve(model1, "ROC Curve for Model 1")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

roc2 <- roc_curve(model2, "ROC Curve for Model 2")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

roc3 <- roc_curve(model3, "ROC Curve for Model 3")
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

The AUC is nearly identical for all three models, at over 0.96. All show outstanding discrimination with a large area under the curve, but Model 3’s curve dips in the middle, suggesting a higher false-positive rate over part of its operating range. It is still not a bad model, but something to consider later.
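
For a direct numeric comparison, the AUC values can also be pulled from the pROC objects returned above (a sketch):

# Sketch: extract each model's AUC from its ROC object
sapply(list(Model1 = roc1, Model2 = roc2, Model3 = roc3), auc)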

Additional metrics we can check are:

  1. accuracy
  2. classification error rate
  3. precision
  4. sensitivity
  5. specificity
  6. F1 score
library(caret)  # provides confusionMatrix(), sensitivity(), specificity()

conf_matrix_summary <- function(predicted, title) {
  actual <- crime_train$target
  conf_matrix_result <- confusionMatrix(predicted, actual)
  accuracy_value <- conf_matrix_result$overall['Accuracy']
  precision_value <- conf_matrix_result$byClass['Pos Pred Value']
  sensitivity_value <- sensitivity(predicted, actual)
  specificity_value <- specificity(predicted, actual)
  f1_value <- conf_matrix_result$byClass['F1']
  cat(
    "\n\n", title,
    "\nAccuracy: ", accuracy_value, 
    "\nClassification error rate: ", 1 - accuracy_value,
    "\nPrecision: ", precision_value,
    "\nSensitivity: ", sensitivity_value,
    "\nSpecificity: ", specificity_value,
    "\nF1 score: ", f1_value
  )
}

conf_matrix_summary(as.factor(pred_class1), "Model 1 Summary")
## 
## 
##  Model 1 Summary 
## Accuracy:  0.8905579 
## Classification error rate:  0.1094421 
## Precision:  0.8811475 
## Sensitivity:  0.907173 
## Specificity:  0.8733624 
## F1 score:  0.8939709
conf_matrix_summary(as.factor(pred_class2), "Model 2 Summary")
## 
## 
##  Model 2 Summary 
## Accuracy:  0.8862661 
## Classification error rate:  0.1137339 
## Precision:  0.8739837 
## Sensitivity:  0.907173 
## Specificity:  0.8646288 
## F1 score:  0.8902692
conf_matrix_summary(as.factor(pred_class3), "Model 3 Summary")
## 
## 
##  Model 3 Summary 
## Accuracy:  0.8669528 
## Classification error rate:  0.1330472 
## Precision:  0.8486056 
## Sensitivity:  0.8987342 
## Specificity:  0.8340611 
## F1 score:  0.8729508

Model 1 performed the best in almost every metric; it was fit with the most variables, so that is no surprise. Model 3 performed the worst in every metric we have evaluated so far, making it our weakest candidate, while Models 1 and 2 performed more evenly, and Model 2 retains the lowest AIC and BIC.

We will now use McFadden’s pseudo-R-squared to compare the models. It is based on log-likelihood and is especially useful for nested models, which Models 1 and 2 are.

library(DescTools)  # assumed source of PseudoR2()
PseudoR2(model1, which = "McFadden")
##  McFadden 
## 0.6680762
PseudoR2(model2, which = "McFadden")
##  McFadden 
## 0.6677169

Model 1’s value (0.668) is marginally higher than Model 2’s (0.667), indicating a slightly better fit.

We have established that in terms of fit, the order goes Model 1 > Model 2 > Model 3. We will now determine whether or not Model 1 is truly our best model.

anova_m1_2 <- anova(model2, model1, test = "Chisq")
anova_m1_2
## Analysis of Deviance Table
## 
## Model 1: target ~ zn + chas + rm + age + dis + rad + ptratio + nox_log
## Model 2: target ~ zn + chas + rm + age + dis + rad + ptratio + nox_log + 
##     lstat_log
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       457     214.61                     
## 2       456     214.38  1  0.23207     0.63

An ANOVA (analysis of deviance) test can be used to gauge how significant the difference between two models is when one is nested in the other. The deviance difference of 0.232, with p = 0.63, is far from significant and suggests that Model 1 gains little from the extra variable.

Interpretation:

The differing variable, lstat_log, is now the point of contention. This is the log-transformed lstat variable, the percentage of the population of lower socioeconomic status. Theoretically, the higher this percentage, the more crime is expected. Yet in Model 1 its coefficient is negative, which gives some reason to doubt that Model 1 is actually our best choice. Model 1’s improvement in predictive ability (confusion matrix) over Model 2 is small, while its goodness of fit (AIC/BIC) is actually worse. Model 2 would be the parsimonious pick even if we ignored the odd implication of the lstat_log predictor.

# Final predictions from the chosen model (Model 2)
submission <- crime_eval_data
submission$TARGET_CRIME <- eval_pred_prob2
submission$TARGET_CLASS <- eval_pred_class2

# Save the predictions to a new file
write.csv(submission, "crime-predictions.csv", row.names = FALSE)

cat("Final predictions saved.\n")
## Final predictions saved.