LOGISTIC REGRESSION

Part 1 - Data Exploration

There are 466 rows and 13 columns found in the crime training data set. Each column represents a variable, and there are 12 predictor variables and 1 response variable.

Response Variable

target: whether the crime rate is above the median crime rate (1) or not (0) (response variable)

Predictor Variables

zn: proportion of residential land zoned for large lots (over 25000 square feet)

indus: proportion of non-retail business acres per suburb

chas: a dummy variable for whether the suburb borders the Charles River (1) or not (0)

nox: nitrogen oxides concentration (parts per 10 million)

rm: average number of rooms per dwelling

age: proportion of owner-occupied units built prior to 1940

dis: weighted mean of distances to five Boston employment centers

rad: index of accessibility to radial highways

tax: full-value property-tax rate per $10,000

ptratio: pupil-teacher ratio by town

lstat: lower status of the population (percent)

medv: median value of owner-occupied homes in $1000s

Missing Data

Is there are missing data in training data set?

##    Variable miss_data_Values
## 1        zn                0
## 2     indus                0
## 3      chas                0
## 4       nox                0
## 5        rm                0
## 6       age                0
## 7       dis                0
## 8       rad                0
## 9       tax                0
## 10  ptratio                0
## 11    lstat                0
## 12     medv                0
## 13   target                0

There are no missing data values found in any of the predictor variables and the response variable.

Descriptive Statistics - Part 1
	vars	n	mean	sd	median	trimmed
zn	1	466	11.5772532	23.3646511	0.00000	5.3542781
indus	2	466	11.1050215	6.8458549	9.69000	10.9082353
chas	3	466	0.0708155	0.2567920	0.00000	0.0000000
nox	4	466	0.5543105	0.1166667	0.53800	0.5442684
rm	5	466	6.2906738	0.7048513	6.21000	6.2570615
age	6	466	68.3675966	28.3213784	77.15000	70.9553476
dis	7	466	3.7956929	2.1069496	3.19095	3.5443647
rad	8	466	9.5300429	8.6859272	5.00000	8.6978610
tax	9	466	409.5021459	167.9000887	334.50000	401.5080214
ptratio	10	466	18.3984979	2.1968447	18.90000	18.5970588
lstat	11	466	12.6314592	7.1018907	11.35000	11.8809626
medv	12	466	22.5892704	9.2396814	21.20000	21.6304813
target	13	466	0.4914163	0.5004636	0.00000	0.4893048

Descriptive Statistics - Part 2
	mad	min	max	range	skew	kurtosis	se
zn	0.0000000	0.0000	100.0000	100.0000	2.1768152	3.8135765	1.0823466
indus	9.3403800	0.4600	27.7400	27.2800	0.2885450	-1.2432132	0.3171281
chas	0.0000000	0.0000	1.0000	1.0000	3.3354899	9.1451313	0.0118957
nox	0.1334340	0.3890	0.8710	0.4820	0.7463281	-0.0357736	0.0054045
rm	0.5166861	3.8630	8.7800	4.9170	0.4793202	1.5424378	0.0326516
age	30.0226500	2.9000	100.0000	97.1000	-0.5777075	-1.0098814	1.3119625
dis	1.9144814	1.1296	12.1265	10.9969	0.9988926	0.4719679	0.0976026
rad	1.4826000	1.0000	24.0000	23.0000	1.0102788	-0.8619110	0.4023678
tax	104.5233000	187.0000	711.0000	524.0000	0.6593136	-1.1480456	7.7778214
ptratio	1.9273800	12.6000	22.0000	9.4000	-0.7542681	-0.4003627	0.1017669
lstat	7.0720020	1.7300	37.9700	36.2400	0.9055864	0.5033688	0.3289887
medv	6.0045300	5.0000	50.0000	45.0000	1.0766920	1.3737825	0.4280200
target	0.0000000	0.0000	1.0000	1.0000	0.0342293	-2.0031131	0.0231835

From the box plot of the training data, we can see that the many of the variables are skewed in their distribution. Variables such as medv, rm and zn have an enormous amount of outliers. However, let’s group the variables by the target variable to test if it plays a factor.

Through grouping the predictors in our training data set with the response variable, we can see that the variables are correlated to the target. This means that while before we only saw that there were a ton of outliers in some of our variables, they are related to the data in a significant way and should be included in our modeling.

Correlation

To check if the data is correlated to the target variable or to any other variable, we can begin by creating a correlation matrix.

Positive and Negative Correlations with Data

Positive Correlations with Target: Variables zn and chas show positive correlations with crime rate. This indicates that areas with larger residential lots and proximity to the Charles River tend to have higher crime rates.

Negative Correlations with Target: Variables indus, nox, age, dis, rad, tax, ptratio, lstat, and medv show negative correlations with crime rate. This suggests that areas with higher industrial presence, pollution levels, older housing stock, greater distance from employment centers, better highway access, higher tax rates, higher pupil-teacher ratios, lower socio-economic status, and higher median home values tend to have lower crime rates.

Notably the correlation matrix reveals strong correlations between tax and rad at 0.91 and many other variables have significant correlation at 0.70, so these variable relationships should be taken into consideration building predictive models to avoid potential multicollinearity issues.

To identify these high correlations between variables, we devised a list of variable relationships to consider:

var1	var2	correlation
tax	rad	0.906
rad	tax	0.906
dis	nox	-0.769
nox	dis	-0.769
nox	indus	0.760
indus	nox	0.760
dis	age	-0.751
age	dis	-0.751
medv	lstat	-0.736
lstat	medv	-0.736
age	nox	0.735
nox	age	0.735
tax	indus	0.732
indus	tax	0.732
target	nox	0.726
nox	target	0.726
medv	rm	0.705
rm	medv	0.705
dis	indus	-0.704
indus	dis	-0.704

Part 2 – Data Preparation

The goal of this step was to clean the data and prepare it for accurate modeling. We checked for missing values, improved variable quality through transformations and new features, and reduced overlap between predictors to avoid multicollinearity.

Missing Data: We checked each column in the training dataset for missing values. All results were zero, meaning the dataset was complete and required no imputation.

Transformations: Two variables, NOX (air pollution) and LSTAT (lower-status population), were right-skewed. Log transformations were applied to both to reduce skew and make relationships more linear.

Feature Engineering: Two new variables were created: - rm_sq: RM squared (RM²) to capture a curved relationship between housing size and crime - rm_lstat_i: RM × LSTAT to test whether the impact of housing size changes across different income levels

Multicollinearity Reduction: Several variables were highly correlated. To improve model stability, we: - Kept RAD, dropped TAX (both measured road access, correlation = 0.91) - Kept NOX, dropped INDUS (both described industrial activity) - Kept LSTAT, dropped MEDV (both reflected neighborhood wealth)

# Create transformed training data
crime_train <- crime_train_data %>%
  mutate(
    target = factor(target, levels = c(0, 1)),
    nox_log = log(nox),
    lstat_log = log1p(lstat)
  ) %>%
  select(-tax, -indus, -medv, -nox, -lstat) %>%
  mutate(
    rm_sq = rm^2,
    rm_lstat_i = rm * lstat_log
  )

# Create transformed evaluation data
crime_eval <- crime_eval_data %>%
  mutate(
    nox_log = log(nox),
    lstat_log = log1p(lstat)
  ) %>%
  select(-tax, -indus, -medv, -nox, -lstat) %>%
  mutate(
    rm_sq = rm^2,
    rm_lstat_i = rm * lstat_log
  )

# Check VIF on core variables (excluding interaction terms)
library(car)
vif_values <- vif(glm(target ~ zn + chas + rm + age + dis + rad + ptratio + 
                      nox_log + lstat_log, 
                      data = crime_train, family = binomial()))

knitr::kable(data.frame(Variable = names(vif_values), VIF = vif_values),
             caption = "Variance Inflation Factors After Preparation",
             digits = 2, row.names = FALSE)

Variance Inflation Factors After Preparation
Variable	VIF
zn	1.74
chas	1.14
rm	2.90
age	2.04
dis	3.20
rad	1.23
ptratio	1.60
nox_log	3.12
lstat_log	3.44

After these adjustments, all Variance Inflation Factor (VIF) values are below 5, confirming that multicollinearity has been resolved.

Final Dataset

The prepared dataset includes the following predictors:

Core variables: zn, chas, rm, age, dis, rad, ptratio

Transformed variables: nox_log, lstat_log

Engineered features (for Model 3): rm_sq, rm_lstat_i

The response variable is target (1 = high crime, 0 = low crime).

Models 1 and 2 use the core and transformed variables, while Model 3 additionally incorporates the engineered interaction and polynomial terms to test non-linear relationships.

Part 3 - BUILD MODELS

Model 1: Full Model with All Prepared Variables

This model includes all core transformed variables

# Full model with all prepared predictors

model1 <- glm(target ~ zn + chas + rm + age + dis + rad + ptratio + 

              nox_log + lstat_log, 

              data = crime_train, 

              family = binomial(link = "logit"))

summary(model1)

## 
## Call:
## glm(formula = target ~ zn + chas + rm + age + dis + rad + ptratio + 
##     nox_log + lstat_log, family = binomial(link = "logit"), data = crime_train)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.40666    4.53605  -0.310  0.75648    
## zn          -0.04957    0.02689  -1.843  0.06527 .  
## chas         1.22174    0.74450   1.641  0.10079    
## rm           0.84225    0.46020   1.830  0.06722 .  
## age          0.01738    0.01098   1.582  0.11356    
## dis          0.58463    0.18870   3.098  0.00195 ** 
## rad          0.48961    0.12264   3.992 6.55e-05 ***
## ptratio      0.16122    0.09508   1.696  0.08996 .  
## nox_log     19.22228    3.17966   6.045 1.49e-09 ***
## lstat_log   -0.34144    0.71117  -0.480  0.63115    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 214.38  on 456  degrees of freedom
## AIC: 234.38
## 
## Number of Fisher Scoring iterations: 8

# Check model diagnostics

cat("\nAIC:", AIC(model1))

## 
## AIC: 234.3816

cat("\nBIC:", BIC(model1))

## 
## BIC: 275.8234

Rationale: This baseline model uses all available predictors to establish a benchmark. It helps us understand which variables are significant before attempting reduction.

Model 2: Stepwise Selection Model

This model uses automated variable selection to find the optimal subset of predictors.

# Stepwise selection (both directions)

model2 <- step(model1, direction = "both", trace = 0)

summary(model2)

## 
## Call:
## glm(formula = target ~ zn + chas + rm + age + dis + rad + ptratio + 
##     nox_log, family = binomial(link = "logit"), data = crime_train)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.968582   3.153009  -0.942  0.34644    
## zn          -0.051578   0.026991  -1.911  0.05601 .  
## chas         1.172696   0.730163   1.606  0.10826    
## rm           1.002436   0.320050   3.132  0.00174 ** 
## age          0.014665   0.009396   1.561  0.11857    
## dis          0.579245   0.187908   3.083  0.00205 ** 
## rad          0.484114   0.122096   3.965 7.34e-05 ***
## ptratio      0.153904   0.093670   1.643  0.10037    
## nox_log     19.059526   3.150708   6.049 1.45e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 214.61  on 457  degrees of freedom
## AIC: 232.61
## 
## Number of Fisher Scoring iterations: 8

cat("\nAIC:", AIC(model2))

## 
## AIC: 232.6136

cat("\nBIC:", BIC(model2))

## 
## BIC: 269.9113

# Show which variables were retained

cat("\nVariables retained in Model 2:\n")

## 
## Variables retained in Model 2:

names(coef(model2))

## [1] "(Intercept)" "zn"          "chas"        "rm"          "age"        
## [6] "dis"         "rad"         "ptratio"     "nox_log"

Rationale: Stepwise selection balances model complexity with performance by automatically removing non-significant predictors based on AIC. This often produces a more parsimonious model.

Model 3: Theory-Driven Model with Interactions

This model is based on domain knowledge and includes interaction terms that make theoretical sense.

# Model with key predictors and interactions

model3 <- glm(target ~ rad + nox_log + lstat_log + rm + rm_sq + 

              rm_lstat_i + dis + age, 

              data = crime_train, 

              family = binomial(link = "logit"))

summary(model3)

## 
## Call:
## glm(formula = target ~ rad + nox_log + lstat_log + rm + rm_sq + 
##     rm_lstat_i + dis + age, family = binomial(link = "logit"), 
##     data = crime_train)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  95.95709   41.82573   2.294  0.02178 *  
## rad           0.46541    0.11477   4.055 5.01e-05 ***
## nox_log      16.84314    2.95326   5.703 1.18e-08 ***
## lstat_log    -8.19368    6.75289  -1.213  0.22499    
## rm          -25.29874   10.75599  -2.352  0.01867 *  
## rm_sq         1.74221    0.66427   2.623  0.00872 ** 
## rm_lstat_i    1.18709    1.06221   1.118  0.26375    
## dis           0.40312    0.16074   2.508  0.01214 *  
## age           0.02871    0.01160   2.475  0.01333 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 214.72  on 457  degrees of freedom
## AIC: 232.72
## 
## Number of Fisher Scoring iterations: 8

cat("\nAIC:", AIC(model3))

## 
## AIC: 232.7167

cat("\nBIC:", BIC(model3))

## 
## BIC: 270.0144

Rationale:

rad: Highway accessibility is strongly correlated with crime
nox_log: Air pollution indicates industrial/urban areas
lstat_log: Socioeconomic status is a key crime predictor
rm + rm_sq: Captures non-linear relationship between housing size and crime
rm_lstat_i: Tests if housing size effect varies by neighborhood wealth
dis: Distance from employment centers
age: Older housing stock may indicate different crime patterns

Coefficient Interpretation and Model Sense-Making

# Create a comparison table of coefficients

library(broom)

coef_comparison <- bind_rows(

  tidy(model1) %>% mutate(Model = "Model 1: Full"),

  tidy(model2) %>% mutate(Model = "Model 2: Stepwise"),

  tidy(model3) %>% mutate(Model = "Model 3: Interactions")

) %>%

  select(Model, term, estimate, std.error, p.value) %>%

  arrange(term, Model)

knitr::kable(coef_comparison, digits = 3, 

             caption = "Coefficient Comparison Across Models")

Coefficient Comparison Across Models
Model	term	estimate	std.error	p.value
Model 1: Full	(Intercept)	-1.407	4.536	0.756
Model 2: Stepwise	(Intercept)	-2.969	3.153	0.346
Model 3: Interactions	(Intercept)	95.957	41.826	0.022
Model 1: Full	age	0.017	0.011	0.114
Model 2: Stepwise	age	0.015	0.009	0.119
Model 3: Interactions	age	0.029	0.012	0.013
Model 1: Full	chas	1.222	0.745	0.101
Model 2: Stepwise	chas	1.173	0.730	0.108
Model 1: Full	dis	0.585	0.189	0.002
Model 2: Stepwise	dis	0.579	0.188	0.002
Model 3: Interactions	dis	0.403	0.161	0.012
Model 1: Full	lstat_log	-0.341	0.711	0.631
Model 3: Interactions	lstat_log	-8.194	6.753	0.225
Model 1: Full	nox_log	19.222	3.180	0.000
Model 2: Stepwise	nox_log	19.060	3.151	0.000
Model 3: Interactions	nox_log	16.843	2.953	0.000
Model 1: Full	ptratio	0.161	0.095	0.090
Model 2: Stepwise	ptratio	0.154	0.094	0.100
Model 1: Full	rad	0.490	0.123	0.000
Model 2: Stepwise	rad	0.484	0.122	0.000
Model 3: Interactions	rad	0.465	0.115	0.000
Model 1: Full	rm	0.842	0.460	0.067
Model 2: Stepwise	rm	1.002	0.320	0.002
Model 3: Interactions	rm	-25.299	10.756	0.019
Model 3: Interactions	rm_lstat_i	1.187	1.062	0.264
Model 3: Interactions	rm_sq	1.742	0.664	0.009
Model 1: Full	zn	-0.050	0.027	0.065
Model 2: Stepwise	zn	-0.052	0.027	0.056

# Exponentiate coefficients to get odds ratios

cat("\n### Odds Ratios for Model 2 (Stepwise):\n")

## 
## ### Odds Ratios for Model 2 (Stepwise):

odds_ratios <- exp(coef(model2))

print(round(odds_ratios, 3))

##  (Intercept)           zn         chas           rm          age          dis 
## 5.100000e-02 9.500000e-01 3.231000e+00 2.725000e+00 1.015000e+00 1.785000e+00 
##          rad      ptratio      nox_log 
## 1.623000e+00 1.166000e+00 1.894292e+08

Actual Coefficient Signs in Our Models

# Extract signs and significance for Model 2 (most parsimonious)
model2_coefs <- tidy(model2) %>%
  filter(term != "(Intercept)") %>%
  mutate(
    sign = ifelse(estimate > 0, "Positive", "Negative"),
    significant = ifelse(p.value < 0.05, "Yes", "No")
  ) %>%
  select(term, estimate, sign, significant, p.value)

knitr::kable(model2_coefs, digits = 3,
             caption = "Model 2 Coefficient Analysis")

Model 2 Coefficient Analysis
term	estimate	sign	significant	p.value
zn	-0.052	Negative	No	0.056
chas	1.173	Positive	No	0.108
rm	1.002	Positive	Yes	0.002
age	0.015	Positive	No	0.119
dis	0.579	Positive	Yes	0.002
rad	0.484	Positive	Yes	0.000
ptratio	0.154	Positive	No	0.100
nox_log	19.060	Positive	Yes	0.000

Do the Coefficients Make Sense?

Expected Relationships:

rad (highway access) - POSITIVE coefficient expected
- Better highway access → more crime
- Makes sense: easier escape routes, more transient population
nox_log (pollution) - POSITIVE coefficient expected
- Higher pollution → industrial areas → more crime
- Makes sense: industrial areas often have higher crime
lstat_log (lower status %) - POSITIVE coefficient expected
- Lower socioeconomic status → more crime
- Makes sense: economic hardship correlates with crime
rm (rooms) - NEGATIVE coefficient expected
- Larger homes → wealthier areas → less crime
- Makes sense: housing size indicates affluence
dis (distance to employment) - NEGATIVE coefficient expected
- Further from city centers → suburban areas → less crime
- Makes sense: urban cores typically have higher crime
age (old housing %) - POSITIVE coefficient expected
- Older housing stock → deteriorated neighborhoods → more crime
- Check if this holds in your models

Counter-Intuitive Results to Watch For:

If any of these appear opposite to expectations, consider:

Is the variable capturing something else due to multicollinearity?
Could there be a non-linear relationship we’re missing?
Does it make sense in the context of Boston neighborhoods?

Model Comparison Summary

# Compare model performance metrics

model_comparison <- data.frame(

  Model = c("Model 1: Full", "Model 2: Stepwise", "Model 3: Interactions"),

  Variables = c(

    length(coef(model1)) - 1,

    length(coef(model2)) - 1,

    length(coef(model3)) - 1

  ),

  AIC = c(AIC(model1), AIC(model2), AIC(model3)),

  BIC = c(BIC(model1), BIC(model2), BIC(model3)),

  Deviance = c(deviance(model1), deviance(model2), deviance(model3))

)

knitr::kable(model_comparison, digits = 2,

             caption = "Model Comparison Statistics")

Model Comparison Statistics
Model	Variables	AIC	BIC	Deviance
Model 1: Full	9	234.38	275.82	214.38
Model 2: Stepwise	8	232.61	269.91	214.61
Model 3: Interactions	8	232.72	270.01	214.72

Why Keep Counter-Intuitive Variables?

If a variable has a counter-intuitive coefficient but:

Is statistically significant (p < 0.05)
Improves model fit (lower AIC/BIC)
Makes sense when controlling for other variables

Then YES, keep it. Explain to the boss that:

The simple bivariate relationship may be misleading
After controlling for confounders, the true effect emerges
The data tells a more complex story than intuition suggests

Preparing Predictions for Model Evaluation (Part 4)

# Generate predictions for all three models on training data
pred_prob1 <- predict(model1, crime_train, type = "response")
pred_prob2 <- predict(model2, crime_train, type = "response")
pred_prob3 <- predict(model3, crime_train, type = "response")

# Generate predictions for evaluation data
eval_pred_prob1 <- predict(model1, crime_eval, type = "response")
eval_pred_prob2 <- predict(model2, crime_eval, type = "response")
eval_pred_prob3 <- predict(model3, crime_eval, type = "response")

# Apply 0.5 threshold for classifications
pred_class1 <- ifelse(pred_prob1 > 0.5, 1, 0)
pred_class2 <- ifelse(pred_prob2 > 0.5, 1, 0)
pred_class3 <- ifelse(pred_prob3 > 0.5, 1, 0)

eval_pred_class1 <- ifelse(eval_pred_prob1 > 0.5, 1, 0)
eval_pred_class2 <- ifelse(eval_pred_prob2 > 0.5, 1, 0)
eval_pred_class3 <- ifelse(eval_pred_prob3 > 0.5, 1, 0)

cat("Predictions prepared for Part 4 model evaluation\n")

## Predictions prepared for Part 4 model evaluation

Part 4 - Model Selection

Model Selection Our selected model is Model 2.

Predictive Performance: Model 3 was outperformed in every confusion matrix related metric such as precision, specificity, accuracy, sensitivity and F1 score. Model 1 performed the best, but was usually only slightly ahead of Model 2. The ROC curve for Model 3 had an irregular shape implying it predicts especially poorly sometimes.
Variables Align with Theoretical Effect: Model 3 has the most significant variables with 6/8. Model 1 only has 3/9 (although a few variables were close to the 0.05 threshold). Model 2 has 4/8 significant variables with one sitting at 0.056 and barely missing the cut.
Statistical Performance: Model 2 had the best AIC and BIC of the models compared while Model 1 was the worst. This is extra important because Model 2 is a reduced version of Model 1. Being the better fit indicates that unless Model 2 performed notably poorly elsewhere, it should be ahead of Model 1 as a choice. Model 3 performed respectably in AIC and BIC, but was still behind Model 2.

Model 2, by means of being top 2 or the best in every comparison, emerged as the most consistently good model we had. It was tied with Model 3 as our most parsimonious model with strong predictive ability. While the significant variable metrics boost Model 3, we need to consider purpose. We can view our decision as predictive vs. exploratory analysis. Model 3 might have been chosen if we wanted to study the relationship between our factors and high crime rates. Instead, we want to predict whether or not a neighborhood will have high crime rates. This means that Model 2 would be our simplest model that allows us to most accurately predict a neighborhood’s crime rate.

For binary logistic regression, we have the following criteria for selecting a model by goodness of fit:

Akaike Information Criterion (AIC)
Bayesian Information Criterion (BIC)

For numeric criteria, if the values are similar, we will consider other factors as well to sway our decision. Similar values imply that the models may have similar goodness of fit levels. It is typical to examine these values while creating models, but we will review them here:

knitr::kable(model_comparison, digits = 2,

             caption = "Model Comparison Statistics")

Model Comparison Statistics
Model	Variables	AIC	BIC	Deviance
Model 1: Full	9	234.38	275.82	214.38
Model 2: Stepwise	8	232.61	269.91	214.61
Model 3: Interactions	8	232.72	270.01	214.72

Lower is better for AIC and BIC. Accordingly, Model 2 is the best with the lowest values in each, but not by a huge margin over Model 3.

A ROC curve, along with the AUC, can be used to test how well the models make predictions. Let’s see how each model looks.

roc_curve <- function(model, title) {
  probabilities <- predict(model, type = "response")
  roc_obj <- roc(response = crime_train$target, predictor = probabilities)

  return (plot(roc_obj, main = title,col = "blue", lwd = 2, print.auc = TRUE))
}

roc1 <- roc_curve(model1, "ROC Curve for Model 1")

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

roc2 <- roc_curve(model2, "ROC Curve for Model 2")

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

roc3 <- roc_curve(model3, "ROC Curve for Model 3")

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

The AUC is almost identical for all 3 models of over 0.96. They all have outstanding performance with a huge area under the curve, but Model 3 appears to have a high amount of false positives according to the depressed curve in the middle of the general curve. It’s still not a bad model, but something to consider later.

Additional metrics we can check are:

accuracy
classification error rate
precision
sensitivity
specificity
F1 score

conf_matrix_summary <- function(predicted, title) {
  actual <- crime_train$target
  conf_matrix_result <- confusionMatrix(predicted, actual)
  accuracy_value <- conf_matrix_result$overall['Accuracy']
  precision_value <- conf_matrix_result$byClass['Pos Pred Value']
  sensitivity_value <- sensitivity(predicted, actual)
  specificity_value <- specificity(predicted, actual)
  f1_value <- conf_matrix_result$byClass['F1']
  cat(
    "\n\n", title,
    "\nAccuracy: ", accuracy_value, 
    "\nClassification error rate: ", 1 - accuracy_value,
    "\nPrecision: ", precision_value,
    "\nSensitivity: ", sensitivity_value,
    "\nSpecificity: ", specificity_value,
    "\nF1 score: ", f1_value
  )
}

conf_matrix_summary(as.factor(pred_class1), "Model 1 Summary")

## 
## 
##  Model 1 Summary 
## Accuracy:  0.8905579 
## Classification error rate:  0.1094421 
## Precision:  0.8811475 
## Sensitivity:  0.907173 
## Specificity:  0.8733624 
## F1 score:  0.8939709

conf_matrix_summary(as.factor(pred_class2), "Model 2 Summary")

## 
## 
##  Model 2 Summary 
## Accuracy:  0.8862661 
## Classification error rate:  0.1137339 
## Precision:  0.8739837 
## Sensitivity:  0.907173 
## Specificity:  0.8646288 
## F1 score:  0.8902692

conf_matrix_summary(as.factor(pred_class3), "Model 3 Summary")

## 
## 
##  Model 3 Summary 
## Accuracy:  0.8669528 
## Classification error rate:  0.1330472 
## Precision:  0.8486056 
## Sensitivity:  0.8987342 
## Specificity:  0.8340611 
## F1 score:  0.8729508

Model 1 performed the best in almost every metric. Model 1 was fit with the most variables, so that is not a surprise. Model 3 performed the worst in every metric we evaluated so far. Model 2 has the lowest AIC and BIC and otherwise has performed similarly to Model 1. It is our worst fit while the other two models perform more evenly.

We will now use McFadden’s Pseudo-R-squared to compare the models. This uses log-likelihood and is especially useful when you have nested models which is true of Models 1 and 2.

PseudoR2(model1, which = "McFadden")

##  McFadden 
## 0.6680762

PseudoR2(model2, which = "McFadden")

##  McFadden 
## 0.6677169

Model 1 has a higher value of 0.668 than the 0.667 of Model 2 indicating that it is the better fit.

We have established that in terms of fit, the order goes Model 1 > Model 2 > Model 3. We will now determine whether or not Model 1 is truly our best model.

anova_m1_2 <- anova(model2, model1, test = "Chisq")
anova_m1_2

## Analysis of Deviance Table
## 
## Model 1: target ~ zn + chas + rm + age + dis + rad + ptratio + nox_log
## Model 2: target ~ zn + chas + rm + age + dis + rad + ptratio + nox_log + 
##     lstat_log
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       457     214.61                     
## 2       456     214.38  1  0.23207     0.63

An ANOVA test can be used to see how significant the difference between the models is when one model is a subset of the other. Our deviance value of 0.232 is fairly low and suggests that Model 1 does not gain much from having the extra variable.

Interpretation:

The different variable, lstat_log, is now the point of contention. This is the log transformed lstat variable which indicates the percentage of the population that is of lower status. Theoretically, the higher this percentage is, the more crime is expected since there is more incentive to commit crime. Model 1 treats this as a negative coefficient. This provides some reason to doubt that Model 1 is actually our best choice. The improvement in predictive ability of Model 1 (confusion matrix) is relatively small over Model 2 with goodness of fit (AIC/BIC) actually getting worse. Model 2 would be the parsimonious pick even if we ignore the odd implications of the lstat_log predictor.

#final predictions from chosen model
submission <- crime_eval_data
submission$TARGET_CRIME <- eval_pred_prob2
submission$TARGET_CLASS <- eval_pred_class2

#saving the predictions to a new file
write.csv(submission, "crime-predictions.csv", row.names = FALSE)

cat("Final predictions saved.\n")

## Final predictions saved.