Objective: to build a binary logistic regression model on the training data set to predict whether the neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You can only use the variables given to you (or, variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:
• zn: proportion of residential land zoned for large lots (over 25000 square feet) (predictor variable)
• indus: proportion of non-retail business acres per suburb (predictor variable)
• chas: a dummy var. for whether the suburb borders the Charles River (1) or not (0) (predictor variable)
• nox: nitrogen oxides concentration (parts per 10 million) (predictor variable)
• rm: average number of rooms per dwelling (predictor variable)
• age: proportion of owner-occupied units built prior to 1940 (predictor variable)
• dis: weighted mean of distances to five Boston employment centers (predictor variable)
• rad: index of accessibility to radial highways (predictor variable)
• tax: full-value property-tax rate per $10,000 (predictor variable)
• ptratio: pupil-teacher ratio by town (predictor variable)
• lstat: lower status of the population (percent) (predictor variable)
• medv: median value of owner-occupied homes in $1000s (predictor variable)
• target: whether the crime rate is above the median crime rate (1) or not (0) (response variable)
data <- read.csv("https://raw.githubusercontent.com/Angelogallardo05/DAta-621-HW3/refs/heads/main/crime-training-data_modified.csv")
## Rows: 466
## Columns: 13
## $ zn <dbl> 0, 0, 0, 30, 0, 0, 0, 0, 0, 80, 22, 0, 0, 22, 0, 0, 100, 20, 0…
## $ indus <dbl> 19.58, 19.58, 18.10, 4.93, 2.46, 8.56, 18.10, 18.10, 5.19, 3.6…
## $ chas <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox <dbl> 0.605, 0.871, 0.740, 0.428, 0.488, 0.520, 0.693, 0.693, 0.515,…
## $ rm <dbl> 7.929, 5.403, 6.485, 6.393, 7.155, 6.781, 5.453, 4.519, 6.316,…
## $ age <dbl> 96.2, 100.0, 100.0, 7.8, 92.2, 71.3, 100.0, 100.0, 38.1, 19.1,…
## $ dis <dbl> 2.0459, 1.3216, 1.9784, 7.0355, 2.7006, 2.8561, 1.4896, 1.6582…
## $ rad <int> 5, 5, 24, 6, 3, 5, 24, 24, 5, 1, 7, 5, 24, 7, 3, 3, 5, 5, 24, …
## $ tax <int> 403, 403, 666, 300, 193, 384, 666, 666, 224, 315, 330, 398, 66…
## $ ptratio <dbl> 14.7, 14.7, 20.2, 16.6, 17.8, 20.9, 20.2, 20.2, 20.2, 16.4, 19…
## $ lstat <dbl> 3.70, 26.82, 18.85, 5.19, 4.82, 7.67, 30.59, 36.98, 5.68, 9.25…
## $ medv <dbl> 50.0, 13.4, 15.4, 23.7, 37.9, 26.5, 5.0, 7.0, 22.2, 20.9, 24.8…
## $ target <int> 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0,…
The summary statistics reveal that nox (nitric oxides concentration) has a large range, which may indicate significant pollution levels in some areas.
## zn indus chas nox
## Min. : 0.00 Min. : 0.460 Min. :0.00000 Min. :0.3890
## 1st Qu.: 0.00 1st Qu.: 5.145 1st Qu.:0.00000 1st Qu.:0.4480
## Median : 0.00 Median : 9.690 Median :0.00000 Median :0.5380
## Mean : 11.58 Mean :11.105 Mean :0.07082 Mean :0.5543
## 3rd Qu.: 16.25 3rd Qu.:18.100 3rd Qu.:0.00000 3rd Qu.:0.6240
## Max. :100.00 Max. :27.740 Max. :1.00000 Max. :0.8710
## rm age dis rad
## Min. :3.863 Min. : 2.90 Min. : 1.130 Min. : 1.00
## 1st Qu.:5.887 1st Qu.: 43.88 1st Qu.: 2.101 1st Qu.: 4.00
## Median :6.210 Median : 77.15 Median : 3.191 Median : 5.00
## Mean :6.291 Mean : 68.37 Mean : 3.796 Mean : 9.53
## 3rd Qu.:6.630 3rd Qu.: 94.10 3rd Qu.: 5.215 3rd Qu.:24.00
## Max. :8.780 Max. :100.00 Max. :12.127 Max. :24.00
## tax ptratio lstat medv
## Min. :187.0 Min. :12.6 Min. : 1.730 Min. : 5.00
## 1st Qu.:281.0 1st Qu.:16.9 1st Qu.: 7.043 1st Qu.:17.02
## Median :334.5 Median :18.9 Median :11.350 Median :21.20
## Mean :409.5 Mean :18.4 Mean :12.631 Mean :22.59
## 3rd Qu.:666.0 3rd Qu.:20.2 3rd Qu.:16.930 3rd Qu.:25.00
## Max. :711.0 Max. :22.0 Max. :37.970 Max. :50.00
## target
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.4914
## 3rd Qu.:1.0000
## Max. :1.0000
nox is right-skewed, clustering around lower nitrogen oxide levels.dis is also right-skewed, suggesting most areas are near employment centers. rm has a bell-shaped distribution centered on average room counts, while age skews towards older homes.
numeric_vars <- names(data)[sapply(data, is.numeric)]
plots <- lapply(numeric_vars, function(var) {
ggplot(data, aes(x = .data[[var]])) +
geom_histogram(binwidth = diff(range(data[[var]], na.rm = TRUE)) / 30,
fill = "#0073C2FF", color = "black", alpha = 0.7) +
labs(title = paste("Histogram of", var), x = var, y = "Frequency") +
theme_minimal(base_size = 8) +
theme(plot.title = element_text(hjust = 0.5, size = 8, face = "bold"),
plot.margin = unit(c(1, 1, 1, 1), "lines"))
})
grid.arrange(grobs = plots, ncol = 4,
top = textGrob("Histograms of Each Variable", gp = gpar(fontsize = 10, fontface = "bold")))
Its intereseting that higher property tax rates are associated with higher crime, however, higher crime may occur in urban areas were property are typically more valuable. There is no significant difference between areas bordering the Charles River and crime rates.
data$target <- as.factor(data$target)
long_data <- data %>%
pivot_longer(cols = -target, names_to = "variable", values_to = "value")
p <- ggplot(long_data, aes(x = target, y = value, fill = variable)) +
geom_boxplot() +
facet_wrap(~ variable, scales = "free_y") + # Create separate plots for each variable
labs(title = "Boxplots of All Numeric Variables by Target",
x = "Target",
y = "Value") +
theme_minimal()
print(p)
Property tax (tax) and accessibility to radial highways(rad) are highly correlated, while, median home values (medv) is strongly negatively correlated with the lower status of the population (lstat). Indicating that lower socioeconomic status areas tend to have lower home values, which might relate to crime rates.
cor_matrix <- cor(data[, sapply(data, is.numeric)])
corrplot(cor_matrix,
method = "color", # Use color
col = colorRampPalette(c("red", "white", "blue"))(200),
type = "full",
addCoef.col = "black",
tl.col = "black",
tl.srt = 45,
number.cex = 0.7
)
Check to see there is no missing data
print(colSums(is.na(data)))
## zn indus chas nox rm age dis rad tax ptratio
## 0 0 0 0 0 0 0 0 0 0
## lstat medv target
## 0 0 0
Given that dis, nox, medc, and lstat are skewed, we will log transform to normalize it
data$log_dis <- log(data$dis)
data$log_nox <- log(data$nox)
data$log_medv <- log(data$medv)
data$log_lstat <- log(data$lstat)
remove the skewed variables to avoid multicollinearity
data_log <- data %>%
select(target, zn, indus, chas,rm, age, rad, tax, ptratio,
log_dis, log_nox, log_medv, log_lstat)
the first model indicates that age, rad, ptratio, log_dis, and log_nox are significant predicors of high crime risk, with positive associations with crime. Since rad and tax are highly correlated, we will remove the tax variable for our subsequent model. In addition, log_medv, zn, indus, chas, rm, and log_stat seem to be marginally significant or contribute less to the models predictive power.
model1 <- glm(target ~ ., family = binomial, data = data_log)
print(summary(model1))
##
## Call:
## glm(formula = target ~ ., family = binomial, data = data_log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.873712 6.897461 -1.142 0.253646
## zn -0.032823 0.028391 -1.156 0.247652
## indus -0.028464 0.046224 -0.616 0.538045
## chas 0.928366 0.768809 1.208 0.227225
## rm -0.133863 0.638001 -0.210 0.833812
## age 0.036228 0.013541 2.675 0.007463 **
## rad 0.617675 0.159867 3.864 0.000112 ***
## tax -0.004744 0.003001 -1.581 0.113937
## ptratio 0.382430 0.127759 2.993 0.002759 **
## log_dis 3.167562 0.910123 3.480 0.000501 ***
## log_nox 26.342530 4.173387 6.312 2.75e-10 ***
## log_medv 3.300224 1.734249 1.903 0.057044 .
## log_lstat -0.007873 0.682170 -0.012 0.990792
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 195.35 on 453 degrees of freedom
## AIC: 221.35
##
## Number of Fisher Scoring iterations: 9
The stepwise regression model started with model1 AIC of 221.35, and systematically eliminated the least significant variables based on the AIC. Through iterations, it was reduced to 215.10, by removing log_lstat, chas, and indus
model2 <- step(model1, direction = "both")
## Start: AIC=221.35
## target ~ zn + indus + chas + rm + age + rad + tax + ptratio +
## log_dis + log_nox + log_medv + log_lstat
##
## Df Deviance AIC
## - log_lstat 1 195.35 219.35
## - rm 1 195.39 219.39
## - indus 1 195.74 219.74
## - chas 1 196.84 220.84
## - zn 1 196.89 220.89
## <none> 195.35 221.35
## - tax 1 197.89 221.89
## - log_medv 1 199.33 223.33
## - age 1 203.07 227.07
## - ptratio 1 205.27 229.27
## - log_dis 1 208.95 232.95
## - rad 1 234.45 258.45
## - log_nox 1 267.18 291.18
##
## Step: AIC=219.35
## target ~ zn + indus + chas + rm + age + rad + tax + ptratio +
## log_dis + log_nox + log_medv
##
## Df Deviance AIC
## - rm 1 195.40 217.40
## - indus 1 195.74 217.74
## - chas 1 196.86 218.86
## - zn 1 196.93 218.93
## <none> 195.35 219.35
## - tax 1 197.89 219.89
## + log_lstat 1 195.35 221.35
## - log_medv 1 199.62 221.62
## - age 1 203.92 225.92
## - ptratio 1 205.27 227.27
## - log_dis 1 209.06 231.06
## - rad 1 234.75 256.75
## - log_nox 1 267.20 289.20
##
## Step: AIC=217.4
## target ~ zn + indus + chas + age + rad + tax + ptratio + log_dis +
## log_nox + log_medv
##
## Df Deviance AIC
## - indus 1 195.76 215.76
## - chas 1 196.97 216.97
## - zn 1 197.14 217.14
## <none> 195.40 217.40
## - tax 1 198.04 218.04
## + rm 1 195.35 219.35
## + log_lstat 1 195.39 219.39
## - age 1 205.49 225.49
## - log_medv 1 205.94 225.94
## - ptratio 1 206.12 226.12
## - log_dis 1 209.32 229.32
## - rad 1 235.20 255.20
## - log_nox 1 268.31 288.31
##
## Step: AIC=215.76
## target ~ zn + chas + age + rad + tax + ptratio + log_dis + log_nox +
## log_medv
##
## Df Deviance AIC
## - chas 1 197.10 215.10
## - zn 1 197.56 215.56
## <none> 195.76 215.76
## + indus 1 195.40 217.40
## - tax 1 199.74 217.74
## + rm 1 195.74 217.74
## + log_lstat 1 195.76 217.76
## - age 1 205.81 223.81
## - ptratio 1 206.22 224.22
## - log_medv 1 206.42 224.42
## - log_dis 1 210.45 228.45
## - rad 1 243.06 261.06
## - log_nox 1 272.14 290.14
##
## Step: AIC=215.1
## target ~ zn + age + rad + tax + ptratio + log_dis + log_nox +
## log_medv
##
## Df Deviance AIC
## <none> 197.10 215.10
## - zn 1 199.57 215.57
## + chas 1 195.76 215.76
## + indus 1 196.97 216.97
## + rm 1 197.01 217.01
## + log_lstat 1 197.06 217.06
## - tax 1 201.75 217.75
## - ptratio 1 206.57 222.57
## - log_medv 1 207.57 223.57
## - age 1 208.01 224.01
## - log_dis 1 211.25 227.25
## - rad 1 248.50 264.50
## - log_nox 1 272.28 288.28
for model3 we will use the model with the lowest AIC: target ~ zn + age + rad + tax + ptratio + log_dis + log_nox + log_medv
model3 <- glm(target ~ zn + age + rad + tax + ptratio + log_dis + log_nox +
log_medv,family = binomial, data = data_log)
print(summary(model3))
##
## Call:
## glm(formula = target ~ zn + age + rad + tax + ptratio + log_dis +
## log_nox + log_medv, family = binomial, data = data_log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.974269 5.053271 -1.578 0.114556
## zn -0.040015 0.027516 -1.454 0.145877
## age 0.035864 0.011360 3.157 0.001594 **
## rad 0.671851 0.151692 4.429 9.47e-06 ***
## tax -0.005754 0.002784 -2.067 0.038737 *
## ptratio 0.343335 0.115350 2.976 0.002916 **
## log_dis 3.096195 0.867552 3.569 0.000359 ***
## log_nox 24.886374 3.826960 6.503 7.88e-11 ***
## log_medv 2.989590 0.993769 3.008 0.002627 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 197.10 on 457 degrees of freedom
## AIC: 215.1
##
## Number of Fisher Scoring iterations: 9
The confusion matrix indicates that the model3 achieved an accuracy of 92.7% on the evaluation dataset, with a 95% confidence interval of (89.95%, 94.89%). The Kappa statistic of 0.854 suggests a high level of agreement between the predicted and actual classifications, significantly better than random chance.
preds <- ifelse(predict(model3, type = "response") > 0.5, 1, 0)
print(confusionMatrix(as.factor(preds), as.factor(data$target)))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 221 18
## 1 16 211
##
## Accuracy : 0.927
## 95% CI : (0.8995, 0.9489)
## No Information Rate : 0.5086
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.854
##
## Mcnemar's Test P-Value : 0.8638
##
## Sensitivity : 0.9325
## Specificity : 0.9214
## Pos Pred Value : 0.9247
## Neg Pred Value : 0.9295
## Prevalence : 0.5086
## Detection Rate : 0.4742
## Detection Prevalence : 0.5129
## Balanced Accuracy : 0.9269
##
## 'Positive' Class : 0
##
roc_curve <- roc(data$target, predict(model3, type = "response"))
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
plot(roc_curve)
There is a 97.34% chance that the model will correctly rank a randomly chosen positive instance higher than a randomly chosen negative instance.
print(roc_curve["auc"])
## $auc
## Area under the curve: 0.9734
evaluation_data <- read.csv("https://raw.githubusercontent.com/Angelogallardo05/DAta-621-HW3/refs/heads/main/crime-evaluation-data_modified.csv")
evaluation_data$log_dis <- log(evaluation_data$dis)
evaluation_data$log_nox <- log(evaluation_data$nox)
evaluation_data$log_medv <- log(evaluation_data$medv)
evaluation_data$log_lstat <- log(evaluation_data$lstat)
evaluation_data <- evaluation_data %>%
select( zn, age, rad, tax, ptratio,
log_dis, log_nox, log_medv, )
print(head(evaluation_data))
## zn age rad tax ptratio log_dis log_nox log_medv
## 1 0 61.1 2 242 17.8 1.602836 -0.7571525 3.546740
## 2 0 84.5 4 307 21.0 1.495575 -0.6198967 2.901422
## 3 0 94.4 4 307 21.0 1.493960 -0.6198967 2.912351
## 4 0 82.0 4 307 21.0 1.383791 -0.6198967 2.580217
## 5 0 41.5 5 279 19.2 1.369708 -0.6951492 3.044522
## 6 25 66.2 8 284 19.7 1.977603 -0.7918632 2.928524
predicted_probabilities <- predict(model3, newdata = evaluation_data, type = "response")
predicted_probabilities <- round(predicted_probabilities, 2)
print(predicted_probabilities)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
## 0.05 0.74 0.81 0.42 0.11 0.28 0.36 0.01 0.00 0.00 0.35 0.22 0.77 0.75 0.65 0.17
## 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32
## 0.36 0.94 0.04 0.00 0.00 0.08 0.15 0.22 0.18 0.67 0.00 1.00 1.00 1.00 1.00 1.00
## 33 34 35 36 37 38 39 40
## 1.00 1.00 1.00 1.00 1.00 1.00 0.83 0.41