Objective: to build a binary logistic regression model on the training data set to predict whether the neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You can only use the variables given to you (or, variables that you derive from the variables provided). Below is a short description of the variables of interest in the data set:

• zn: proportion of residential land zoned for large lots (over 25000 square feet) (predictor variable)

• indus: proportion of non-retail business acres per suburb (predictor variable)

• chas: a dummy var. for whether the suburb borders the Charles River (1) or not (0) (predictor variable)

• nox: nitrogen oxides concentration (parts per 10 million) (predictor variable)

• rm: average number of rooms per dwelling (predictor variable)

• age: proportion of owner-occupied units built prior to 1940 (predictor variable)

• dis: weighted mean of distances to five Boston employment centers (predictor variable)

• rad: index of accessibility to radial highways (predictor variable)

• tax: full-value property-tax rate per $10,000 (predictor variable)

• ptratio: pupil-teacher ratio by town (predictor variable)

• lstat: lower status of the population (percent) (predictor variable)

• medv: median value of owner-occupied homes in $1000s (predictor variable)

• target: whether the crime rate is above the median crime rate (1) or not (0) (response variable)

1. data exploration

data <- read.csv("https://raw.githubusercontent.com/Angelogallardo05/DAta-621-HW3/refs/heads/main/crime-training-data_modified.csv")

## Rows: 466
## Columns: 13
## $ zn      <dbl> 0, 0, 0, 30, 0, 0, 0, 0, 0, 80, 22, 0, 0, 22, 0, 0, 100, 20, 0…
## $ indus   <dbl> 19.58, 19.58, 18.10, 4.93, 2.46, 8.56, 18.10, 18.10, 5.19, 3.6…
## $ chas    <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox     <dbl> 0.605, 0.871, 0.740, 0.428, 0.488, 0.520, 0.693, 0.693, 0.515,…
## $ rm      <dbl> 7.929, 5.403, 6.485, 6.393, 7.155, 6.781, 5.453, 4.519, 6.316,…
## $ age     <dbl> 96.2, 100.0, 100.0, 7.8, 92.2, 71.3, 100.0, 100.0, 38.1, 19.1,…
## $ dis     <dbl> 2.0459, 1.3216, 1.9784, 7.0355, 2.7006, 2.8561, 1.4896, 1.6582…
## $ rad     <int> 5, 5, 24, 6, 3, 5, 24, 24, 5, 1, 7, 5, 24, 7, 3, 3, 5, 5, 24, …
## $ tax     <int> 403, 403, 666, 300, 193, 384, 666, 666, 224, 315, 330, 398, 66…
## $ ptratio <dbl> 14.7, 14.7, 20.2, 16.6, 17.8, 20.9, 20.2, 20.2, 20.2, 16.4, 19…
## $ lstat   <dbl> 3.70, 26.82, 18.85, 5.19, 4.82, 7.67, 30.59, 36.98, 5.68, 9.25…
## $ medv    <dbl> 50.0, 13.4, 15.4, 23.7, 37.9, 26.5, 5.0, 7.0, 22.2, 20.9, 24.8…
## $ target  <int> 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0,…

The summary statistics reveal that nox (nitric oxides concentration) has a large range, which may indicate significant pollution levels in some areas.

##        zn             indus             chas              nox        
##  Min.   :  0.00   Min.   : 0.460   Min.   :0.00000   Min.   :0.3890  
##  1st Qu.:  0.00   1st Qu.: 5.145   1st Qu.:0.00000   1st Qu.:0.4480  
##  Median :  0.00   Median : 9.690   Median :0.00000   Median :0.5380  
##  Mean   : 11.58   Mean   :11.105   Mean   :0.07082   Mean   :0.5543  
##  3rd Qu.: 16.25   3rd Qu.:18.100   3rd Qu.:0.00000   3rd Qu.:0.6240  
##  Max.   :100.00   Max.   :27.740   Max.   :1.00000   Max.   :0.8710  
##        rm             age              dis              rad       
##  Min.   :3.863   Min.   :  2.90   Min.   : 1.130   Min.   : 1.00  
##  1st Qu.:5.887   1st Qu.: 43.88   1st Qu.: 2.101   1st Qu.: 4.00  
##  Median :6.210   Median : 77.15   Median : 3.191   Median : 5.00  
##  Mean   :6.291   Mean   : 68.37   Mean   : 3.796   Mean   : 9.53  
##  3rd Qu.:6.630   3rd Qu.: 94.10   3rd Qu.: 5.215   3rd Qu.:24.00  
##  Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :24.00  
##       tax           ptratio         lstat             medv      
##  Min.   :187.0   Min.   :12.6   Min.   : 1.730   Min.   : 5.00  
##  1st Qu.:281.0   1st Qu.:16.9   1st Qu.: 7.043   1st Qu.:17.02  
##  Median :334.5   Median :18.9   Median :11.350   Median :21.20  
##  Mean   :409.5   Mean   :18.4   Mean   :12.631   Mean   :22.59  
##  3rd Qu.:666.0   3rd Qu.:20.2   3rd Qu.:16.930   3rd Qu.:25.00  
##  Max.   :711.0   Max.   :22.0   Max.   :37.970   Max.   :50.00  
##      target      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4914  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

nox is right-skewed, clustering around lower nitrogen oxide levels.dis is also right-skewed, suggesting most areas are near employment centers. rm has a bell-shaped distribution centered on average room counts, while age skews towards older homes.

numeric_vars <- names(data)[sapply(data, is.numeric)]
plots <- lapply(numeric_vars, function(var) {
  ggplot(data, aes(x = .data[[var]])) +
    geom_histogram(binwidth = diff(range(data[[var]], na.rm = TRUE)) / 30,
                   fill = "#0073C2FF", color = "black", alpha = 0.7) +
    labs(title = paste("Histogram of", var), x = var, y = "Frequency") +
    theme_minimal(base_size = 8) +
    theme(plot.title = element_text(hjust = 0.5, size = 8, face = "bold"),
          plot.margin = unit(c(1, 1, 1, 1), "lines"))
})


grid.arrange(grobs = plots, ncol = 4,
             top = textGrob("Histograms of Each Variable", gp = gpar(fontsize = 10, fontface = "bold")))

Its intereseting that higher property tax rates are associated with higher crime, however, higher crime may occur in urban areas were property are typically more valuable. There is no significant difference between areas bordering the Charles River and crime rates.

data$target <- as.factor(data$target)


long_data <- data %>%
    pivot_longer(cols = -target, names_to = "variable", values_to = "value")


p <- ggplot(long_data, aes(x = target, y = value, fill = variable)) +
    geom_boxplot() +
    facet_wrap(~ variable, scales = "free_y") +  # Create separate plots for each variable
    labs(title = "Boxplots of All Numeric Variables by Target",
         x = "Target",
         y = "Value") +
    theme_minimal()

print(p)

Property tax (tax) and accessibility to radial highways(rad) are highly correlated, while, median home values (medv) is strongly negatively correlated with the lower status of the population (lstat). Indicating that lower socioeconomic status areas tend to have lower home values, which might relate to crime rates.

cor_matrix <- cor(data[, sapply(data, is.numeric)])

corrplot(cor_matrix, 
         method = "color",  # Use color
         col = colorRampPalette(c("red", "white", "blue"))(200), 
         type = "full",    
         addCoef.col = "black", 
         tl.col = "black", 
         tl.srt = 45,    
         number.cex = 0.7 
)

2 Data Preparation

Check to see there is no missing data

print(colSums(is.na(data)))

##      zn   indus    chas     nox      rm     age     dis     rad     tax ptratio 
##       0       0       0       0       0       0       0       0       0       0 
##   lstat    medv  target 
##       0       0       0

Given that dis, nox, medc, and lstat are skewed, we will log transform to normalize it

data$log_dis <- log(data$dis)
data$log_nox <- log(data$nox)
data$log_medv <- log(data$medv)
data$log_lstat <- log(data$lstat)

remove the skewed variables to avoid multicollinearity

data_log <- data %>% 
  select(target, zn, indus, chas,rm, age, rad, tax, ptratio, 
         log_dis, log_nox, log_medv, log_lstat)

3. Build models

the first model indicates that age, rad, ptratio, log_dis, and log_nox are significant predicors of high crime risk, with positive associations with crime. Since rad and tax are highly correlated, we will remove the tax variable for our subsequent model. In addition, log_medv, zn, indus, chas, rm, and log_stat seem to be marginally significant or contribute less to the models predictive power.

model1 <- glm(target ~ ., family = binomial, data = data_log)
print(summary(model1))

## 
## Call:
## glm(formula = target ~ ., family = binomial, data = data_log)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -7.873712   6.897461  -1.142 0.253646    
## zn          -0.032823   0.028391  -1.156 0.247652    
## indus       -0.028464   0.046224  -0.616 0.538045    
## chas         0.928366   0.768809   1.208 0.227225    
## rm          -0.133863   0.638001  -0.210 0.833812    
## age          0.036228   0.013541   2.675 0.007463 ** 
## rad          0.617675   0.159867   3.864 0.000112 ***
## tax         -0.004744   0.003001  -1.581 0.113937    
## ptratio      0.382430   0.127759   2.993 0.002759 ** 
## log_dis      3.167562   0.910123   3.480 0.000501 ***
## log_nox     26.342530   4.173387   6.312 2.75e-10 ***
## log_medv     3.300224   1.734249   1.903 0.057044 .  
## log_lstat   -0.007873   0.682170  -0.012 0.990792    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 195.35  on 453  degrees of freedom
## AIC: 221.35
## 
## Number of Fisher Scoring iterations: 9

The stepwise regression model started with model1 AIC of 221.35, and systematically eliminated the least significant variables based on the AIC. Through iterations, it was reduced to 215.10, by removing log_lstat, chas, and indus

model2 <- step(model1, direction = "both")

## Start:  AIC=221.35
## target ~ zn + indus + chas + rm + age + rad + tax + ptratio + 
##     log_dis + log_nox + log_medv + log_lstat
## 
##             Df Deviance    AIC
## - log_lstat  1   195.35 219.35
## - rm         1   195.39 219.39
## - indus      1   195.74 219.74
## - chas       1   196.84 220.84
## - zn         1   196.89 220.89
## <none>           195.35 221.35
## - tax        1   197.89 221.89
## - log_medv   1   199.33 223.33
## - age        1   203.07 227.07
## - ptratio    1   205.27 229.27
## - log_dis    1   208.95 232.95
## - rad        1   234.45 258.45
## - log_nox    1   267.18 291.18
## 
## Step:  AIC=219.35
## target ~ zn + indus + chas + rm + age + rad + tax + ptratio + 
##     log_dis + log_nox + log_medv
## 
##             Df Deviance    AIC
## - rm         1   195.40 217.40
## - indus      1   195.74 217.74
## - chas       1   196.86 218.86
## - zn         1   196.93 218.93
## <none>           195.35 219.35
## - tax        1   197.89 219.89
## + log_lstat  1   195.35 221.35
## - log_medv   1   199.62 221.62
## - age        1   203.92 225.92
## - ptratio    1   205.27 227.27
## - log_dis    1   209.06 231.06
## - rad        1   234.75 256.75
## - log_nox    1   267.20 289.20
## 
## Step:  AIC=217.4
## target ~ zn + indus + chas + age + rad + tax + ptratio + log_dis + 
##     log_nox + log_medv
## 
##             Df Deviance    AIC
## - indus      1   195.76 215.76
## - chas       1   196.97 216.97
## - zn         1   197.14 217.14
## <none>           195.40 217.40
## - tax        1   198.04 218.04
## + rm         1   195.35 219.35
## + log_lstat  1   195.39 219.39
## - age        1   205.49 225.49
## - log_medv   1   205.94 225.94
## - ptratio    1   206.12 226.12
## - log_dis    1   209.32 229.32
## - rad        1   235.20 255.20
## - log_nox    1   268.31 288.31
## 
## Step:  AIC=215.76
## target ~ zn + chas + age + rad + tax + ptratio + log_dis + log_nox + 
##     log_medv
## 
##             Df Deviance    AIC
## - chas       1   197.10 215.10
## - zn         1   197.56 215.56
## <none>           195.76 215.76
## + indus      1   195.40 217.40
## - tax        1   199.74 217.74
## + rm         1   195.74 217.74
## + log_lstat  1   195.76 217.76
## - age        1   205.81 223.81
## - ptratio    1   206.22 224.22
## - log_medv   1   206.42 224.42
## - log_dis    1   210.45 228.45
## - rad        1   243.06 261.06
## - log_nox    1   272.14 290.14
## 
## Step:  AIC=215.1
## target ~ zn + age + rad + tax + ptratio + log_dis + log_nox + 
##     log_medv
## 
##             Df Deviance    AIC
## <none>           197.10 215.10
## - zn         1   199.57 215.57
## + chas       1   195.76 215.76
## + indus      1   196.97 216.97
## + rm         1   197.01 217.01
## + log_lstat  1   197.06 217.06
## - tax        1   201.75 217.75
## - ptratio    1   206.57 222.57
## - log_medv   1   207.57 223.57
## - age        1   208.01 224.01
## - log_dis    1   211.25 227.25
## - rad        1   248.50 264.50
## - log_nox    1   272.28 288.28

for model3 we will use the model with the lowest AIC: target ~ zn + age + rad + tax + ptratio + log_dis + log_nox + log_medv

model3 <- glm(target ~ zn + age + rad + tax + ptratio + log_dis + log_nox + 
    log_medv,family = binomial, data = data_log)
print(summary(model3))

## 
## Call:
## glm(formula = target ~ zn + age + rad + tax + ptratio + log_dis + 
##     log_nox + log_medv, family = binomial, data = data_log)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -7.974269   5.053271  -1.578 0.114556    
## zn          -0.040015   0.027516  -1.454 0.145877    
## age          0.035864   0.011360   3.157 0.001594 ** 
## rad          0.671851   0.151692   4.429 9.47e-06 ***
## tax         -0.005754   0.002784  -2.067 0.038737 *  
## ptratio      0.343335   0.115350   2.976 0.002916 ** 
## log_dis      3.096195   0.867552   3.569 0.000359 ***
## log_nox     24.886374   3.826960   6.503 7.88e-11 ***
## log_medv     2.989590   0.993769   3.008 0.002627 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 197.10  on 457  degrees of freedom
## AIC: 215.1
## 
## Number of Fisher Scoring iterations: 9

4. Model selection

The confusion matrix indicates that the model3 achieved an accuracy of 92.7% on the evaluation dataset, with a 95% confidence interval of (89.95%, 94.89%). The Kappa statistic of 0.854 suggests a high level of agreement between the predicted and actual classifications, significantly better than random chance.

preds <- ifelse(predict(model3, type = "response") > 0.5, 1, 0)
print(confusionMatrix(as.factor(preds), as.factor(data$target)))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 221  18
##          1  16 211
##                                           
##                Accuracy : 0.927           
##                  95% CI : (0.8995, 0.9489)
##     No Information Rate : 0.5086          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.854           
##                                           
##  Mcnemar's Test P-Value : 0.8638          
##                                           
##             Sensitivity : 0.9325          
##             Specificity : 0.9214          
##          Pos Pred Value : 0.9247          
##          Neg Pred Value : 0.9295          
##              Prevalence : 0.5086          
##          Detection Rate : 0.4742          
##    Detection Prevalence : 0.5129          
##       Balanced Accuracy : 0.9269          
##                                           
##        'Positive' Class : 0               
##

roc_curve <- roc(data$target, predict(model3, type = "response"))

## Setting levels: control = 0, case = 1

## Setting direction: controls < cases

plot(roc_curve)

There is a 97.34% chance that the model will correctly rank a randomly chosen positive instance higher than a randomly chosen negative instance.

print(roc_curve["auc"])

## $auc
## Area under the curve: 0.9734

Now we apply the model to the evaluation data to provide probabilities

evaluation_data <- read.csv("https://raw.githubusercontent.com/Angelogallardo05/DAta-621-HW3/refs/heads/main/crime-evaluation-data_modified.csv")
evaluation_data$log_dis <- log(evaluation_data$dis)
evaluation_data$log_nox <- log(evaluation_data$nox)
evaluation_data$log_medv <- log(evaluation_data$medv)
evaluation_data$log_lstat <- log(evaluation_data$lstat)

evaluation_data <- evaluation_data %>% 
  select( zn,  age, rad, tax, ptratio, 
         log_dis, log_nox, log_medv, )

print(head(evaluation_data))

##   zn  age rad tax ptratio  log_dis    log_nox log_medv
## 1  0 61.1   2 242    17.8 1.602836 -0.7571525 3.546740
## 2  0 84.5   4 307    21.0 1.495575 -0.6198967 2.901422
## 3  0 94.4   4 307    21.0 1.493960 -0.6198967 2.912351
## 4  0 82.0   4 307    21.0 1.383791 -0.6198967 2.580217
## 5  0 41.5   5 279    19.2 1.369708 -0.6951492 3.044522
## 6 25 66.2   8 284    19.7 1.977603 -0.7918632 2.928524

Predicted probabilities with evaluation data

predicted_probabilities <- predict(model3, newdata = evaluation_data, type = "response")
predicted_probabilities <- round(predicted_probabilities, 2)
print(predicted_probabilities)

##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16 
## 0.05 0.74 0.81 0.42 0.11 0.28 0.36 0.01 0.00 0.00 0.35 0.22 0.77 0.75 0.65 0.17 
##   17   18   19   20   21   22   23   24   25   26   27   28   29   30   31   32 
## 0.36 0.94 0.04 0.00 0.00 0.08 0.15 0.22 0.18 0.67 0.00 1.00 1.00 1.00 1.00 1.00 
##   33   34   35   36   37   38   39   40 
## 1.00 1.00 1.00 1.00 1.00 1.00 0.83 0.41

Data 621 HW3

Angel Gallardo

2024-10-30