DATA621-Homework 3

Author

Anthony Josue Roman, Rupendra Shrestha, Bikash Bhowmik, Haoming Chen, Jerald Melukkaran

Overview

The objective of this analysis is to create a logistic regression model that can predict the likelihood that the crime rate in a neighborhood is greater than its median. The dataset consists of several socioeconomic and environmental predictors, such as property tax rate, air pollution, housing variables, and highway access.

The training data is used for exploring the data, establishing correlations between variables, and developing multiple logistic regression models. Various specifications are considered, including full, reduced, and transformed models, in order to assess their predictive accuracy and interpretability.

A model is chosen based on its performance on classification metrics, such as accuracy, precision, sensitivity, specificity, F1 score, and AUC, as well as model fit statistics, like AIC. The selected model is implemented on the test data to estimate predicted probabilities and classifications, using a cutoff of 0.5.

In summary, this analysis involves developing a logistic regression model that can predict crime rate using socioeconomic and environmental variables, while also balancing predictive accuracy and interpretability.

1.1 Dataset Overview

The training dataset has 466 records with 13 columns; 12 are predictors and the remaining one is the response which is binary in nature, and is labeled target. The testing set on the other hand has 40 records and comprises of all predictor variables. The aim here is to generate a model using the training data such that it predicts the crime level in a neighborhood based on evaluation data.

library(tidyverse)

crime_train <- read_csv("crime-training-data_modified.csv")
crime_eval  <- read_csv("crime-evaluation-data_modified.csv")

dim(crime_train)
[1] 466  13
dim(crime_eval)
[1] 40 12

1.2 Structure of Data

The vast majority of variables in the dataset are continuous and relate to socioeconomic and environmental attributes of neighborhoods, like levels of air pollution, housing quality, proximity to highways, etc. The variable chas is dichotomous and reflects whether or not a given observation is located next to the Charles River. The response variable target is also dichotomous, so we are facing a classification task.

str(crime_train)
spc_tbl_ [466 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ zn     : num [1:466] 0 0 0 30 0 0 0 0 0 80 ...
 $ indus  : num [1:466] 19.58 19.58 18.1 4.93 2.46 ...
 $ chas   : num [1:466] 0 1 0 0 0 0 0 0 0 0 ...
 $ nox    : num [1:466] 0.605 0.871 0.74 0.428 0.488 0.52 0.693 0.693 0.515 0.392 ...
 $ rm     : num [1:466] 7.93 5.4 6.49 6.39 7.16 ...
 $ age    : num [1:466] 96.2 100 100 7.8 92.2 71.3 100 100 38.1 19.1 ...
 $ dis    : num [1:466] 2.05 1.32 1.98 7.04 2.7 ...
 $ rad    : num [1:466] 5 5 24 6 3 5 24 24 5 1 ...
 $ tax    : num [1:466] 403 403 666 300 193 384 666 666 224 315 ...
 $ ptratio: num [1:466] 14.7 14.7 20.2 16.6 17.8 20.9 20.2 20.2 20.2 16.4 ...
 $ lstat  : num [1:466] 3.7 26.82 18.85 5.19 4.82 ...
 $ medv   : num [1:466] 50 13.4 15.4 23.7 37.9 26.5 5 7 22.2 20.9 ...
 $ target : num [1:466] 1 1 1 0 0 0 1 1 0 0 ...
 - attr(*, "spec")=
  .. cols(
  ..   zn = col_double(),
  ..   indus = col_double(),
  ..   chas = col_double(),
  ..   nox = col_double(),
  ..   rm = col_double(),
  ..   age = col_double(),
  ..   dis = col_double(),
  ..   rad = col_double(),
  ..   tax = col_double(),
  ..   ptratio = col_double(),
  ..   lstat = col_double(),
  ..   medv = col_double(),
  ..   target = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 

1.3 Missing Values

One of the key aspects of data analysis is verifying whether there are any missing values in the data as their presence can affect the performance of models. In our case, we have determined that there are no missing values in both training and testing datasets.

colSums(is.na(crime_train))
     zn   indus    chas     nox      rm     age     dis     rad     tax ptratio 
      0       0       0       0       0       0       0       0       0       0 
  lstat    medv  target 
      0       0       0 
colSums(is.na(crime_eval))
     zn   indus    chas     nox      rm     age     dis     rad     tax ptratio 
      0       0       0       0       0       0       0       0       0       0 
  lstat    medv 
      0       0 

1.4 Target Distribution

However, looking at the distribution of the target variable, we can conclude that the database is rather balanced between the two categories – high and low crime rate. It will help when building a model since the probability of the model being biased towards one category will be minimized.

table(crime_train$target)

  0   1 
237 229 
prop.table(table(crime_train$target))

        0         1 
0.5085837 0.4914163 

1.5 Correlation with Target

Analysis of the relationship between the predictor and the dependent variables is crucial in understanding the variables that are essential in predicting the level of crimes. Some of the variables that exhibit a positive relationship in relation to the level of crimes include nox, age, rad, tax, and lstat, implying that the higher the value of these variables, the more likely there will be more crimes. Some of the variables with a negative relationship include dis, zn, and medv.

cor(crime_train)[, "target"]
         zn       indus        chas         nox          rm         age 
-0.43168176  0.60485074  0.08004187  0.72610622 -0.15255334  0.63010625 
        dis         rad         tax     ptratio       lstat        medv 
-0.61867312  0.62810492  0.61111331  0.25084892  0.46912702 -0.27055071 
     target 
 1.00000000 

1.6 Correlation Matrix

Apart from identifying correlations between predictors and the response variable, a complete correlation matrix was considered to have a clear insight into the correlation between different predictors. Multicollinearity, which refers to high correlation between different predictors, may occur when such an analysis is performed.

library(corrplot)

cor_matrix <- cor(crime_train)

corrplot(
  cor_matrix,
  method = "color",
  type = "upper",
  tl.cex = 0.8,
  tl.col = "black"
)

The correlation matrix identifies a number of moderate to high levels of correlation between the predictors. Specifically, there are high levels of positive correlation between variables like rad and tax, but a high level of negative correlation between variables like dis and nox and indus.

This implies that there exists multicollinearity in the predictors. This aspect is critical since multicollinearity affects the stability and interpretation of logistic regression model coefficients. This explains the difference in coefficient signs in the models compared to their simple correlation with the outcome variable.

1.7 Visualization

The boxplot indicates that there exists some relationship between the values of the predictor nox and the target value for which our hypothesis is based. We can observe that areas falling into the category of high crime tend to have comparatively higher median values for nox.

ggplot(crime_train, aes(x = factor(target), y = nox)) +
  geom_boxplot(fill = "steelblue") +
  labs(title = "NOX vs Crime Target", x = "Target", y = "NOX")

Box plot analysis depicts that there is a marked difference in the value of nitrogen oxides (nox) for areas having low crime and high crime rate. Areas which are marked as high crime rate (target = 1) have higher median nox values with greater dispersion than that of low crime rate. The implication of the finding is that more the level of air pollution, more will be the crime rate.

1.8 Key Takeaways

In summary, the data is very clean with no missing data and a balanced target variable. There are several predictors that appear to have a significant relationship with the target, implying that they could be effective predictors when developing models. This provides justification for adopting logistic regression as a suitable technique for forecasting high crime rates.

2. Data Preparation

The dataset did not have any missing values and thus there was no need for data imputation or cleaning. In this regard, the data preparation process concentrated on making different formulations of the models that would enhance their performance and make them more interpretable.

Several formulations of the datasets have been made, which include the full dataset, an optimized dataset, and an altered version of the dataset.

colSums(is.na(crime_train))
     zn   indus    chas     nox      rm     age     dis     rad     tax ptratio 
      0       0       0       0       0       0       0       0       0       0 
  lstat    medv  target 
      0       0       0 
colSums(is.na(crime_eval))
     zn   indus    chas     nox      rm     age     dis     rad     tax ptratio 
      0       0       0       0       0       0       0       0       0       0 
  lstat    medv 
      0       0 

2.1 Missing Values

It was first checked whether any missing values existed in the training and evaluation datasets. It was found that none of them had missing values. Therefore, there was no need for applying any missing value replacement techniques like using the mean or median.

2.2 Variable Transformations

The data is clean, but some of the predictor variables show skewness and scale differences. In order to deal with this, a log transformation is applied to some of the predictor variables, namely nox, dis, tax, lstat, and medv. This approach will assist in stabilizing the variance, reducing the effects of outliers, and establishing linearity with the odds of the target variable.

A shifted log transformation (log1p) is also used for zn in order to accommodate for any zeros.

crime_train_trans <- crime_train %>%
  mutate(
    log1p_zn = log1p(zn),
    log_nox = log(nox),
    log_dis = log(dis),
    log_tax = log(tax),
    log_lstat = log(lstat),
    log_medv = log(medv)
  )

crime_eval_trans <- crime_eval %>%
  mutate(
    log1p_zn = log1p(zn),
    log_nox = log(nox),
    log_dis = log(dis),
    log_tax = log(tax),
    log_lstat = log(lstat),
    log_medv = log(medv)
  )

2.3 Feature Selection Strategy

Apart from employing all predictors in the comprehensive model, another reduced set of variables was chosen based on their correlation with the response variable and the comprehensibility of the predictor. For example, nox, rm, dis, rad, ptratio, lstat, medv, and zn were employed in the reduced model since they exhibited significant correlation with the level of crime and are critical socio-economic and environmental variables.

This technique enables one to compare a sophisticated model with a less complicated one.

2.4 Summary of Data Preparation

In general, the main goal of the data preparation phase has been to generate several variations of the original dataset in order to conduct a comparative analysis of the models. As there were no data pre-processing tasks performed, the only transformations conducted were those of feature engineering and selection.

3. Build Models

In order to solve the classification problem, three models of logistic regression will be constructed by applying various transformations and predictors. This methodology will help compare the efficiency of different models based on their complexity and predictability. These models include the full model, reduced model, and transformed model.

3.1 Model 1: Full Model

The first model uses all available predictor variables. It is used as a base model and considers all possible relations between predictor variables and target variable.

model1 <- glm(target ~ ., data = crime_train, family = binomial)

summary(model1)

Call:
glm(formula = target ~ ., family = binomial, data = crime_train)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -40.822934   6.632913  -6.155 7.53e-10 ***
zn           -0.065946   0.034656  -1.903  0.05706 .  
indus        -0.064614   0.047622  -1.357  0.17485    
chas          0.910765   0.755546   1.205  0.22803    
nox          49.122297   7.931706   6.193 5.90e-10 ***
rm           -0.587488   0.722847  -0.813  0.41637    
age           0.034189   0.013814   2.475  0.01333 *  
dis           0.738660   0.230275   3.208  0.00134 ** 
rad           0.666366   0.163152   4.084 4.42e-05 ***
tax          -0.006171   0.002955  -2.089  0.03674 *  
ptratio       0.402566   0.126627   3.179  0.00148 ** 
lstat         0.045869   0.054049   0.849  0.39608    
medv          0.180824   0.068294   2.648  0.00810 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 645.88  on 465  degrees of freedom
Residual deviance: 192.05  on 453  degrees of freedom
AIC: 218.05

Number of Fisher Scoring iterations: 9
AIC(model1)
[1] 218.0469

The coefficient values obtained by using this full model show the influence of each predictor on the odds ratio for predicting whether a neighborhood has a high crime rate. For example, some variables like nox, rad, ptratio, and lstat have positive correlations with the target variable. Some variables like dis and zn have negative correlations with the target variable.

Although some values obtained here differ from correlation values, it is because there might be interactions between predictors or multicollinearity issues. Anyway, the full model gives us a more holistic view of the data.

3.2 Model 2: Reduced Model

The second model considers only a selected few predictors, whose relationship with the target variable is known and is interpretable. This is done in order to develop a more concise model without compromising its accuracy.

model2 <- glm(
  target ~ nox + rm + dis + rad + ptratio + lstat + medv + zn,
  data = crime_train,
  family = binomial
)

summary(model2)

Call:
glm(formula = target ~ nox + rm + dis + rad + ptratio + lstat + 
    medv + zn, family = binomial, data = crime_train)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -34.92230    5.70372  -6.123 9.20e-10 ***
nox          38.36522    5.99958   6.395 1.61e-10 ***
rm            0.23137    0.59264   0.390  0.69624    
dis           0.60820    0.19698   3.088  0.00202 ** 
rad           0.52476    0.12614   4.160 3.18e-05 ***
ptratio       0.21829    0.10599   2.060  0.03944 *  
lstat         0.08494    0.04420   1.922  0.05466 .  
medv          0.12969    0.05605   2.314  0.02068 *  
zn           -0.07742    0.03121  -2.480  0.01313 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 645.88  on 465  degrees of freedom
Residual deviance: 212.99  on 457  degrees of freedom
AIC: 230.99

Number of Fisher Scoring iterations: 9
AIC(model2)
[1] 230.9875

In contrast to the full model, the reduced model is based on only a few predictors, which have significant associations with the crime rates. Though it is easier to interpret than the full model, it may lack in terms of accuracy.

3.3 Model 3: Transformed Model

In the third model, the inclusion of logarithms of certain variables was done to compensate for any skewness in the data and any nonlinearity in the relationship among the variables.

model3 <- glm(
  target ~ log1p_zn + log_nox + log_dis + log_tax + log_lstat + log_medv + rm + ptratio + chas,
  data = crime_train_trans,
  family = binomial
)

summary(model3)

Call:
glm(formula = target ~ log1p_zn + log_nox + log_dis + log_tax + 
    log_lstat + log_medv + rm + ptratio + chas, family = binomial, 
    data = crime_train_trans)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -22.2423     6.8829  -3.232 0.001231 ** 
log1p_zn     -0.3417     0.1912  -1.787 0.073904 .  
log_nox      24.6867     3.3017   7.477 7.60e-14 ***
log_dis       3.1870     0.7548   4.222 2.42e-05 ***
log_tax       2.4000     0.6717   3.573 0.000353 ***
log_lstat     0.7902     0.5546   1.425 0.154203    
log_medv      2.7628     0.9256   2.985 0.002837 ** 
rm            0.8123     0.3833   2.119 0.034074 *  
ptratio       0.2369     0.1068   2.218 0.026569 *  
chas          1.4991     0.6510   2.303 0.021289 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 645.88  on 465  degrees of freedom
Residual deviance: 242.56  on 456  degrees of freedom
AIC: 262.56

Number of Fisher Scoring iterations: 7
AIC(model3)
[1] 262.5591

The modified model examines whether using log transformations helps improve the accuracy of the predictions. Log transformations are helpful as they limit the effect of extreme observations on the analysis while allowing for a linear association between predictor and response variables through logistic regression.

3.4 Logistic Regression Interpretation

In logistic regression, coefficients measure the amount of change in the log-odds of the response variable due to a one unit change in the explanatory variable while controlling for the other variables. The higher the coefficient, the more likely it is for high crime. Similarly, a lower coefficient reduces the probability of high crime.

As several variables have been used together in the analysis, the coefficients must be analyzed along with the other variables. This implies that there may be cases where the sign of the coefficients might contradict their correlation due to multicollinearity.

A high correlation between two variables may suggest multicollinearity, an issue that can have a significant impact on the stability of the results.

4. Select Models

For selecting the best-performing logistic regression model, different evaluation measures were considered. Accuracy, error rate, precision, sensitivity, specificity, F-measure, ROC-AUC, and confusion matrix were among those measures. Furthermore, AIC was considered to assess how well the model fits the data. This provides a complete insight into the performance of all the three candidate models.

4.1 Model Evaluation Approach

Evaluation of each model involved the application of the same on the training set, where a classification threshold of 0.5 was used. The predicted values were binarized to determine the accuracy and effectiveness of the model.

library(pROC)

get_metrics <- function(model, data, actual_col = "target") {
  probs <- predict(model, newdata = data, type = "response")
  preds <- ifelse(probs >= 0.5, 1, 0)
  actual <- data[[actual_col]]
  
  cm <- table(Predicted = preds, Actual = actual)
  
  TN <- cm["0","0"]
  FP <- cm["1","0"]
  FN <- cm["0","1"]
  TP <- cm["1","1"]
  
  accuracy <- mean(preds == actual)
  error_rate <- 1 - accuracy
  precision <- TP / (TP + FP)
  sensitivity <- TP / (TP + FN)
  specificity <- TN / (TN + FP)
  f1 <- 2 * precision * sensitivity / (precision + sensitivity)
  auc_val <- as.numeric(roc(actual, probs)$auc)
  
  data.frame(
    Accuracy = accuracy,
    Error_Rate = error_rate,
    Precision = precision,
    Sensitivity = sensitivity,
    Specificity = specificity,
    F1_Score = f1,
    AUC = auc_val,
    AIC = AIC(model)
  )
}

4.2 Model Comparison

These three models have been analyzed by applying the defined criteria for performance measurement. A summary of these findings is presented below:

metrics1 <- get_metrics(model1, crime_train)
metrics2 <- get_metrics(model2, crime_train)
metrics3 <- get_metrics(model3, crime_train_trans)

model_comparison <- rbind(
  cbind(Model = "Model 1 - Full", metrics1),
  cbind(Model = "Model 2 - Reduced", metrics2),
  cbind(Model = "Model 3 - Transformed", metrics3)
)

knitr::kable(model_comparison, digits = 4)
Model Accuracy Error_Rate Precision Sensitivity Specificity F1_Score AUC AIC
Model 1 - Full 0.9163 0.0837 0.9241 0.9039 0.9283 0.9139 0.9738 218.0469
Model 2 - Reduced 0.8927 0.1073 0.9202 0.8559 0.9283 0.8869 0.9664 230.9875
Model 3 - Transformed 0.8841 0.1159 0.8821 0.8821 0.8861 0.8821 0.9579 262.5591

4.3 Confusion Matrices

Confusion matrices provide insight into how well each model classifies observations into high-crime and low-crime categories.

pred1 <- ifelse(predict(model1, type = "response") >= 0.5, 1, 0)
pred2 <- ifelse(predict(model2, type = "response") >= 0.5, 1, 0)
pred3 <- ifelse(predict(model3, newdata = crime_train_trans, type = "response") >= 0.5, 1, 0)

table(Model1 = pred1, Actual = crime_train$target)
      Actual
Model1   0   1
     0 220  22
     1  17 207
table(Model2 = pred2, Actual = crime_train$target)
      Actual
Model2   0   1
     0 220  33
     1  17 196
table(Model3 = pred3, Actual = crime_train$target)
      Actual
Model3   0   1
     0 210  27
     1  27 202

4.4 Final Model Selection

Based on the evaluation results, Model 1 (the full model) was selected as the final model. It achieved the highest overall performance across multiple metrics, including accuracy, F1 score, and AUC, while also having the lowest AIC among the three models.

Although Model 2 offered a simpler and more interpretable structure, its predictive performance was slightly lower. Model 3, which incorporated transformed variables, did not show sufficient improvement to justify the added complexity.

Therefore, Model 1 was selected because it provides the best balance between predictive performance and model fit. While it is more complex, its superior performance makes it the most appropriate choice for predicting high crime levels.

The model comparison results show that Model 1 achieves the highest accuracy, F1 score, and AUC, along with the lowest AIC value. Model 2 performs slightly worse but offers a simpler structure, while Model 3 shows the weakest performance despite using transformed variables. These results indicate that including all predictors provides the strongest predictive performance for this dataset.

5. Predictions for Evaluation Data

The final selected model (Model 1) was then applied to the validation data set for prediction purposes. A cut-off point of 0.5 was then considered for determining whether the output from the model would be categorized as high crime rates or not.

eval_prob <- predict(model1, newdata = crime_eval, type = "response")
eval_class <- ifelse(eval_prob >= 0.5, 1, 0)
crime_eval_predictions <- crime_eval %>%
  mutate(
    probability = eval_prob,
    classification = eval_class
  )

head(crime_eval_predictions)
# A tibble: 6 × 14
     zn indus  chas   nox    rm   age   dis   rad   tax ptratio lstat  medv
  <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl> <dbl>
1     0  7.07     0 0.469  7.18  61.1  4.97     2   242    17.8  4.03  34.7
2     0  8.14     0 0.538  6.10  84.5  4.46     4   307    21   10.3   18.2
3     0  8.14     0 0.538  6.50  94.4  4.45     4   307    21   12.8   18.4
4     0  8.14     0 0.538  5.95  82    3.99     4   307    21   27.7   13.2
5     0  5.96     0 0.499  5.85  41.5  3.93     5   279    19.2  8.77  21  
6    25  5.13     0 0.453  5.74  66.2  7.23     8   284    19.7 13.2   18.7
# ℹ 2 more variables: probability <dbl>, classification <dbl>

The generated data contains the calculated prediction probabilities for each neighborhood being assigned a classification of high crime, as well as a binary assignment of classification itself. This data could be analyzed to determine neighborhoods that may need more intervention because of their propensity for crime.

write_csv(crime_eval_predictions, "crime_eval_predictions.csv")

The predicted values file consists of the predictor variables used in the evaluation data set, together with the predicted values. This file meets the assignment requirements of providing predictions through the chosen logistic regression model.

Conclusion

Logistic regression was employed to determine whether the crime rate in a particular neighborhood was above the median. From the numerous models analyzed, the selected model emerged as having the best performances on various parameters such as accuracy, F1 score, and AUC.

It has been revealed from the findings of the research that certain socio-economic and environmental factors are highly correlated with crime rates. The selected model is very effective and accurate for determining areas of high crime.