Overview

In this homework assignment, you will explore, analyze and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether the crime rate is above the median crime rate (1) or not (0).

Your objective is to build a binary logistic regression model on the training data set to predict whether a neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You may only use the variables given to you (or variables derived from them). The variables of interest are summarized in the data exploration section below.
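
In R, this workflow reduces to fitting a binomial glm() and calling predict() with type = "response". The sketch below is illustrative only; it uses the raw variable names and the data-frame names (crimetrainingdata, crimeevaldata) that are loaded in the appendix, before any renaming or transformation.

# Minimal sketch of the modeling workflow on the raw data
fit <- glm(target ~ ., data = crimetrainingdata, family = binomial)

# Predicted probability that each evaluation neighborhood is above the median crime rate
eval_prob  <- predict(fit, newdata = crimeevaldata, type = "response")

# Hard 0/1 classification at a 0.5 probability cutoff
eval_class <- as.integer(eval_prob > 0.5)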

1. DATA EXPLORATION

Loading Packages

## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## 
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
## 
##     smiths
## Loading required package: lattice
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

We started by loading the packages needed for this project, which allow us to create the visuals and perform the various analyses that follow.

Loading Datasets

Here we preview the datasets prior to any manipulation. They contain the same variables, except that the evaluation dataset does not include the target variable.

Summary Statistics

## Dimensions of the crime training dataset:
## [1] 466  13
## 
## Summary of variables in the crime training dataset:
## 'data.frame':    466 obs. of  13 variables:
##  $ zn     : num  0 0 0 30 0 0 0 0 0 80 ...
##  $ indus  : num  19.58 19.58 18.1 4.93 2.46 ...
##  $ chas   : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.605 0.871 0.74 0.428 0.488 0.52 0.693 0.693 0.515 0.392 ...
##  $ rm     : num  7.93 5.4 6.49 6.39 7.16 ...
##  $ age    : num  96.2 100 100 7.8 92.2 71.3 100 100 38.1 19.1 ...
##  $ dis    : num  2.05 1.32 1.98 7.04 2.7 ...
##  $ rad    : int  5 5 24 6 3 5 24 24 5 1 ...
##  $ tax    : int  403 403 666 300 193 384 666 666 224 315 ...
##  $ ptratio: num  14.7 14.7 20.2 16.6 17.8 20.9 20.2 20.2 20.2 16.4 ...
##  $ lstat  : num  3.7 26.82 18.85 5.19 4.82 ...
##  $ medv   : num  50 13.4 15.4 23.7 37.9 26.5 5 7 22.2 20.9 ...
##  $ target : int  1 1 1 0 0 0 1 1 0 0 ...

The crime training dataset comprises 466 observations and 13 variables, including factors like land zoning, environmental indicators, socioeconomic status, and housing features. The target variable denotes whether the crime rate is above the median, aiding in predictive modeling to identify neighborhoods at risk for higher crime levels.

##        zn             indus             chas              nox        
##  Min.   :  0.00   Min.   : 0.460   Min.   :0.00000   Min.   :0.3890  
##  1st Qu.:  0.00   1st Qu.: 5.145   1st Qu.:0.00000   1st Qu.:0.4480  
##  Median :  0.00   Median : 9.690   Median :0.00000   Median :0.5380  
##  Mean   : 11.58   Mean   :11.105   Mean   :0.07082   Mean   :0.5543  
##  3rd Qu.: 16.25   3rd Qu.:18.100   3rd Qu.:0.00000   3rd Qu.:0.6240  
##  Max.   :100.00   Max.   :27.740   Max.   :1.00000   Max.   :0.8710  
##        rm             age              dis              rad       
##  Min.   :3.863   Min.   :  2.90   Min.   : 1.130   Min.   : 1.00  
##  1st Qu.:5.887   1st Qu.: 43.88   1st Qu.: 2.101   1st Qu.: 4.00  
##  Median :6.210   Median : 77.15   Median : 3.191   Median : 5.00  
##  Mean   :6.291   Mean   : 68.37   Mean   : 3.796   Mean   : 9.53  
##  3rd Qu.:6.630   3rd Qu.: 94.10   3rd Qu.: 5.215   3rd Qu.:24.00  
##  Max.   :8.780   Max.   :100.00   Max.   :12.127   Max.   :24.00  
##       tax           ptratio         lstat             medv      
##  Min.   :187.0   Min.   :12.6   Min.   : 1.730   Min.   : 5.00  
##  1st Qu.:281.0   1st Qu.:16.9   1st Qu.: 7.043   1st Qu.:17.02  
##  Median :334.5   Median :18.9   Median :11.350   Median :21.20  
##  Mean   :409.5   Mean   :18.4   Mean   :12.631   Mean   :22.59  
##  3rd Qu.:666.0   3rd Qu.:20.2   3rd Qu.:16.930   3rd Qu.:25.00  
##  Max.   :711.0   Max.   :22.0   Max.   :37.970   Max.   :50.00  
##      target      
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.4914  
##  3rd Qu.:1.0000  
##  Max.   :1.0000

This summary provides descriptive statistics for each variable in the dataset. For example, the zn variable, the proportion of residential land zoned for large lots, ranges from 0 to 100 with a mean of 11.58, while the binary target variable has a mean of 0.4914, indicating that neighborhoods above and below the median crime rate are represented in roughly equal numbers. Further analysis is needed to understand how these variables relate to one another and to the target.

Visualizations of Predictor Variables

Histograms

By visualizing the frequencies of the predictor variables, we can see that several of the distributions are positively skewed, especially 'dis' (the weighted mean of distances to five Boston employment centers) and 'rad' (an index of accessibility to radial highways). We can attempt to transform them toward a more normal-like distribution.

Boxplot

## Warning in scale_y_continuous(trans = "log10", breaks = c(1, 10, 100, 1000), :
## log-10 transformation introduced infinite values.
## Warning: Removed 772 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

In this boxplot, we have separated the variables by whether the 'target' variable is 0 or 1. Values for neighborhoods with a crime rate below the median (0) are displayed in red, while those with a crime rate above the median (1) are displayed in green. Both groups show wide spreads across the predictors.

Heatmap

Lastly, before manipulating any data, we created a heat map to check how correlated these variables are with one another. The 'target' variable has strong positive correlations with a few predictors, including 'nox' (nitrogen oxide concentration), 'age', 'rad' (accessibility to radial highways), and 'tax' (property tax rate). It is negatively correlated with 'dis' (distance to employment centers) and 'zn' (residential land zoning).
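
The same information can be read off numerically. The short sketch below is a supplement to the appendix code, using the training data before the variables are renamed:

# Rank the predictors by their correlation with the target
target_cor <- cor(crimetrainingdata)[, "target"]
sort(target_cor[names(target_cor) != "target"], decreasing = TRUE)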

2. DATA PREPARATION

Missing Data

Checking for Missing Data

##      zn   indus    chas     nox      rm     age     dis     rad     tax ptratio 
##       0       0       0       0       0       0       0       0       0       0 
##   lstat    medv  target 
##       0       0       0

Visualization of Missing Data

Since there is no missing data, it will not be necessary to impute or remove any data.

Outliers

As seen in the boxplot, the outliers appeared reasonable, so we decided to leave the data unchanged.

Rename Variables

For clarity, we renamed several of the variables to make the data and the visuals easier to interpret.

Transforming Variables

Because some of the predictor variables are highly skewed, we transformed them in an attempt to create a more normal distribution. The log() function is applied to Rad_Highway_Accessibility and dis after adding 1 to each variable. Adding 1 is a standard safeguard that keeps the logarithm defined for zero values; here both variables have minimums of at least 1, so it simply shifts the scale slightly. Overall, the transformation reduces the right skewness of these variables and gives distributions closer to normal, making them more suitable for statistical analyses.
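
The transformation itself is a single dplyr step; the sketch below mirrors the appendix code, and base R's log1p() would be an equivalent alternative.

library(dplyr)

# Apply log(x + 1) to the two right-skewed predictors
crimetrainingdata_transformed <- crimetrainingdata %>%
  mutate(across(c(Rad_Highway_Accessibility, dis), ~ log(. + 1)))

# Equivalent: mutate(across(c(Rad_Highway_Accessibility, dis), log1p))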

Visualizing Transformed Variables

On the left-hand side, we can see the distributions of the 'Rad_Highway_Accessibility' and 'dis' variables before transformation. On the right-hand side, we can see the same two variables after the logarithmic transformation. Both distributions are somewhat closer to normal. We can now begin to build models for the data.

3. BUILD MODELS

Model 1 - Generalized Linear Model with All Variables

## 
## Call:
## glm(formula = Crime_Above_Median ~ ., family = binomial, data = crimetrainingdata)
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -40.822934   6.632913  -6.155 7.53e-10 ***
## Residential_Zoned_Lots     -0.065946   0.034656  -1.903  0.05706 .  
## Industry_Proportion        -0.064614   0.047622  -1.357  0.17485    
## Charles_River               0.910765   0.755546   1.205  0.22803    
## NOx_Concentration          49.122297   7.931706   6.193 5.90e-10 ***
## Avg_Rooms                  -0.587488   0.722847  -0.813  0.41637    
## age                         0.034189   0.013814   2.475  0.01333 *  
## dis                         0.738660   0.230275   3.208  0.00134 ** 
## Rad_Highway_Accessibility   0.666366   0.163152   4.084 4.42e-05 ***
## Property_Tax               -0.006171   0.002955  -2.089  0.03674 *  
## Pupil_Teacher_Ratio         0.402566   0.126627   3.179  0.00148 ** 
## Percent_Lower_Status        0.045869   0.054049   0.849  0.39608    
## Median_Home_Value           0.180824   0.068294   2.648  0.00810 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 192.05  on 453  degrees of freedom
## AIC: 218.05
## 
## Number of Fisher Scoring iterations: 9

The logistic regression model estimates the relationship between the predictor variables and the likelihood of a neighborhood having a crime rate above the median. We fit this first model without manipulating any variables in order to look at the data as a whole. The significant predictors are NOx concentration, radial highway accessibility, dwelling age, distance to employment centers, pupil-teacher ratio, property tax, and median home value, while residential zoned lots is only marginally significant (p = 0.057). For instance, increases in NOx concentration and radial highway accessibility are associated with higher odds of a neighborhood having a crime rate above the median. Variables such as industry proportion, the Charles River indicator, average number of rooms, and percent lower status do not show significant associations. Overall, the model suggests that environmental factors, socioeconomic indicators, and neighborhood characteristics play crucial roles in determining crime rates.
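
Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios that are easier to interpret. A brief sketch, using the model_1 object from the appendix:

# Odds ratios for Model 1 (exponentiated coefficients)
round(exp(coef(model_1)), 3)

# Example: the coefficient for Rad_Highway_Accessibility is 0.666, and
# exp(0.666) is roughly 1.95, so each one-unit increase in the accessibility
# index multiplies the odds of being above the median crime rate by about 1.95,
# holding the other predictors fixed.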

Model 2 - Generalized Linear Model with Transformed Variables

## 
## Call:
## glm(formula = Crime_Above_Median ~ ., family = binomial, data = crimetrainingdata_transformed)
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -51.310805   7.756344  -6.615 3.71e-11 ***
## Residential_Zoned_Lots     -0.052757   0.031585  -1.670 0.094862 .  
## Industry_Proportion        -0.013695   0.050268  -0.272 0.785288    
## Charles_River               0.764349   0.747062   1.023 0.306242    
## NOx_Concentration          50.077439   7.919249   6.324 2.56e-10 ***
## Avg_Rooms                  -0.637495   0.750640  -0.849 0.395732    
## age                         0.038494   0.014335   2.685 0.007244 ** 
## dis                         4.252370   1.205621   3.527 0.000420 ***
## Rad_Highway_Accessibility   4.422877   0.961691   4.599 4.24e-06 ***
## Property_Tax               -0.007034   0.003246  -2.167 0.030242 *  
## Pupil_Teacher_Ratio         0.465753   0.132312   3.520 0.000431 ***
## Percent_Lower_Status        0.039874   0.054779   0.728 0.466666    
## Median_Home_Value           0.204941   0.072832   2.814 0.004895 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 187.04  on 453  degrees of freedom
## AIC: 213.04
## 
## Number of Fisher Scoring iterations: 8

The logistic regression model again estimates the relationship between the predictor variables and the likelihood of a neighborhood having a crime rate above the median, this time using the two transformed variables to see whether more normal distributions would change the fit. The significant predictors are NOx concentration, radial highway accessibility, dwelling age, distance to employment centers, pupil-teacher ratio, property tax, and median home value, with residential zoned lots again only marginally significant. The null deviance is unchanged (both models use the same response), but the residual deviance (187.04 vs. 192.05) and AIC (213.04 vs. 218.05) are lower, indicating a slightly better fit. The model also converged in fewer Fisher scoring iterations (8 vs. 9), although this reflects faster convergence rather than better fit. Overall, the model again suggests that environmental factors, socioeconomic indicators, and neighborhood characteristics play crucial roles in determining crime rates.
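
The two fits can also be compared directly on AIC and residual deviance; a small sketch using the model objects from the appendix:

# Lower AIC favors the model with the log-transformed predictors
AIC(model_1, model_2)                    # 218.05 vs. 213.04

# Reduction in residual deviance from the transformations
deviance(model_1) - deviance(model_2)    # about 5.0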

Model 3 - Stepwise Model

## Start:  AIC=218.05
## Crime_Above_Median ~ Residential_Zoned_Lots + Industry_Proportion + 
##     Charles_River + NOx_Concentration + Avg_Rooms + age + dis + 
##     Rad_Highway_Accessibility + Property_Tax + Pupil_Teacher_Ratio + 
##     Percent_Lower_Status + Median_Home_Value
## 
##                             Df Deviance    AIC
## - Avg_Rooms                  1   192.71 216.71
## - Percent_Lower_Status       1   192.77 216.77
## - Charles_River              1   193.53 217.53
## - Industry_Proportion        1   193.99 217.99
## <none>                           192.05 218.05
## - Property_Tax               1   196.59 220.59
## - Residential_Zoned_Lots     1   196.89 220.89
## - age                        1   198.73 222.73
## - Median_Home_Value          1   199.95 223.95
## - Pupil_Teacher_Ratio        1   203.32 227.32
## - dis                        1   203.84 227.84
## - Rad_Highway_Accessibility  1   233.74 257.74
## - NOx_Concentration          1   265.05 289.05
## 
## Step:  AIC=216.71
## Crime_Above_Median ~ Residential_Zoned_Lots + Industry_Proportion + 
##     Charles_River + NOx_Concentration + age + dis + Rad_Highway_Accessibility + 
##     Property_Tax + Pupil_Teacher_Ratio + Percent_Lower_Status + 
##     Median_Home_Value
## 
##                             Df Deviance    AIC
## - Charles_River              1   194.24 216.24
## - Percent_Lower_Status       1   194.32 216.32
## - Industry_Proportion        1   194.58 216.58
## <none>                           192.71 216.71
## + Avg_Rooms                  1   192.05 218.05
## - Property_Tax               1   197.59 219.59
## - Residential_Zoned_Lots     1   198.07 220.07
## - age                        1   199.11 221.11
## - Pupil_Teacher_Ratio        1   203.53 225.53
## - dis                        1   203.85 225.85
## - Median_Home_Value          1   205.35 227.35
## - Rad_Highway_Accessibility  1   233.81 255.81
## - NOx_Concentration          1   265.14 287.14
## 
## Step:  AIC=216.24
## Crime_Above_Median ~ Residential_Zoned_Lots + Industry_Proportion + 
##     NOx_Concentration + age + dis + Rad_Highway_Accessibility + 
##     Property_Tax + Pupil_Teacher_Ratio + Percent_Lower_Status + 
##     Median_Home_Value
## 
##                             Df Deviance    AIC
## - Industry_Proportion        1   195.51 215.51
## <none>                           194.24 216.24
## - Percent_Lower_Status       1   196.33 216.33
## + Charles_River              1   192.71 216.71
## + Avg_Rooms                  1   193.53 217.53
## - Residential_Zoned_Lots     1   200.59 220.59
## - Property_Tax               1   200.75 220.75
## - age                        1   201.00 221.00
## - Pupil_Teacher_Ratio        1   203.94 223.94
## - dis                        1   204.83 224.83
## - Median_Home_Value          1   207.12 227.12
## - Rad_Highway_Accessibility  1   241.41 261.41
## - NOx_Concentration          1   265.19 285.19
## 
## Step:  AIC=215.51
## Crime_Above_Median ~ Residential_Zoned_Lots + NOx_Concentration + 
##     age + dis + Rad_Highway_Accessibility + Property_Tax + Pupil_Teacher_Ratio + 
##     Percent_Lower_Status + Median_Home_Value
## 
##                             Df Deviance    AIC
## - Percent_Lower_Status       1   197.32 215.32
## <none>                           195.51 215.51
## + Industry_Proportion        1   194.24 216.24
## + Charles_River              1   194.58 216.58
## + Avg_Rooms                  1   194.86 216.86
## - Residential_Zoned_Lots     1   202.05 220.05
## - age                        1   202.23 220.23
## - Pupil_Teacher_Ratio        1   205.01 223.01
## - dis                        1   205.96 223.96
## - Property_Tax               1   206.60 224.60
## - Median_Home_Value          1   208.13 226.13
## - Rad_Highway_Accessibility  1   249.55 267.55
## - NOx_Concentration          1   270.59 288.59
## 
## Step:  AIC=215.32
## Crime_Above_Median ~ Residential_Zoned_Lots + NOx_Concentration + 
##     age + dis + Rad_Highway_Accessibility + Property_Tax + Pupil_Teacher_Ratio + 
##     Median_Home_Value
## 
##                             Df Deviance    AIC
## <none>                           197.32 215.32
## + Percent_Lower_Status       1   195.51 215.51
## + Avg_Rooms                  1   195.75 215.75
## + Charles_River              1   195.97 215.97
## + Industry_Proportion        1   196.33 216.33
## - Residential_Zoned_Lots     1   203.45 219.45
## - Pupil_Teacher_Ratio        1   206.27 222.27
## - age                        1   207.13 223.13
## - Property_Tax               1   207.62 223.62
## - dis                        1   207.64 223.64
## - Median_Home_Value          1   208.65 224.65
## - Rad_Highway_Accessibility  1   250.98 266.98
## - NOx_Concentration          1   273.18 289.18

With the stepwise logistic regression model (direction = "both"), at each step the procedure considers adding or dropping a single predictor and chooses the change that most reduces the AIC. This helps identify the predictors most relevant for predicting the probability of a crime rate above the median. The process stops when no single addition or removal lowers the AIC further. Here, the final model includes Residential_Zoned_Lots, NOx_Concentration, age, dis, Rad_Highway_Accessibility, Property_Tax, Pupil_Teacher_Ratio, and Median_Home_Value. These variables are most associated with the crime rate, as indicated by the lowest AIC value of 215.32.

## 
## Call:
## glm(formula = Crime_Above_Median ~ Residential_Zoned_Lots + NOx_Concentration + 
##     age + dis + Rad_Highway_Accessibility + Property_Tax + Pupil_Teacher_Ratio + 
##     Median_Home_Value, family = binomial, data = crimetrainingdata)
## 
## Coefficients:
##                             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)               -37.415922   6.035013  -6.200 5.65e-10 ***
## Residential_Zoned_Lots     -0.068648   0.032019  -2.144  0.03203 *  
## NOx_Concentration          42.807768   6.678692   6.410 1.46e-10 ***
## age                         0.032950   0.010951   3.009  0.00262 ** 
## dis                         0.654896   0.214050   3.060  0.00222 ** 
## Rad_Highway_Accessibility   0.725109   0.149788   4.841 1.29e-06 ***
## Property_Tax               -0.007756   0.002653  -2.924  0.00346 ** 
## Pupil_Teacher_Ratio         0.323628   0.111390   2.905  0.00367 ** 
## Median_Home_Value           0.110472   0.035445   3.117  0.00183 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 197.32  on 457  degrees of freedom
## AIC: 215.32
## 
## Number of Fisher Scoring iterations: 9

In this model, the significant predictors are NOx concentration, radial highway accessibility, residential zoned lots, dwelling age, distance to employment centers, pupil-teacher ratio, property tax, and median home value; all retained predictors are now significant at the 0.05 level. This confirms the pattern seen in the other models while narrowing the model down to the most informative variables.
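
As with the full model, the selected model can be inspected directly and its coefficients converted to odds ratios; a brief sketch using the stepwise_model object from the appendix:

# The formula retained by stepwise selection
formula(stepwise_model)

# Odds ratios for the retained predictors
round(exp(coef(stepwise_model)), 3)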

4. SELECT MODELS

Hypotheses

Null Hypothesis: There is no association between the observed variables and the reported crime rate.

Alternative Hypothesis: There is an association between at least one of the observed variables and the reported crime rate.
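
These hypotheses can be tested formally with a likelihood-ratio test against an intercept-only model, using the deviances already reported. The sketch below is illustrative; null_model is a hypothetical fit, not part of the appendix code.

# Intercept-only (null) model on the transformed training data
null_model <- glm(Crime_Above_Median ~ 1,
                  data = crimetrainingdata_transformed, family = binomial)

# Likelihood-ratio test of Model 2 against the null model
anova(null_model, model_2, test = "Chisq")

# Equivalent calculation from the reported deviances:
# 645.88 - 187.04 = 458.84 on 12 degrees of freedom
pchisq(645.88 - 187.04, df = 12, lower.tail = FALSE)   # p-value is essentially zero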

Model 1 - Confusion Matrix and ROC Curve

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases

With Model 1, we created an ROC curve. The AUC (area under the curve) is 0.951. Because it is close to 1, the logistic regression model discriminates well between the two classes: there is a 95.1% chance that a randomly chosen neighborhood with a crime rate above the median receives a higher predicted probability than a randomly chosen neighborhood below the median.
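
The ROC curves are produced with the pROC package. A minimal sketch of the calls involved, where y_test and predictions stand for the held-out outcomes and fitted probabilities computed inside the appendix helper function:

library(pROC)

# Build the ROC object (this is what emits the "Setting levels/direction" messages)
roc_obj <- roc(y_test, predictions)

auc(roc_obj)                      # area under the curve
plot(roc_obj, print.auc = TRUE)   # ROC curve annotated with the AUC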

Model 2 - Confusion Matrix and ROC Curve

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases


With Model 2, the AUC is 0.984. Being close to 1, this indicates excellent discrimination: a randomly chosen positive case is ranked above a randomly chosen negative case 98.4% of the time. Compared with Model 1, it is slightly better, which may be due to the logarithmic transformations applied to the two skewed variables.

Model 3 - Confusion Matrix and ROC Curve

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases


With Model 3, the AUC is 0.972, again indicating strong discrimination: a randomly chosen positive case is ranked above a randomly chosen negative case 97.2% of the time. This is slightly better than Model 1 but not as good as Model 2; all three models discriminate well.

Model Selection

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
##     Model       AUC Classification_Error Precision Sensitivity Specificity
## 1 Model 1 0.9509804           0.15053763 0.8627451   0.8627451   0.8333333
## 2 Model 2 0.9842593           0.09677419 0.8679245   0.9583333   0.8444444
## 3 Model 3 0.9717593           0.09677419 0.8979592   0.9166667   0.8888889

Each confusion matrix tabulates, for one model, the counts of true negatives (predicted 0, actual 0), false positives (predicted 1, actual 0), false negatives (predicted 0, actual 1), and true positives (predicted 1, actual 1). The first model produced 47 true negatives, 7 false positives, 2 false negatives, and 37 true positives; the second produced 47 true negatives, 4 false positives, 2 false negatives, and 40 true positives; and the third produced 45 true negatives, 2 false positives, 6 false negatives, and 40 true positives.

We also assembled a dataframe comparing the performance of the three models. Model 2 emerges as the top performer, with the highest AUC of 0.984, indicating the strongest ability to discriminate between classes. It also ties with Model 3 for the lowest classification error rate of 0.097, suggesting minimal misclassification. Models 2 and 3 both have high precision, at 0.868 and 0.898 respectively, reflecting accurate positive predictions. Model 2 has the highest sensitivity at 0.958, making it the most effective at correctly identifying true positives, while Model 3 has the highest specificity at 0.889, making it the most proficient at recognizing true negatives. Overall, while all three models perform strongly, Model 2 emerges as the best choice, followed by Model 3.
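
For reference, each metric in the comparison table is a simple ratio of the confusion-matrix counts. The sketch below gives the standard definitions as hypothetical helper functions (not part of the appendix; caret's confusionMatrix() reports the same quantities relative to whichever class is designated as positive via its positive argument):

# Standard definitions, treating class 1 (crime above the median) as positive;
# tn, fp, fn, tp are the four cell counts of the 2x2 confusion matrix
class_error <- function(tn, fp, fn, tp) (fp + fn) / (tn + fp + fn + tp)
prec        <- function(tn, fp, fn, tp) tp / (tp + fp)
sens        <- function(tn, fp, fn, tp) tp / (tp + fn)
spec        <- function(tn, fp, fn, tp) tn / (tn + fp)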

Conclusion

In conclusion, we conducted an in-depth analysis and modeling of datasets containing information on crime rates across various neighborhoods of a major city. Three logistic regression models were built to predict whether a neighborhood would be at risk for high crime levels, utilizing different sets of predictor variables and transformations. After evaluation using the area under the ROC curve (AUC), classification error, precision, sensitivity, and specificity, Model 2 emerged as the top performer. This model, which incorporated logarithmic transformations of the skewed variables, demonstrated the highest AUC of 0.984 and tied with Model 3 for the lowest classification error rate of 0.097. Model 2 also exhibited strong precision, sensitivity, and specificity, indicating its effectiveness in accurately predicting both positive and negative outcomes. Overall, the findings suggest that environmental factors, socioeconomic indicators, and neighborhood characteristics play significant roles in determining crime rates, and Model 2 provides the most reliable predictive performance for identifying neighborhoods at risk for higher crime levels.

Appendix

knitr::opts_chunk$set(echo=FALSE)
library(dplyr)
library(ggplot2)
library(tidyr)
library(reshape2)
library(cowplot)
library(caret)
library(pROC)
set.seed(10)  
crimeevaldata <- read.csv('https://raw.githubusercontent.com/rkasa01/DATA621_HW3/main/crime-evaluation-data_modified.csv')
head(crimeevaldata)
crimetrainingdata <-read.csv('https://raw.githubusercontent.com/rkasa01/DATA621_HW3/main/crime-training-data_modified.csv')
head(crimetrainingdata)
cat("Dimensions of the crime training dataset:\n")
dim(crimetrainingdata)

cat("\nSummary of variables in the crime training dataset:\n")
str(crimetrainingdata)
summary(crimetrainingdata)
create_histogram <- function(data, variable_name) {
  ggplot(data, aes(x = !!sym(variable_name))) +
    geom_histogram(fill = "red", color = "black", bins = 30) +
    labs(title = paste("Histogram of", variable_name), x = variable_name, y = "Frequency") +
    theme_minimal()
}
plot_histograms <- function(data) {
  variables <- names(data)[-which(names(data) == "target")]
  histograms <- lapply(variables, function(var) create_histogram(data, var))
  plot_grid(plotlist = histograms, ncol = 3)
}
plot_histograms(crimetrainingdata)
melted_data <- melt(crimetrainingdata, id.vars = "target")

ggplot(melted_data, aes(x = variable, y = value, fill = factor(target))) +
  geom_boxplot() +
  scale_y_continuous(trans = "log10", breaks = c(1, 10, 100, 1000), labels = c(1, 10, 100, 1000)) +
  labs(title = "Box Plot of Observed Variables by Neighborhood",
       x = "Variables",
       y = "Value") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
correlation_matrix <- cor(crimetrainingdata)
correlation_df <- as.data.frame(as.table(correlation_matrix))
colnames(correlation_df) <- c("Variable_1", "Variable_2", "Correlation")

ggplot(correlation_df, aes(x = Variable_1, y = Variable_2, fill = Correlation)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  labs(title = "Correlation Heatmap",
       x = "Variables",
       y = "Variables")
missing_data <- sapply(crimetrainingdata, function(x) sum(is.na(x)))
print(missing_data)
missing_percentage <- colMeans(is.na(crimetrainingdata)) * 100

ggplot(data.frame(variable = names(missing_percentage), percentage = missing_percentage), 
       aes(x = variable, y = percentage)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "black") +
  labs(title = "Percentage of Missing Data",
       x = "Variables",
       y = "Percentage")
crimetrainingdata <- crimetrainingdata %>%
  rename(Residential_Zoned_Lots = zn,
         Industry_Proportion = indus,
         Charles_River = chas,
         NOx_Concentration = nox,
         Avg_Rooms = rm,
         Rad_Highway_Accessibility = rad,
         Pupil_Teacher_Ratio = ptratio,
         Property_Tax = tax,
         Percent_Lower_Status = lstat,
         Median_Home_Value = medv,
         Crime_Above_Median = target)
head(crimetrainingdata)
crimetrainingdata_transformed <- crimetrainingdata %>%
  mutate(across(c("Rad_Highway_Accessibility", "dis"), ~log(. + 1)))
create_histogram <- function(data, variable_name) {
  ggplot(data, aes(x = !!sym(variable_name))) +
    geom_histogram(fill = "red", color = "black", bins = 30) +
    labs(title = paste("Histogram of", variable_name), x = variable_name, y = "Frequency") +
    theme_minimal()
}

hist_rad_before <- create_histogram(crimetrainingdata, "Rad_Highway_Accessibility")
hist_dis_before <- create_histogram(crimetrainingdata, "dis")

hist_rad_after <- create_histogram(crimetrainingdata_transformed, "Rad_Highway_Accessibility")
hist_dis_after <- create_histogram(crimetrainingdata_transformed, "dis")

plot_grid(hist_rad_before, hist_rad_after, hist_dis_before, hist_dis_after, ncol = 2)
original_model <- glm(Crime_Above_Median ~ ., data = crimetrainingdata, family = binomial)
summary(original_model)
model_1 <- glm(Crime_Above_Median ~ ., data = crimetrainingdata, family = binomial)
plot(model_1)
model_2 <- glm(Crime_Above_Median ~ ., data = crimetrainingdata_transformed, family = binomial)
summary(model_2)
plot(model_2)
stepwise_model <- step(model_1, direction = "both", scope = list(upper = ~ ., lower = ~ 1))
summary(stepwise_model)
plot(stepwise_model)
# Split the data, score the held-out portion with the fitted model,
# plot the ROC curve, and return the caret confusion matrix
get_performance <- function(data_frame, model, split = 0.8) {
  n <- ncol(data_frame)
  trainIndex <- createDataPartition(data_frame[, n], p = split, list = FALSE)
  data_train <- data_frame[trainIndex, ]
  data_test <- data_frame[-trainIndex, ]
  
  x_test <- data_test[, -n]
  y_test <- data_test[, n]
  predictions <- predict(model, x_test, type = 'response')
  cm <- confusionMatrix(data = as.factor(as.numeric(predictions > 0.5)), reference = as.factor(y_test))
  roc_result <- roc(y_test, predictions)
  
  plot(roc_result, print.auc = TRUE)
  return(cm)
}

# get_performance() returns a caret confusionMatrix object; its counts are in $table
model_performance <- get_performance(crimetrainingdata, model_1)
model_performance$table
model_performance <- get_performance(crimetrainingdata_transformed, model_2)
model_performance$table
model_performance <- get_performance(crimetrainingdata, stepwise_model)
model_performance$table
set.seed(10)  

# Revised helper: returns the confusion matrix table, AUC, and summary
# classification metrics instead of plotting the ROC curve
get_performance <- function(data_frame, model, split = 0.8) {
  n <- ncol(data_frame)
  trainIndex <- createDataPartition(data_frame[, n], p = split, list = FALSE)
  data_train <- data_frame[trainIndex, ]
  data_test <- data_frame[-trainIndex, ]
  
  x_test <- data_test[, -n]
  y_test <- data_test[, n]
  predictions <- predict(model, x_test, type = 'response')
  cm <- confusionMatrix(data = as.factor(as.numeric(predictions > 0.5)), reference = as.factor(y_test))
  roc_result <- roc(y_test, predictions)
  
  classification_error <- 1 - cm$overall['Accuracy']
  precision <- cm$byClass['Pos Pred Value']
  sensitivity <- cm$byClass['Sensitivity']
  specificity <- cm$byClass['Specificity']
  
  return(list(
    confusion_matrix = cm$table,
    auc = auc(roc_result),
    classification_error = classification_error,
    precision = precision,
    sensitivity = sensitivity,
    specificity = specificity
  ))
}

performance_model_1 <- get_performance(crimetrainingdata, model_1)

performance_model_2 <- get_performance(crimetrainingdata_transformed, model_2)

performance_model_3 <- get_performance(crimetrainingdata, stepwise_model)

model_comparison <- data.frame(
  Model = c("Model 1", "Model 2", "Model 3"),
  AUC = c(performance_model_1$auc, performance_model_2$auc, performance_model_3$auc),
  Classification_Error = c(performance_model_1$classification_error, 
                            performance_model_2$classification_error, 
                            performance_model_3$classification_error),
  Precision = c(performance_model_1$precision, performance_model_2$precision, performance_model_3$precision),
  Sensitivity = c(performance_model_1$sensitivity, performance_model_2$sensitivity, performance_model_3$sensitivity),
  Specificity = c(performance_model_1$specificity, performance_model_2$specificity, performance_model_3$specificity)
)

print(model_comparison)
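
# The following is a sketch, not part of the knitted code above, of how the
# selected model (Model 2) could produce the required classifications and
# probabilities for the evaluation data. The evaluation set must be renamed
# and transformed the same way as the training data.
crimeevaldata_prepared <- crimeevaldata %>%
  rename(Residential_Zoned_Lots = zn,
         Industry_Proportion = indus,
         Charles_River = chas,
         NOx_Concentration = nox,
         Avg_Rooms = rm,
         Rad_Highway_Accessibility = rad,
         Pupil_Teacher_Ratio = ptratio,
         Property_Tax = tax,
         Percent_Lower_Status = lstat,
         Median_Home_Value = medv) %>%
  mutate(across(c(Rad_Highway_Accessibility, dis), ~ log(. + 1)))

# Predicted probabilities and 0/1 classifications at a 0.5 cutoff
eval_probability    <- predict(model_2, newdata = crimeevaldata_prepared, type = "response")
eval_classification <- as.integer(eval_probability > 0.5)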