To build a binary logistic regression model to predict whether the neighborhood will be at risk for high crime levels.
The data has 466 rows, with each record representing a set of information on crime for various neighborhoods of a major city.
The response variable is a binary determinant ‘target’ value with a 1 representing an above average crime rate and a 0 representing below average crime rate. The crime target has an approximate mean of 0.5 and standard deviation of 0.5. It follows a normal, uniform distribution and because it is a binary response there are no outliers. The data does not contain any missing values (see below below).
## zn indus chas nox rm age dis rad tax
## 0 0 0 0 0 0 0 0 0
## ptratio black lstat medv target
## 0 0 0 0 0
## zn indus chas nox rm age dis rad tax ptratio
## zn 1.00 -0.54 -0.04 -0.52 0.32 -0.57 0.66 -0.32 -0.32 -0.39
## indus -0.54 1.00 0.06 0.76 -0.39 0.64 -0.70 0.60 0.73 0.39
## chas -0.04 0.06 1.00 0.10 0.09 0.08 -0.10 -0.02 -0.05 -0.13
## nox -0.52 0.76 0.10 1.00 -0.30 0.74 -0.77 0.60 0.65 0.18
## rm 0.32 -0.39 0.09 -0.30 1.00 -0.23 0.20 -0.21 -0.30 -0.36
## age -0.57 0.64 0.08 0.74 -0.23 1.00 -0.75 0.46 0.51 0.26
## dis 0.66 -0.70 -0.10 -0.77 0.20 -0.75 1.00 -0.49 -0.53 -0.23
## rad -0.32 0.60 -0.02 0.60 -0.21 0.46 -0.49 1.00 0.91 0.47
## tax -0.32 0.73 -0.05 0.65 -0.30 0.51 -0.53 0.91 1.00 0.47
## ptratio -0.39 0.39 -0.13 0.18 -0.36 0.26 -0.23 0.47 0.47 1.00
## black 0.18 -0.36 0.04 -0.38 0.13 -0.27 0.29 -0.45 -0.44 -0.18
## lstat -0.43 0.61 -0.05 0.60 -0.63 0.61 -0.51 0.50 0.56 0.38
## medv 0.38 -0.50 0.16 -0.43 0.71 -0.38 0.26 -0.40 -0.49 -0.52
## target -0.43 0.60 0.08 0.73 -0.15 0.63 -0.62 0.63 0.61 0.25
## black lstat medv target
## zn 0.18 -0.43 0.38 -0.43
## indus -0.36 0.61 -0.50 0.60
## chas 0.04 -0.05 0.16 0.08
## nox -0.38 0.60 -0.43 0.73
## rm 0.13 -0.63 0.71 -0.15
## age -0.27 0.61 -0.38 0.63
## dis 0.29 -0.51 0.26 -0.62
## rad -0.45 0.50 -0.40 0.63
## tax -0.44 0.56 -0.49 0.61
## ptratio -0.18 0.38 -0.52 0.25
## black 1.00 -0.35 0.33 -0.35
## lstat -0.35 1.00 -0.74 0.47
## medv 0.33 -0.74 1.00 -0.27
## target -0.35 0.47 -0.27 1.00
## zn indus chas nox rm age dis rad
## 2.331610 4.128586 1.091408 5.066360 2.403392 3.235041 4.256891 7.230109
## tax ptratio black lstat medv target
## 9.240312 2.033272 1.361000 3.671738 3.855976 2.648651
Using Pearson’s Correlation Coefficient, several variables were found to have a high correlation, making a regression analysis simply redundant. Columns that can be removed due to high correlations:
Columns with low correlations:
The target variable is evenly distributed between those that are above the median crime rate (1) and those that are below (0). This makes sense because the median is by definition the middle crime rate value, so we would expect to see half above and half below the median value. The access to radial highways and the proportion of industrial businesses both appear to be possibly bimodal and the proportion of large lots (zn) is strongly negatively skewed. The large variability and non-normality indicates further exploration needed and log transformation or categorical bucketing may be appropriate. The difference in proportion of blacks per town (compared to standard of 63%) is also strongly skewed to left with a maximum value of 500, a mean of 357 and a long tail.
The explanatory/predictor variables are the remaining variables that may have a positive or negative impact on the number of crimes.
The leverage analysis shows some potentially disruptive elements in the data that would warrant further analysis. The QQ plot of the data suggests a bimodal distribution. These methods could be applied as support for model selection as a feedback loop towards refining a preferred model.
Data was transformed to either normalize or create categorical variables
#Buckets as new variable
trainingdata$medvcat <- ifelse(trainingdata$medv > 30, 1, 0) #Creating a variable by
#categorizing median value(medv) into two categories: high and low.
#Buckets for strong positive skew
trainingdata$znbi <- ifelse(trainingdata$zn > 0, 1, 0)
#Buckets as new variable
trainingdata$blackbi<-ifelse(trainingdata$black> 300, 1, 0)
#Buckets as new variable
trainingdata$znadj <- ifelse(trainingdata$zn <= 0, 0, base::log(trainingdata$zn))
#Buckets as new variable
trainingdata$ptratiocat<-ifelse(trainingdata$ptratio > 18.4, 1, 0)
#Modified log transformation on negative skew
trainingdata$ptratio<-log(1+max(trainingdata$ptratio)-trainingdata$ptratio)
#Log transformation on positive skew
trainingdata$rad <- log(trainingdata$rad)
#Modified log transformation on negative skew
trainingdata$black<-log(1+max(trainingdata$black)-trainingdata$black)
#Log transformation on positive skew
trainingdata$medv <- log(trainingdata$medv)
ptratio has a skewness of 0.036048 and a kurtosis of 2.1621968, and is therefore negatively skewed. black has a skewness of 0.6064208 and a kurtosis of 2.526187, and is therefore negatively skewed. The transformations applied to the data are used to normalize the variables in terms of skewness and kurtosis: ptratio is first transformed to its squared root, then the ptratio and black values are reduced by their maximum values before a log transformation. The values for zn, rad, and med are transformed into their log values.
Given the massive skew in the column measuring the proportion of blacks per town (trainingdata$black), one suggestion is to create two buckets: (1) rows with <300 black, and (2) rows with >300 black
trainingdata2 <- trainingdata[which(trainingdata$black < 300),]
trainingdata3 <- trainingdata[which(trainingdata$black > 300),]
This does not fix the skew for the second bucket (>300 black) but it does help fix the skew for the lower end (<300 black):
trainingdata2$black <- sqrt(trainingdata2$black)
The first created model includes all variables remaining in the data set, except those that are the categorical variables derived from the data.
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.589526 2.367623 -4.0503 5.116e-05
## zn -0.059383 0.018076 -3.2852 0.0010190
## indus 0.107844 0.031608 3.4120 0.0006449
## chas 0.805699 0.663029 1.2152 0.2242981
## rm 0.549331 0.471245 1.1657 0.2437354
## rad 2.665575 0.500575 5.3250 1.009e-07
## ptratio -0.058772 0.401026 -0.1466 0.8834848
## black 0.394966 0.126445 3.1236 0.0017864
## medv 0.012415 1.101354 0.0113 0.9910061
##
## n = 350 p = 9
## Deviance = 222.30558 Null Deviance = 485.02015 (Difference = 262.71457)
Discussion of model:
Variables which significantly contribute to this model are zn, indus, rad, and black. The coefficient for zn is negative, indicating in this model that the land zoned for larger plots may lower the risk of higher crime rates. The coefficient for indus is positive indicating the higher proportion of industrial businesses increase the risk of crime. Rad, a measure of access to radial highways also has a positive impact on the risk. The variable black measures the proportion of blacks per town, has a positive coefficient as well. The others included in the model such as chas, rm, ptratio and medv, while somewhat correlated with the a target variable, are not significant to this model when the other variables are included.
The second model is uses some categorical variables created in place of numerical. These categorical variables related to zn, blackbi, medvcat, ptratioca. These were included in place of some transformed variables as perhaps easier to interpret than more complex transformations.
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.610984 2.849788 -0.9162 0.3595606
## indus 0.126117 0.031679 3.9811 6.860e-05
## chas 0.824784 0.609966 1.3522 0.1763176
## rm -0.445733 0.400772 -1.1122 0.2660585
## rad 3.125545 0.613915 5.0912 3.559e-07
## medvcat 3.006411 0.844200 3.5613 0.0003691
## znbi 9.096712 3.463819 2.6262 0.0086342
## blackbi -1.472282 1.129783 -1.3032 0.1925218
## znadj -3.505107 1.157319 -3.0286 0.0024565
## ptratiocat 0.085560 0.367759 0.2327 0.8160317
##
## n = 350 p = 10
## Deviance = 215.68241 Null Deviance = 485.02015 (Difference = 269.33774)
Discussion of model:
Variables which significantly contribute to this model are indus, rad, znbi, and medvcat. The coefficient for indus is positive indicating the higher proportion of industrial businesses increase the risk of crime. Rad, a measure of access to radial highways also has a positive impact on the risk. The variable znbi categorizes the data into those that are zoned for large lots and those that are not. A negative coefficient for this variable indicates a decrease in risk if there is any land zoned. The categorical variable medvcat is equal to 1 if the the median value is greater than 30k, the positive coefficient somewhat counterintuitively implies those above that value have increased crime risk.
The others included in the model such as chas, rm, ptratio and medv, while somewhat correlated with the a target variable, are not significant to this model when the other variables are included.
Model 3 was built using information from the first two models in order to create a model with limited number of variables that can easily be interpreted. In the first two models the amount of industry and relation to highways are both important. The presence of large lots was also important as well as the median value of the property. For easy interpretation these last two were kept as categorical variables.
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.795485 1.036697 -6.5549 5.566e-11
## indus 0.143660 0.030032 4.7835 1.723e-06
## rad 3.140398 0.568518 5.5238 3.317e-08
## znbi -1.853077 0.518199 -3.5760 0.0003489
## medvcat 1.994872 0.563258 3.5417 0.0003976
##
## n = 350 p = 5
## Deviance = 233.49703 Null Deviance = 485.02015 (Difference = 251.52312)
Discussion of model:
All variables included in this model were significant. The coefficients from this model align with the other models in the positive and negative influence on the target variable. As in the first two models, the coefficients for indus and rad are positive indicating the higher proportion of industrial businesses increase the risk of crime. The variable znbi categorizes the data into those that are zoned for large lots and those that are not. A negative coefficient for this variable indicates a decrease in risk if there is any land zoned. The categorical variable medvcat is equal to 1 if the the median value is greater than 30k, the positive coefficient somewhat counterintuitively implies those above that value have increased crime risk.
| Model 1 | Model 2 | Model 3 | |
|---|---|---|---|
| Accuracy | 0.8534 | 0.8448 | 0.819 |
| Precision | 0.7971 | 0.8125 | 0.7761 |
| Sensitivity | 0.9483 | 0.8966 | 0.9483 |
| Specificity | 0.7586 | 0.7931 | 0.7414 |
| F1 Score | 0.8661 | 0.8525 | 0.8536 |
| CER | 0.5 | 0.5 | 0.3333 |
| AUC | 0.9229 | 0.9249 | 0.911 |
##
## Call:
## roc.formula(formula = factor(target) ~ model1pred, data = trainingdata)
##
## Data: model1pred in 237 controls (factor(target) 0) < 229 cases (factor(target) 1).
## Area under the curve: 0.9229
##
## Call:
## roc.formula(formula = factor(target) ~ model2pred, data = trainingdata)
##
## Data: model2pred in 237 controls (factor(target) 0) < 229 cases (factor(target) 1).
## Area under the curve: 0.9249
##
## Call:
## roc.formula(formula = factor(target) ~ model3pred, data = trainingdata)
##
## Data: model3pred in 237 controls (factor(target) 0) < 229 cases (factor(target) 1).
## Area under the curve: 0.911
Using each of the statistical metrics, we find a near equivalence between the first and second models. We will proceed with the first of these models.
bestModel=logitModel
Run evaluation data through the selected model. Make predictions. Make sure we include appropriate transformations / calculations.
Evaluation data was treated to the same variable transformations as the original training data set.
Sample of model data that resulted in a probability greater than 50% is shown below with the model probabilities as targetprob and the resulting target value.
head(evaldata.a[which(evaldata.a$target == 1),]) #Overview of the data
## zn indus chas rm rad ptratio black medv znadj
## 4 0 8.14 0 5.950 1.386294 0.1823216 5.107762 2.580217 0
## 14 0 19.58 0 6.101 1.609438 2.0149030 5.060948 3.218876 0
## 15 0 19.58 0 5.880 1.609438 2.0149030 3.907412 2.949688 0
## 17 0 6.20 0 6.552 2.079442 1.5686159 2.865624 3.449988 0
## 18 0 6.20 0 8.247 2.079442 1.5686159 2.941804 3.877432 0
## 28 0 18.10 0 3.561 3.178054 0.6931472 3.765840 3.314186 0
## targetprob target
## 4 0.5721980 1
## 14 0.8893434 1
## 15 0.8181257 1
## 17 0.7864561 1
## 18 0.9063792 1
## 28 0.9863160 1
Sample of model data that resulted in a probability less than 50% is shown below with the model probabilities as targetprob and the resulting target value.
head(evaldata.a[which(evaldata.a$target == 0),]) #Overview of the data
## zn indus chas rm rad ptratio black medv znadj
## 1 0 7.07 0 7.185 0.6931472 1.4816045 1.623341 3.546740 0.000000
## 2 0 8.14 0 6.096 1.3862944 0.1823216 2.883683 2.901422 0.000000
## 3 0 8.14 0 6.495 1.3862944 0.1823216 2.298577 2.912351 0.000000
## 5 0 5.96 0 5.850 1.6094379 1.0986123 0.000000 3.044522 0.000000
## 6 25 5.13 0 5.741 2.0794415 0.9162907 1.026042 2.928524 3.218876
## 7 25 5.13 0 5.966 2.0794415 0.9162907 2.986692 2.772589 3.218876
## targetprob target
## 1 0.0805890 0
## 2 0.3767370 0
## 3 0.3739754 0
## 5 0.1869603 0
## 6 0.1920878 0
## 7 0.3680836 0
Summary of the target probablities and the resulting target value
summary(evaldata.a$targetprob) #summary of the predicted column
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000325 0.1857000 0.3821000 0.5059000 0.9818000 0.9988000
Applying our chosen model to the evaluation data we found the mean of the predicted values was 0.4 with a standard deviation of 0.4961389. This can be compared to the mean and standard deviation of the training data set used to create the model of 0.4914163 and 0.5004636. Below we show the results of our model and a test on a single row of data.
Testing the data on a single row yields a positive or negative result. A negative translates into a 1 (above average crime) or a 0 (below average crime)
predict(bestModel, evaldata.a[13,])
## 13
## -0.4039429
The above plot gives us some insight into the efficacy of our chosen model. There exists a small, but noticeable overlap across 0, our threshold between above average and below average crime rates.
The density of the two ‘target’ values also h ighlight this change - the training data values are fairly evenly split between 0 and 1, whereas the predicted values have a higher amount of ‘0’ values. This may be an artifact of the evaluation data, or may indicate the model is optimistic.