1 OBJECTIVE

To build a binary logistic regression model to predict whether the neighborhood will be at risk for high crime levels.

2 DATA EXPLORATION

2.1 Summary statistics

The data has 466 rows, with each record representing a set of information on crime for various neighborhoods of a major city.

The response variable is a binary determinant ‘target’ value with a 1 representing an above average crime rate and a 0 representing below average crime rate. The crime target has an approximate mean of 0.5 and standard deviation of 0.5. It follows a normal, uniform distribution and because it is a binary response there are no outliers. The data does not contain any missing values (see below below).

##      zn   indus    chas     nox      rm     age     dis     rad     tax 
##       0       0       0       0       0       0       0       0       0 
## ptratio   black   lstat    medv  target 
##       0       0       0       0       0

2.2 Correlations in the data

##            zn indus  chas   nox    rm   age   dis   rad   tax ptratio
## zn       1.00 -0.54 -0.04 -0.52  0.32 -0.57  0.66 -0.32 -0.32   -0.39
## indus   -0.54  1.00  0.06  0.76 -0.39  0.64 -0.70  0.60  0.73    0.39
## chas    -0.04  0.06  1.00  0.10  0.09  0.08 -0.10 -0.02 -0.05   -0.13
## nox     -0.52  0.76  0.10  1.00 -0.30  0.74 -0.77  0.60  0.65    0.18
## rm       0.32 -0.39  0.09 -0.30  1.00 -0.23  0.20 -0.21 -0.30   -0.36
## age     -0.57  0.64  0.08  0.74 -0.23  1.00 -0.75  0.46  0.51    0.26
## dis      0.66 -0.70 -0.10 -0.77  0.20 -0.75  1.00 -0.49 -0.53   -0.23
## rad     -0.32  0.60 -0.02  0.60 -0.21  0.46 -0.49  1.00  0.91    0.47
## tax     -0.32  0.73 -0.05  0.65 -0.30  0.51 -0.53  0.91  1.00    0.47
## ptratio -0.39  0.39 -0.13  0.18 -0.36  0.26 -0.23  0.47  0.47    1.00
## black    0.18 -0.36  0.04 -0.38  0.13 -0.27  0.29 -0.45 -0.44   -0.18
## lstat   -0.43  0.61 -0.05  0.60 -0.63  0.61 -0.51  0.50  0.56    0.38
## medv     0.38 -0.50  0.16 -0.43  0.71 -0.38  0.26 -0.40 -0.49   -0.52
## target  -0.43  0.60  0.08  0.73 -0.15  0.63 -0.62  0.63  0.61    0.25
##         black lstat  medv target
## zn       0.18 -0.43  0.38  -0.43
## indus   -0.36  0.61 -0.50   0.60
## chas     0.04 -0.05  0.16   0.08
## nox     -0.38  0.60 -0.43   0.73
## rm       0.13 -0.63  0.71  -0.15
## age     -0.27  0.61 -0.38   0.63
## dis      0.29 -0.51  0.26  -0.62
## rad     -0.45  0.50 -0.40   0.63
## tax     -0.44  0.56 -0.49   0.61
## ptratio -0.18  0.38 -0.52   0.25
## black    1.00 -0.35  0.33  -0.35
## lstat   -0.35  1.00 -0.74   0.47
## medv     0.33 -0.74  1.00  -0.27
## target  -0.35  0.47 -0.27   1.00

2.3 Variance Inflation Factors

##       zn    indus     chas      nox       rm      age      dis      rad 
## 2.331610 4.128586 1.091408 5.066360 2.403392 3.235041 4.256891 7.230109 
##      tax  ptratio    black    lstat     medv   target 
## 9.240312 2.033272 1.361000 3.671738 3.855976 2.648651

Using Pearson’s Correlation Coefficient, several variables were found to have a high correlation, making a regression analysis simply redundant. Columns that can be removed due to high correlations:

tax or rad (tax was selected as has a higher variance inflation factor of 9.24)
lstat (correlated with more variables than rm)
nox
age
dis

Columns with low correlations:

chas (which can be removed, as it’s a binary number with very few 1 values)
rm (except with medv and lstat, which are marked for removal below)
ptratio
black

2.4 Data Visualization

The target variable is evenly distributed between those that are above the median crime rate (1) and those that are below (0). This makes sense because the median is by definition the middle crime rate value, so we would expect to see half above and half below the median value. The access to radial highways and the proportion of industrial businesses both appear to be possibly bimodal and the proportion of large lots (zn) is strongly negatively skewed. The large variability and non-normality indicates further exploration needed and log transformation or categorical bucketing may be appropriate. The difference in proportion of blacks per town (compared to standard of 63%) is also strongly skewed to left with a maximum value of 500, a mean of 357 and a long tail.

The explanatory/predictor variables are the remaining variables that may have a positive or negative impact on the number of crimes.

3 DATA PREPARATION

3.1 Columns added

znbi
znadj
ptratiocat
blackbi
medvcat

3.2 Columns removed

tax
lstat
nox
age
dis

3.3 Unusual Observations

The leverage analysis shows some potentially disruptive elements in the data that would warrant further analysis. The QQ plot of the data suggests a bimodal distribution. These methods could be applied as support for model selection as a feedback loop towards refining a preferred model.

3.4 Diagnostic plots

zn - strong positive skew
Indus - OK
chas - OK (binomial)
rm - OK
ptratio - negative skew
black - negative skew
medv - OK / slight positive skew
rad - positive skew

3.5 Mathematical transformations

Data was transformed to either normalize or create categorical variables

#Buckets as new variable
trainingdata$medvcat <- ifelse(trainingdata$medv > 30, 1, 0) #Creating a variable by 
#categorizing median value(medv) into two categories: high and low. 

#Buckets for strong positive skew
trainingdata$znbi <- ifelse(trainingdata$zn > 0, 1, 0)

#Buckets as new variable
trainingdata$blackbi<-ifelse(trainingdata$black> 300, 1, 0)  

#Buckets as new variable
trainingdata$znadj <- ifelse(trainingdata$zn <= 0, 0, base::log(trainingdata$zn))

#Buckets as new variable
trainingdata$ptratiocat<-ifelse(trainingdata$ptratio > 18.4, 1, 0)

#Modified log transformation on negative skew
trainingdata$ptratio<-log(1+max(trainingdata$ptratio)-trainingdata$ptratio)

#Log transformation on positive skew
trainingdata$rad <- log(trainingdata$rad)


#Modified log transformation on negative skew
trainingdata$black<-log(1+max(trainingdata$black)-trainingdata$black)

#Log transformation on positive skew
trainingdata$medv <- log(trainingdata$medv)

ptratio has a skewness of 0.036048 and a kurtosis of 2.1621968, and is therefore negatively skewed. black has a skewness of 0.6064208 and a kurtosis of 2.526187, and is therefore negatively skewed. The transformations applied to the data are used to normalize the variables in terms of skewness and kurtosis: ptratio is first transformed to its squared root, then the ptratio and black values are reduced by their maximum values before a log transformation. The values for zn, rad, and med are transformed into their log values.

3.6 Create buckets

Given the massive skew in the column measuring the proportion of blacks per town (trainingdata$black), one suggestion is to create two buckets: (1) rows with <300 black, and (2) rows with >300 black

trainingdata2 <- trainingdata[which(trainingdata$black < 300),]
trainingdata3 <- trainingdata[which(trainingdata$black > 300),]

This does not fix the skew for the second bucket (>300 black) but it does help fix the skew for the lower end (<300 black):

trainingdata2$black <- sqrt(trainingdata2$black)

4 BUILD MODELS

4.1 Logit Model

The first created model includes all variables remaining in the data set, except those that are the categorical variables derived from the data.

##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept) -9.589526   2.367623 -4.0503 5.116e-05
## zn          -0.059383   0.018076 -3.2852 0.0010190
## indus        0.107844   0.031608  3.4120 0.0006449
## chas         0.805699   0.663029  1.2152 0.2242981
## rm           0.549331   0.471245  1.1657 0.2437354
## rad          2.665575   0.500575  5.3250 1.009e-07
## ptratio     -0.058772   0.401026 -0.1466 0.8834848
## black        0.394966   0.126445  3.1236 0.0017864
## medv         0.012415   1.101354  0.0113 0.9910061
## 
## n = 350 p = 9
## Deviance = 222.30558 Null Deviance = 485.02015 (Difference = 262.71457)

Discussion of model:

Variables which significantly contribute to this model are zn, indus, rad, and black. The coefficient for zn is negative, indicating in this model that the land zoned for larger plots may lower the risk of higher crime rates. The coefficient for indus is positive indicating the higher proportion of industrial businesses increase the risk of crime. Rad, a measure of access to radial highways also has a positive impact on the risk. The variable black measures the proportion of blacks per town, has a positive coefficient as well. The others included in the model such as chas, rm, ptratio and medv, while somewhat correlated with the a target variable, are not significant to this model when the other variables are included.

4.2 Logit Model 2

The second model is uses some categorical variables created in place of numerical. These categorical variables related to zn, blackbi, medvcat, ptratioca. These were included in place of some transformed variables as perhaps easier to interpret than more complex transformations.

##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept) -2.610984   2.849788 -0.9162 0.3595606
## indus        0.126117   0.031679  3.9811 6.860e-05
## chas         0.824784   0.609966  1.3522 0.1763176
## rm          -0.445733   0.400772 -1.1122 0.2660585
## rad          3.125545   0.613915  5.0912 3.559e-07
## medvcat      3.006411   0.844200  3.5613 0.0003691
## znbi         9.096712   3.463819  2.6262 0.0086342
## blackbi     -1.472282   1.129783 -1.3032 0.1925218
## znadj       -3.505107   1.157319 -3.0286 0.0024565
## ptratiocat   0.085560   0.367759  0.2327 0.8160317
## 
## n = 350 p = 10
## Deviance = 215.68241 Null Deviance = 485.02015 (Difference = 269.33774)

Discussion of model:

Variables which significantly contribute to this model are indus, rad, znbi, and medvcat. The coefficient for indus is positive indicating the higher proportion of industrial businesses increase the risk of crime. Rad, a measure of access to radial highways also has a positive impact on the risk. The variable znbi categorizes the data into those that are zoned for large lots and those that are not. A negative coefficient for this variable indicates a decrease in risk if there is any land zoned. The categorical variable medvcat is equal to 1 if the the median value is greater than 30k, the positive coefficient somewhat counterintuitively implies those above that value have increased crime risk.

The others included in the model such as chas, rm, ptratio and medv, while somewhat correlated with the a target variable, are not significant to this model when the other variables are included.

4.3 Logit Model 3

Model 3 was built using information from the first two models in order to create a model with limited number of variables that can easily be interpreted. In the first two models the amount of industry and relation to highways are both important. The presence of large lots was also important as well as the median value of the property. For easy interpretation these last two were kept as categorical variables.

##              Estimate Std. Error z value  Pr(>|z|)
## (Intercept) -6.795485   1.036697 -6.5549 5.566e-11
## indus        0.143660   0.030032  4.7835 1.723e-06
## rad          3.140398   0.568518  5.5238 3.317e-08
## znbi        -1.853077   0.518199 -3.5760 0.0003489
## medvcat      1.994872   0.563258  3.5417 0.0003976
## 
## n = 350 p = 5
## Deviance = 233.49703 Null Deviance = 485.02015 (Difference = 251.52312)

Discussion of model:

All variables included in this model were significant. The coefficients from this model align with the other models in the positive and negative influence on the target variable. As in the first two models, the coefficients for indus and rad are positive indicating the higher proportion of industrial businesses increase the risk of crime. The variable znbi categorizes the data into those that are zoned for large lots and those that are not. A negative coefficient for this variable indicates a decrease in risk if there is any land zoned. The categorical variable medvcat is equal to 1 if the the median value is greater than 30k, the positive coefficient somewhat counterintuitively implies those above that value have increased crime risk.

5 SELECT MODELS

	Model 1	Model 2	Model 3
Accuracy	0.8534	0.8448	0.819
Precision	0.7971	0.8125	0.7761
Sensitivity	0.9483	0.8966	0.9483
Specificity	0.7586	0.7931	0.7414
F1 Score	0.8661	0.8525	0.8536
CER	0.5	0.5	0.3333
AUC	0.9229	0.9249	0.911

## 
## Call:
## roc.formula(formula = factor(target) ~ model1pred, data = trainingdata)
## 
## Data: model1pred in 237 controls (factor(target) 0) < 229 cases (factor(target) 1).
## Area under the curve: 0.9229

## 
## Call:
## roc.formula(formula = factor(target) ~ model2pred, data = trainingdata)
## 
## Data: model2pred in 237 controls (factor(target) 0) < 229 cases (factor(target) 1).
## Area under the curve: 0.9249

## 
## Call:
## roc.formula(formula = factor(target) ~ model3pred, data = trainingdata)
## 
## Data: model3pred in 237 controls (factor(target) 0) < 229 cases (factor(target) 1).
## Area under the curve: 0.911

Using each of the statistical metrics, we find a near equivalence between the first and second models. We will proceed with the first of these models.

bestModel=logitModel

6 EVALUATION DATA

Run evaluation data through the selected model. Make predictions. Make sure we include appropriate transformations / calculations.

6.1 Data transformations

Evaluation data was treated to the same variable transformations as the original training data set.

6.2 Prediction

Sample of model data that resulted in a probability greater than 50% is shown below with the model probabilities as targetprob and the resulting target value.

head(evaldata.a[which(evaldata.a$target == 1),]) #Overview of the data

##    zn indus chas    rm      rad   ptratio    black     medv znadj
## 4   0  8.14    0 5.950 1.386294 0.1823216 5.107762 2.580217     0
## 14  0 19.58    0 6.101 1.609438 2.0149030 5.060948 3.218876     0
## 15  0 19.58    0 5.880 1.609438 2.0149030 3.907412 2.949688     0
## 17  0  6.20    0 6.552 2.079442 1.5686159 2.865624 3.449988     0
## 18  0  6.20    0 8.247 2.079442 1.5686159 2.941804 3.877432     0
## 28  0 18.10    0 3.561 3.178054 0.6931472 3.765840 3.314186     0
##    targetprob target
## 4   0.5721980      1
## 14  0.8893434      1
## 15  0.8181257      1
## 17  0.7864561      1
## 18  0.9063792      1
## 28  0.9863160      1

Sample of model data that resulted in a probability less than 50% is shown below with the model probabilities as targetprob and the resulting target value.

head(evaldata.a[which(evaldata.a$target == 0),]) #Overview of the data

##   zn indus chas    rm       rad   ptratio    black     medv    znadj
## 1  0  7.07    0 7.185 0.6931472 1.4816045 1.623341 3.546740 0.000000
## 2  0  8.14    0 6.096 1.3862944 0.1823216 2.883683 2.901422 0.000000
## 3  0  8.14    0 6.495 1.3862944 0.1823216 2.298577 2.912351 0.000000
## 5  0  5.96    0 5.850 1.6094379 1.0986123 0.000000 3.044522 0.000000
## 6 25  5.13    0 5.741 2.0794415 0.9162907 1.026042 2.928524 3.218876
## 7 25  5.13    0 5.966 2.0794415 0.9162907 2.986692 2.772589 3.218876
##   targetprob target
## 1  0.0805890      0
## 2  0.3767370      0
## 3  0.3739754      0
## 5  0.1869603      0
## 6  0.1920878      0
## 7  0.3680836      0

Summary of the target probablities and the resulting target value

summary(evaldata.a$targetprob) #summary of the predicted column

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000325 0.1857000 0.3821000 0.5059000 0.9818000 0.9988000

Applying our chosen model to the evaluation data we found the mean of the predicted values was 0.4 with a standard deviation of 0.4961389. This can be compared to the mean and standard deviation of the training data set used to create the model of 0.4914163 and 0.5004636. Below we show the results of our model and a test on a single row of data.

Testing the data on a single row yields a positive or negative result. A negative translates into a 1 (above average crime) or a 0 (below average crime)

predict(bestModel, evaldata.a[13,])

##         13 
## -0.4039429

The above plot gives us some insight into the efficacy of our chosen model. There exists a small, but noticeable overlap across 0, our threshold between above average and below average crime rates.

The density of the two ‘target’ values also h ighlight this change - the training data values are fairly evenly split between 0 and 1, whereas the predicted values have a higher amount of ‘0’ values. This may be an artifact of the evaluation data, or may indicate the model is optimistic.

DATA 621 - Homework 3 - Crime

Cheryl Bowersox, Christopher Martin, Robert Sellers, Edwige Talla Badjio

July 3, 2016

1 OBJECTIVE

2 DATA EXPLORATION

2.1 Summary statistics

2.2 Correlations in the data

2.3 Variance Inflation Factors

2.4 Data Visualization

3 DATA PREPARATION

3.1 Columns added

3.2 Columns removed

3.3 Unusual Observations

3.4 Diagnostic plots

3.5 Mathematical transformations

3.6 Create buckets

4 BUILD MODELS

4.1 Logit Model

4.2 Logit Model 2

4.3 Logit Model 3

5 SELECT MODELS

6 EVALUATION DATA

6.1 Data transformations

6.2 Prediction