We will explore, analyze and model a data set containing information on crime for various neighborhoods of a major city.
Each record has 12 predictor variables and a response variable indicating whether the crime rate is above the median crime rate (1) or not (0).
The model will be a binary logistic regression model on the training data set to predict whether the neighborhood will be at risk for high crime levels.
Binariness: The target variable is either 0 or 1. This is checked in Data Exploration.
Independence: Each observation is independent. Since each row is for a different, distinct neighborhood, they are independent observations.
Linearity of the logit: The independent variables are linearly related to the log odds:
\(logit(p(x)) = log \frac{p(x)}{1-p(x)} = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k\)
The fit of the model tests this. We use scatter plots to check the linear relationship between the continuous variables and the logit of the outcome, via the function mmps() from the car package (a minimal sketch of this check appears just after this list of assumptions).
Absence of influential outliers: There are no highly influential outliers in the variables. This is checked during data exploration and preparation with various visualizations, and again in the model results using Cook’s distance.
Absence of multicollinearity: The independent
variables are not too highly correlated with each other. This can be
tested using a correlation matrix, looking for values of ~0.8 or
higher.
Test during data exploration and preparation using corrplot() for
numeric values and other visualizations for non-numeric
values.
Unlike linear models, the logistic model does NOT require homoscedasticity or normality.
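As a minimal sketch of the linearity-of-the-logit check referenced above (illustrative only; it assumes the training data frame `train1` loaded in the next section and uses a small, hypothetical subset of predictors — the report's own checks use mmps() from the car package):

```r
# Illustrative check of linearity of the logit (hypothetical predictor subset)
fit <- glm(target ~ nox + age + dis, family = binomial, data = train1)

# Logit of the fitted probabilities
p_hat <- predict(fit, type = "response")
logit_hat <- log(p_hat / (1 - p_hat))

# A roughly linear trend in this scatter plot supports the assumption
plot(train1$nox, logit_hat, xlab = "nox", ylab = "logit of fitted probability")
```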
Here we load in the data from the CSVs which we have downloaded. The data is separated into a training set for building the model and an evaluation set for testing the model.
## Rows: 466
## Columns: 13
## $ zn <dbl> 0, 0, 0, 30, 0, 0, 0, 0, 0, 80, 22, 0, 0, 22, 0, 0, 100, 20, 0…
## $ indus <dbl> 19.58, 19.58, 18.10, 4.93, 2.46, 8.56, 18.10, 18.10, 5.19, 3.6…
## $ chas <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox <dbl> 0.605, 0.871, 0.740, 0.428, 0.488, 0.520, 0.693, 0.693, 0.515,…
## $ rm <dbl> 7.929, 5.403, 6.485, 6.393, 7.155, 6.781, 5.453, 4.519, 6.316,…
## $ age <dbl> 96.2, 100.0, 100.0, 7.8, 92.2, 71.3, 100.0, 100.0, 38.1, 19.1,…
## $ dis <dbl> 2.0459, 1.3216, 1.9784, 7.0355, 2.7006, 2.8561, 1.4896, 1.6582…
## $ rad <dbl> 5, 5, 24, 6, 3, 5, 24, 24, 5, 1, 7, 5, 24, 7, 3, 3, 5, 5, 24, …
## $ tax <dbl> 403, 403, 666, 300, 193, 384, 666, 666, 224, 315, 330, 398, 66…
## $ ptratio <dbl> 14.7, 14.7, 20.2, 16.6, 17.8, 20.9, 20.2, 20.2, 20.2, 16.4, 19…
## $ lstat <dbl> 3.70, 26.82, 18.85, 5.19, 4.82, 7.67, 30.59, 36.98, 5.68, 9.25…
## $ medv <dbl> 50.0, 13.4, 15.4, 23.7, 37.9, 26.5, 5.0, 7.0, 22.2, 20.9, 24.8…
## $ target <dbl> 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0,…
Below is a short description of the variables of interest in the data set:
| Variable | Definition |
|---|---|
| zn | proportion of residential land zoned for large lots |
| indus | proportion of non-retail business acres per suburb |
| chas | a dummy var. for whether the suburb borders the Charles River (1) or not (0) |
| nox | nitrogen oxides concentration (parts per 10 million) |
| rm | average number of rooms per dwelling |
| age | proportion of owner-occupied units built prior to 1940 |
| dis | weighted mean of distances to five Boston employment centers |
| rad | index of accessibility to radial highways |
| tax | full-value property-tax rate per $10,000 |
| ptratio | pupil-teacher ratio by town |
| lstat | lower status of the population (percent) |
| medv | median value of owner-occupied homes in $1000s |
| target | whether the crime rate is above the median crime rate (1) or not (0) |
We begin our data exploration by skimming the training data for a quick view of summary statistics, data types, and a visualization of distributions with mini histograms.
We have 466 rows with 13 columns that are all numeric (12 predictors and the target).
Note here that there are no missing values.
| Name | train1 |
| Number of rows | 466 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| numeric | 13 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| zn | 0 | 1 | 11.58 | 23.36 | 0.00 | 0.00 | 0.00 | 16.25 | 100.00 | ▇▁▁▁▁ |
| indus | 0 | 1 | 11.11 | 6.85 | 0.46 | 5.15 | 9.69 | 18.10 | 27.74 | ▇▆▁▇▁ |
| chas | 0 | 1 | 0.07 | 0.26 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| nox | 0 | 1 | 0.55 | 0.12 | 0.39 | 0.45 | 0.54 | 0.62 | 0.87 | ▇▇▅▃▁ |
| rm | 0 | 1 | 6.29 | 0.70 | 3.86 | 5.89 | 6.21 | 6.63 | 8.78 | ▁▂▇▂▁ |
| age | 0 | 1 | 68.37 | 28.32 | 2.90 | 43.88 | 77.15 | 94.10 | 100.00 | ▂▂▂▃▇ |
| dis | 0 | 1 | 3.80 | 2.11 | 1.13 | 2.10 | 3.19 | 5.21 | 12.13 | ▇▅▂▁▁ |
| rad | 0 | 1 | 9.53 | 8.69 | 1.00 | 4.00 | 5.00 | 24.00 | 24.00 | ▇▂▁▁▃ |
| tax | 0 | 1 | 409.50 | 167.90 | 187.00 | 281.00 | 334.50 | 666.00 | 711.00 | ▇▇▅▁▇ |
| ptratio | 0 | 1 | 18.40 | 2.20 | 12.60 | 16.90 | 18.90 | 20.20 | 22.00 | ▁▃▅▅▇ |
| lstat | 0 | 1 | 12.63 | 7.10 | 1.73 | 7.04 | 11.35 | 16.93 | 37.97 | ▇▇▅▂▁ |
| medv | 0 | 1 | 22.59 | 9.24 | 5.00 | 17.02 | 21.20 | 25.00 | 50.00 | ▂▇▅▁▁ |
| target | 0 | 1 | 0.49 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
Confirming that our target variable has values of only 0 or 1 by showing a table of the counts of all values.
## target
## 0 1
## 237 229
The histograms in skim are small and some detail is lost, so we take a closer look at some of the continuous variables.
The most striking imbalance in the histograms is that for % Large Lots (zn). Let’s dive in further and see just how many rows have 0 here. In the table below (TRUE indicates zn = 0), notice that 339 of the rows (73%) have 0% zoned for large lots.
## Mode FALSE TRUE
## logical 127 339
We also notice in the histograms that:
- % NonRetail Business (indus) is bimodal
- % Built <1940 (age) is skewed left
- Dist to Employment (dis) is skewed right
- Tax Rate (tax) is multimodal
- Pupil/Teacher (ptratio) is skewed left
Box Plots of the predictor variables.
- Crime Rate Below Median in blue
- Crime Rate Above Median in orange
These show:
- where the variables are quite distinct for target=0 vs target=1: %
NonRetail Business (indus) and Nitrogen Oxide (nox)
- where there is little difference for target=0 vs target=1: Borders
Charles River (chas)
- there may be outliers to be addressed: Tax Rate (tax) and
Pupil/Teacher (ptratio)
At this point, we would like to take a closer look at Borders Charles River (chas). As we can see in the table below, 93% of neighborhoods do not border the Charles River.
##
## chas 0 1 Total
## 93 7 100
And if we look by target, we see 95% do not border on the Charles when the neighborhood has a crime rate below median, and 91% do not border when the crime rate is above median.
## chas
## target 0 1 Total
## 0 95 5 100
## 1 91 9 100
The jitter plots show the “1s” (above median crime rate) in orange along the top and the “0s” (below median crime rate) in blue along the bottom. It is jittered both in width and height, so it does not show exact amounts, but it supplements the box plots in showing how the target (crime rate) and each of the continuous variables relate.
In the % Large Lots (zn) jitter, one can more easily see that around 0% and 25% Large Lots, there are neighborhoods both above and below the median crime rate. Above ~25% Large Lots, however, all neighborhoods are below the median crime rate.
Similarly, in the % NonRetail Business (indus) jitter, below ~15% one can see both above and below median crime rate, then between ~15 and ~25% it is almost all above median crime rate (with 1 exception) and above ~25% the neighborhoods are all below the median crime rate.
The Borders Charles River jitter shows the four-corner pattern that results from plotting 2 binary variables, and that there is not much difference between the neighborhoods above and below the median crime rate.
The Nitrogen Oxide (nox) jitter shows that for neighborhoods where the crime rate is below the median, the concentration of nitrogen oxide stays below ~0.65, while neighborhoods where the crime rate is above the median reach higher nitrogen oxide levels.
Summarizing the other plots:
- Avg Rooms (rm): below-median-crime neighborhoods average between ~5 and ~8 rooms, while above-median-crime neighborhoods scatter from ~4 to ~9.
- % Built <1940 (age): pretty evenly scattered for neighborhoods below the median crime rate, but leaning toward a higher percentage for neighborhoods above the median crime rate.
- Dist to Employment (dis): neighborhoods with crime rates above the median cluster at ~5 or below, with two exceptions.
- Index: Access to Hwys (rad): most neighborhoods are below ~8, while a few are in the 25 range - and those are all above the median crime rate.
- Tax Rate (tax): most neighborhoods are in the ~200 to 400 range, with a few (both above and below the median crime rate) in the 700 range.
- Pupil/Teacher (ptratio): the jitters above and below the median crime rate span roughly the same range, with perhaps more “step-ness” in those above.
- Lower Status (lstat): a wider range for neighborhoods above the median crime rate, from about 0 to about 40, whereas most of those below the median crime rate fall below 20 (with a few exceptions).
- Med Home Value (medv): neighborhoods with a crime rate above the median cluster more toward the lower values, but with a long tail in the same range as those with crime rates below the median.
To find outliers, we will use the scores() function from the outliers package, with type set to “iqr” and lim = 1.5. This will identify any rows with values that are more than 1.5 IQR below Q1 or more than 1.5 IQR above Q3.
We will create flags for each variable to identify potential outliers. Deciding how these will be handled will happen in the Data Preparation Section.
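For illustration, the 1.5 × IQR rule that scores() applies can be written out manually as follows (a sketch only; `flag_iqr` is a hypothetical helper, and the analysis itself uses outliers::scores() as shown in the appendix):

```r
# Manual version of the 1.5*IQR rule (illustrative; the report uses outliers::scores())
flag_iqr <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75))
  iqr <- q[2] - q[1]
  x < (q[1] - k * iqr) | x > (q[2] + k * iqr)  # TRUE flags a potential outlier
}

# Example: potential outliers in % Large Lots (zn)
table(flag_iqr(train1$zn))
```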
Using this method, we created:
- znOUT: Potential Outliers in % Large Lots (zn)
- chasOUT: Potential Outliers in Borders Charles River (chas)
- rmOUT: Potential Outliers in Avg Rooms (rm)
- disOUT: Potential Outliers in Dist to Employment (dis)
- lstatOUT: Potential Outliers in Lower Status (lstat)
- medvOUT: Potential Outliers in Med Home Value (medv)
The table below shows the resulting values (TRUE is a potential outlier).
Note that this method did not identify potential outliers for the remaining variables.
## znOUT chasOUT rmOUT disOUT lstatOUT medvOUT
## FALSE 418 433 438 461 460 428
## TRUE 48 33 28 5 6 38
Before we move to Data Preparation, we can use corrplot() to see what
might have a significant correlation with target, or with another
variable. We will look only at the original variables, not those created
to flag potential outliers.
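The correlation plot is produced roughly as follows (mirroring the appendix code; the outlier flags created above are excluded first):

```r
library(corrplot)

# Correlation matrix of the original variables only (exclude the *OUT flags)
M <- cor(train1[grep('OUT', names(train1), invert = TRUE)])
corrplot.mixed(M, tl.srt = 45, tl.cex = 0.5, tl.col = 'black', number.cex = 0.5)
```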
- Borders Charles River (chas) is not very strongly correlated to target
(or much else). As we saw above, 93% do not border the Charles.
- Avg Rooms (rm) is not highly correlated with target, but it does correlate strongly with Med Home Value (medv) and has an inverse correlation with Lower Status (lstat).
- In general, there is lots of positive and negative correlation among
the variables, so we should look for multicollinearity.
- Nitrogen Oxide (nox) has the strongest positive correlation to the
target, followed by % Built <1940 (age), Index: Access to Hwys (rad),
Dist to Employment (dis) and % NonRetail Business (indus).
- % Large Lots (zn) has the strongest negative correlation.
After exploring our data we move on to preparing it to be more useful for our goal of creating a model to predict if the crime rate is above the median or not.
Note that we have already determined that there are no missing values.
Note also that we created flags to identify potential outliers.
We will create a new dataframe, train2, for these transformations, so we can compare them to the original when we model.
From the exploration above, it looks like it might be valuable to take some of the continuous variables and make them into categorical ones.
We are interested in trying this in two different ways:
Zone Option 1 - a boolean - FALSE if 0% is zoned for large lots (zn=0),
otherwise TRUE. As the table shows, this results in 339 rows with FALSE
and 127 rows with TRUE.
## Mode FALSE TRUE
## logical 339 127
Zone Option 2 - a factor, where the values are grouped into ranges. As this table shows, most values (339) are in the ‘0’ bin, with 54, 31, and 42 in the ‘<=25%’, ‘26-50%’ and ‘51-100%’, respectively.
## 0 <=25% 26-50% 51-100%
## 339 54 31 42
We will create a variable (indus1) that groups % NonRetail Business (indus) into three categories: low, med and high, with the distribution shown in the table below.
## low med high
## 143 150 173
Similarly, we create new variable (tax1) that groups Tax Rate (tax) into three categories: low, med and high, with the distribution shown in the table below.
## low med high
## 154 186 126
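The binning itself is done with cut(); a minimal sketch for the tax variable (the breakpoints mirror the appendix code and are a judgment call rather than something dictated by the data):

```r
# Bin Tax Rate into low/med/high (breakpoints mirror the appendix code)
train2$tax1 <- cut(train2$tax, breaks = c(0, 300, 600, 800),
                   labels = c('low', 'med', 'high'))
table(train2$tax1)
```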
Below we visualize these new categories, grouped by whether the crime rate is above or below the median (target).
Looking at the set of outlier flags created above, we want to calculate how many outliers each row has.
As the table shows, there is 1 row with 4 outliers and there are 5 rows with 3 outliers (of the 12 original predictor variables, 6 have potential-outlier flags). For these 6 rows, either a third (4/12) or a quarter (3/12) of the predictor variables are outliers. We will eliminate these 6 rows.
## sumOUT
## 0 1 2 3 4
## 351 79 30 5 1
As we saw above, Borders Charles River (chas) shows that 93% of the neighborhoods do not border the Charles, and this variable shows little correlation with any of the others, including target. We will remove it, along with the outlier flag associated with it.
Avg Rooms (rm) has a strong positive correlation with Med Home Value (medv) and a negative correlation with Lower Status (lstat). We take out Avg Rooms (rm), since it has a weaker correlation with target than the other two. We will also remove its outlier flag.
While we are at it, we will also remove the sumOUT field used to identify rows with multiple outliers.
To see where we are with the numeric fields, we re-run the correlation plot. The only difference from the correlations in the previous section is that Med Home Value (medv) has a slightly lower correlation with target and Nitrogen Oxide (nox) a slightly higher one.
The first model we will tackle is a binary logistic regression model built from the untransformed data (the original data set). This serves as a baseline to determine whether our transformations have the desired effect on the regression model and strengthen it.
We will create the model, and then use stepAIC() with direction=both to refine it.
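A minimal sketch of this step, mirroring the appendix code (stepAIC() comes from the MASS package):

```r
library(MASS)

# Baseline model on the original (untransformed) training data
model1 <- glm(target ~ ., family = binomial, data = train1)

# Stepwise selection in both directions, minimizing AIC
model1.2 <- stepAIC(model1, direction = 'both', trace = 0)
summary(model1.2)
```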
##
## Call:
## glm(formula = target ~ zn + nox + age + dis + rad + tax + ptratio +
## medv, family = binomial, data = train1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -37.415922 6.035013 -6.200 5.65e-10 ***
## zn -0.068648 0.032019 -2.144 0.03203 *
## nox 42.807768 6.678692 6.410 1.46e-10 ***
## age 0.032950 0.010951 3.009 0.00262 **
## dis 0.654896 0.214050 3.060 0.00222 **
## rad 0.725109 0.149788 4.841 1.29e-06 ***
## tax -0.007756 0.002653 -2.924 0.00346 **
## ptratio 0.323628 0.111390 2.905 0.00367 **
## medv 0.110472 0.035445 3.117 0.00183 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 197.32 on 457 degrees of freedom
## AIC: 215.32
##
## Number of Fisher Scoring iterations: 9
Model 1 starts with using all predictor variables and an AIC value of 218.05.
Next, it removes Avg Rooms (rm) and results in an AIC value of 216.71.
It goes through 3 more iterations, and ends up including the following variables: zn, nox, age, dis, rad, tax, ptratio and medv. The result has an AIC value of 215.32, a BIC of 252.62, an \(AIC_c\) of 215.72 and a pseudo \(R^2\) of 0.69.
Recall that in logistic regression, \(R^2\) does not have the same interpretation as in linear regression. Here we are using McFadden’s \(R^2\). This is not the percentage of variance explained by the logistic model, but rather a ratio indicating how close the fit is to perfect versus worst; if the model is a good fit, the value will be close to 1.
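Concretely, the value reported here is the deviance ratio (this is what the r2Log() helper in the appendix computes), using the deviances from the model summary above:

\[R^2_{McFadden} = 1 - \frac{D_{residual}}{D_{null}} = 1 - \frac{197.32}{645.88} \approx 0.69\]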
We can see here that the null deviance (with only the intercept) is 645.88, whereas the residual deviance with the selected variables is 197.32. This is still high, but certainly much closer to 0 than the null deviance.
Exponentiating the intercept (-37.415922) gives a vanishingly small odds value (a decimal that starts with 16 zeros). In other words, when all predictors are zero, the odds of having a crime rate above the median are essentially zero; almost all of the odds come from the predictor variables.
Looking at the Marginal Model Plots, we see good fits between the data and the model, with some issues with % Large Lots (zn) on the left, and a potential impact of outliers on Index: Access to Hwys (rad) and Tax Rate (tax).
Model 2 will use the train2 data set with all the transformations above, except we will remove all the outlier flags. As with Model 1, we will then use stepAIC() with direction=both to refine it.
##
## Call:
## glm(formula = target ~ zn + indus + nox + dis + tax + lstat +
## medv + zn2 + indus1 + tax1, family = binomial, data = train2)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.400e+01 7.354e+00 -4.624 3.77e-06 ***
## zn 1.226e+00 2.400e-01 5.106 3.29e-07 ***
## indus -9.584e-01 1.852e-01 -5.175 2.28e-07 ***
## nox 5.755e+01 1.030e+01 5.587 2.31e-08 ***
## dis 8.573e-01 3.060e-01 2.802 0.005080 **
## tax -2.692e-02 6.682e-03 -4.029 5.61e-05 ***
## lstat 1.883e-01 6.847e-02 2.750 0.005954 **
## medv 1.881e-01 6.144e-02 3.061 0.002204 **
## zn2<=25% -2.359e+01 4.705e+00 -5.015 5.30e-07 ***
## zn226-50% -6.807e+01 3.087e+03 -0.022 0.982410
## zn251-100% -1.338e+02 1.746e+03 -0.077 0.938922
## indus1med 8.022e+00 1.486e+00 5.398 6.75e-08 ***
## indus1high 2.018e+01 3.620e+00 5.575 2.48e-08 ***
## tax1med 4.378e+00 1.051e+00 4.166 3.11e-05 ***
## tax1high 1.320e+01 3.395e+00 3.887 0.000101 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 637.62 on 459 degrees of freedom
## Residual deviance: 109.78 on 445 degrees of freedom
## AIC: 139.78
##
## Number of Fisher Scoring iterations: 20
Model 2 starts with using all predictor variables in our transformed data and an AIC value of 144.88. This is already an improvement over the untransformed data, which started at 218.05 and ended with 215.32.
Next, it removes zn1, the Boolean we created from % Large Lots (zn), and results in an AIC value of 143.04.
It goes through 2 more iterations, and ends up including the following variables: zn, indus, nox, dis, tax, lstat, medv, zn2, indus1 and tax1. The final result has an AIC value of 139.78, a BIC of 201.75, an \(AIC_c\) of 140.86 and a pseudo \(R^2\) of 0.83.
We can see here the Null deviance (with only the intercept) has a deviance of 637.62 whereas the Residual deviance with all the variables is 109.78.
All of this shows improvements over Model 1.
Exponentiating the intercept (-34.00) gives a vanishingly small odds value (a decimal that starts with 14 zeros). In other words, when all predictors are at zero or their reference levels, the odds of having a crime rate above the median are essentially zero; almost all of the odds come from the predictor variables.
In the Marginal Model Plots we see generally good fits between the data and the model, with some separation of the data and model lines on the left in % NonRetail Business (indus). We also see the same issue with % Large Lots (zn) on the left that we saw in Model 1, and a potential impact of outliers on Tax Rate (tax).
The coefficients and confidence intervals in Model 2 can be interpreted as follows. We will use % of Large Lots (zn) in our example, but the same structure applies to all other predictor variables.
A 1 unit increase in % of Large Lots (zn) increases the log-odds of the crime rate being above the median (success) vs the crime rate being below the median (failure) by 1.226, the zn coefficient. Since exp(1.226) is about 3.41, this corresponds to roughly a 241% increase in the odds.
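The arithmetic behind that interpretation, using the zn coefficient from the summary above (the full odds-ratio table further below is produced with exp(coef(...)) in the appendix):

```r
# Odds-ratio arithmetic for zn in Model 2 (coefficient taken from the summary above)
b_zn <- 1.226
exp(b_zn)               # odds ratio, roughly 3.41
(exp(b_zn) - 1) * 100   # roughly a 241% increase in the odds per unit of zn
```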
We can say with 95% confidence that the odds ratio at the intercept, \(\beta_0\), lies in an interval far below 1, which rounds to (0, 0): at the baseline (all predictors at zero or at their reference levels), target=1 is far less likely than target=0.
Note that we are rounding very, very small numbers here. The actual CI is:
(0.0000000000000000000002007277, 0.0000000009124242)
% of Large Lots (zn) has a significant effect on the probability: each unit increase in % of Large Lots (zn) multiplies the odds of target=1 by a factor between 2.255561 and 6.014934 (95% CI).
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 1.711957e-15 2.007277e-22 9.124242e-10
## zn 3.406073e+00 2.255561e+00 6.014934e+00
## indus 3.835002e-01 2.535931e-01 5.291857e-01
## nox 9.890647e+24 1.047587e+17 5.687067e+34
## dis 2.356786e+00 1.320708e+00 4.463804e+00
## tax 9.734384e-01 9.597484e-01 9.855687e-01
## lstat 1.207229e+00 1.059689e+00 1.390679e+00
## medv 1.206931e+00 1.077263e+00 1.374185e+00
## zn2<=25% 5.662022e-11 6.619600e-16 1.659356e-07
## zn226-50% 2.741229e-30 0.000000e+00 1.330340e+22
## zn251-100% 7.702101e-59 0.000000e+00 4.527671e-30
## indus1med 3.047015e+03 2.176195e+02 8.033263e+04
## indus1high 5.816216e+08 1.047436e+06 1.838360e+12
## tax1med 7.964049e+01 1.186983e+01 7.713198e+02
## tax1high 5.385323e+05 1.217929e+03 9.981182e+08
In the confusion matrix below we show:
- 223 accurately predicted 0s (crime not above median) AKA True
Negatives and 9 inaccurate 0s (AKA False Negatives - Type 2
error).
- 218 accurately predicted 1s (crime above median) AKA True Positives
and 10 inaccurate 1s (AKA False Positives - Type 1 error).
##
## pred 0 pred 1
## obs 0 223 10
## obs 1 9 218
Using the Confusion Matrix, we calculate the accuracy of our model is 95.87%. That is, the model is able to classify 95.87% of all observations.
Sensitivity (aka recall or true positive rate) indicates how often the model correctly predicts TRUE out of all actual TRUE events. For Model 2 it is 96.04%.
Specificity, or the true negative rate, indicates how often the model correctly predicts FALSE out of all actual FALSE events. For Model 2 it is 95.71%.
The classification error rate for Model 2 is 4.13%, which, as expected, is the complement of the accuracy (100% minus accuracy).
Precision, which indicates how often the predicted TRUE values are actually TRUE is 95.61% for Model 2.
The F-Score is the harmonic mean of sensitivity (recall or true positive rate) and precision. Its value lies between 0 and 1: a value of 1 represents a perfect balance between precision and recall, while 0 represents the worst case. For Model 2, the F-Score is 0.96.
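All of the metrics above can be reproduced from the confusion-matrix counts (a sketch mirroring the calculations in the appendix):

```r
# Counts from the Model 2 confusion matrix above
TN <- 223; FP <- 10; FN <- 9; TP <- 218

accuracy    <- (TP + TN) / (TP + TN + FP + FN)   # ~0.9587
sensitivity <- TP / (TP + FN)                    # ~0.9604 (recall / TPR)
specificity <- TN / (TN + FP)                    # ~0.9571 (TNR)
precision   <- TP / (TP + FP)                    # ~0.9561
f_score     <- 2 * precision * sensitivity / (precision + sensitivity)  # ~0.96
```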
The area under the ROC curve is an index of accuracy. The ROC curve plots the true positive rate against the false positive rate (1 - specificity) across classification thresholds. The higher the area under the curve (AUC), the better the predictive power of the model. An AUC of 1 indicates a perfect predictive model, and an AUC greater than 0.70 indicates a good model.
For Model 2, we have an AUC of 0.99.
##
## Call:
## roc.formula(formula = target ~ model2$fitted.values, data = train2, plot = TRUE, main = "ROC CURVE", col = "#6aaec8")
##
## Data: model2$fitted.values in 233 controls (target 0) < 227 cases (target 1).
## Area under the curve: 0.9901
We will use Cook’s distance to find influential values. Not all outliers are influential observations. To check whether the data contain potential influential observations, the standardized residuals can be inspected. Data points with an absolute standardized residual above 3 represent possible outliers and may deserve closer attention.
When we plot Cook’s Distance, we identify the top 3 distances, as shown below, observations 14, 356 and 451. But when we filter for rows with an absolute standardized residuals above 3, we get 0 rows, so there are no influential outliers.
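A minimal sketch of this check, mirroring the appendix code (broom::augment() supplies the standardized residuals):

```r
library(broom)

# Cook's distance plot with the 3 largest values labeled
plot(model2, which = 4, id.n = 3)

# Rows whose absolute standardized residual exceeds 3 (none here)
cooks2 <- augment(model2)
cooks2[abs(cooks2$.std.resid) > 3, ]
```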
We use the Variance Inflation Factor (VIF) to check for multicollinearity.
Note that in the table below we renamed the column GVIF^(1/(2*Df)) as ‘result’ to make it easier to reference when we graph it.
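The VIF table below is produced roughly as follows (mirroring the appendix; vif() comes from the car package):

```r
library(car)

# Generalized VIFs for the stepped Model 2; third column is GVIF^(1/(2*Df))
vif2 <- vif(model2.2)
colnames(vif2)[3] <- 'result'   # rename for easier plotting later
as.data.frame(vif2)
```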
% Large Lots (zn) stands out as problematic. Note that the model includes both the original field and the binned factor created from it (zn2). This will be addressed in Model 3.
## GVIF Df result
## zn 69.640547 1 8.345091
## indus 22.160892 1 4.707536
## nox 5.926183 1 2.434375
## dis 4.601422 1 2.145093
## tax 6.136864 1 2.477269
## lstat 2.351916 1 1.533596
## medv 3.532100 1 1.879388
## zn2 61.224581 3 1.985276
## indus1 65.927776 2 2.849490
## tax1 7.695479 2 1.665555
We have read some warnings about using the automated stepwise functions, so we thought we would try a model where we choose predictor variables to include by hand, using the information in the Data Exploration, Data Preparation and Model 2 Results section above. We also want to address the multicollinearity found with % Large Lots (zn) and the created field zn2 in Model 2.
After many manual iterations of adding and subtracting variables, we land on the fields nox, tax, tax1, indus, indus1, rad, zn, and of course target.
We also convert zn to a Boolean that is TRUE when the percentage zoned for large lots is 25 or greater.
##
## Call:
## glm(formula = target ~ ., family = binomial, data = train3)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.310e+01 2.915e+00 -4.494 6.97e-06 ***
## nox 3.272e+01 5.970e+00 5.480 4.26e-08 ***
## tax -2.184e-02 5.212e-03 -4.190 2.79e-05 ***
## tax1med 2.727e+00 8.253e-01 3.304 0.000954 ***
## tax1high 4.495e+00 9.872e+00 0.455 0.648871
## indus -6.129e-01 1.607e-01 -3.815 0.000136 ***
## indus1med 3.538e+00 1.138e+00 3.109 0.001879 **
## indus1high 1.242e+01 3.160e+00 3.931 8.47e-05 ***
## rad 4.951e-01 2.264e-01 2.187 0.028768 *
## znTRUE -1.668e+01 1.016e+03 -0.016 0.986910
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 637.62 on 459 degrees of freedom
## Residual deviance: 155.06 on 450 degrees of freedom
## AIC: 175.06
##
## Number of Fisher Scoring iterations: 18
Model 3 has an AIC of 175.06, a BIC of 216.37, an \(AIC_c\) of 175.55 and a pseudo \(R^2\) of 0.76.
We can see here the Null deviance (with only the intercept) has a deviance of 637.62 whereas the Residual deviance with all the variables is 155.06.
All of this shows improvements over Model 1, but not over Model 2. However, Model 2 had some problematic multicollinearity, which makes its results suspect.
Exponentiating the intercept (-13.10) gives a very small odds value (a decimal that starts with 5 zeros). In other words, when all predictors are at zero or their reference levels, the odds of having a crime rate above the median are essentially zero; almost all of the odds come from the predictor variables.
Looking at the Marginal Model Plots, we see generally good fits between the data and the model, with some separation of the data and model lines on the left in % NonRetail Business (indus). % Large Lots (zn) is now a Boolean.
## Error in names(dat) <- object$term :
## 'names' attribute [1] must be the same length as the vector [0]
## Error in names(dat) <- object$term :
## 'names' attribute [1] must be the same length as the vector [0]
The coefficients and confidence intervals in Model 3 can be interpreted as follows. We will use Index: Access to Hwys (rad) in our example, but the same structure applies to all other predictor variables.
Coefficients:
A 1 unit increase in Index: Access to Hwys (rad) increases the log-odds of the crime rate being above the median (success) vs the crime rate being below the median (failure) by 0.495, the rad coefficient. Since exp(0.495) is about 1.64, this corresponds to roughly a 64% increase in the odds.
Confidence Interval (CI):
We can say with 95% confidence that the odds ratio at the intercept, \(\beta_0\), lies in an interval far below 1, which rounds to (0, 0): at the baseline (all predictors at zero or at their reference levels), target=1 is far less likely than target=0.
Note that we are rounding very, very small numbers here. The actual CI is:
( 0.000000002864240, 0.0003353086)
Index: Access to Hwys (rad) has a significant effect on the probability: each unit increase in Index: Access to Hwys (rad) multiplies the odds of target=1 by a factor between 1.131544 and 2.789003 (95% CI).
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 2.043993e-06 2.864240e-09 3.353086e-04
## nox 1.616852e+14 6.231542e+09 1.626052e+20
## tax 9.783997e-01 9.675599e-01 9.878811e-01
## tax1med 1.528011e+01 3.263914e+00 8.585769e+01
## tax1high 8.959064e+01 2.443073e-04 7.459579e+05
## indus 5.417722e-01 3.864839e-01 7.286074e-01
## indus1med 3.438522e+01 4.177222e+00 3.748291e+02
## indus1high 2.479456e+05 7.469731e+02 1.986877e+08
## rad 1.640590e+00 1.131544e+00 2.789003e+00
## znTRUE 5.724841e-08 1.485107e-137 3.237840e+14
In the confusion matrix below we show:
- 218 accurately predicted 0s (crime not above median) AKA True
Negatives and 11 inaccurate 0s (AKA False Negatives - Type 2
error).
- 216 accurately predicted 1s (crime above median) AKA True Positives
and 15 inaccurate 1s (AKA False Positives - Type 1 error).
##
## pred 0 pred 1
## obs 0 218 15
## obs 1 11 216
From the confusion matrix, the accuracy of our model is 94.35%. That is, the model is able to classify 94.35% of all observations correctly.
Sensitivity (aka recall or true positive rate) indicates how often the model correctly predicts TRUE out of all actual TRUE events. For Model 3 it is 95.15%.
Specificity, or the true negative rate, indicates how often the model correctly predicts FALSE out of all actual FALSE events. For Model 3 it is 93.56%.
The classification error rate for Model 3 is 5.65%, which, as expected, is the complement of the accuracy (100% minus accuracy).
Precision, which indicates how often the predicted TRUE values are actually TRUE is 93.51% for Model 3.
The F-Score is the harmonic mean of sensitivity (recall or true positive rate) and precision. Its value lies between 0 and 1: a value of 1 represents a perfect balance between precision and recall, while 0 represents the worst case. For Model 3, the F-Score is 0.94.
The area under the ROC curve is an index of accuracy. The ROC curve plots the true positive rate against the false positive rate (1 - specificity) across classification thresholds. The higher the area under the curve (AUC), the better the predictive power of the model. An AUC of 1 indicates a perfect predictive model, and an AUC greater than 0.70 indicates a good model.
For Model 3, we have an AUC of 0.98.
##
## Call:
## roc.formula(formula = target ~ model3$fitted.values, data = train3, plot = TRUE, main = "ROC CURVE", col = "#6aaec8")
##
## Data: model3$fitted.values in 233 controls (target 0) < 227 cases (target 1).
## Area under the curve: 0.9798
We will use Cook’s distance to find influential values. Not all outliers are influential observations. To check whether the data contain potential influential observations, the standardized residuals can be inspected. Data points with an absolute standardized residual above 3 represent possible outliers and may deserve closer attention.
When we plot Cook’s Distance, we identify the top 3 distances, as shown below, observations 14, 56 and 257. But when we filter for rows with an absolute standardized residuals above 3, we get 0 rows, so there are no influential outliers.
We again use the Variance Inflation Factor (VIF) to check for multicollinearity.
Recall that in the table below we renamed the column GVIF^(1/(2*Df)) as ‘result’ to make it easier to reference when we graph it.
As you can see, the changes in the model have addressed the multicollinearity found in Model 2.
## GVIF Df result
## nox 2.061370 1 1.435747
## tax 3.676886 1 1.917521
## tax1 3.732575 2 1.389959
## indus 22.036334 1 4.694287
## indus1 41.603662 2 2.539703
## rad 1.928874 1 1.388839
## zn 1.000000 1 1.000000
Though Model 2 showed “better” numbers in the results, there was a multicollinearity issue with % Large Lots (zn), and the logistic model assumes an absence of multicollinearity. Model 3 addressed this issue, and its results, though somewhat weaker, still showed a valid model. Therefore, we selected Model 3.
We apply the same transformations on the evaluation data that ended up being used in our selected model:
- bins for Tax Rate (tax)
- bins for % NonRetail Business (indus)
- transform % Large Lots (zn) into a Boolean that is TRUE when the percentage is 25 or greater
We then classify the predicted probabilities using a 0.5 threshold, as sketched below.
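A minimal sketch of the scoring step (mirroring the appendix code):

```r
# Predicted probabilities from the selected model on the transformed evaluation data
eval_prob <- predict(model3, eval1, type = 'response')

# Classify using the 0.5 threshold
eval_results <- ifelse(eval_prob > 0.5, 1, 0)
```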
See attached csv for results.
library(tidyverse)
library(skimr) #skim
library(corrplot) #corrplot
library(ggpubr) #ggarrange
library(MASS) #stepAIC
library(car) #vif()
library(pROC) #for AUC
library(broom) #augment()
library(gamlr) #AICc()
library(outliers) #testing for outliers
library(qdapTools) #boolean table of outliers
library(tigerstats) #rowPerc
#load training data
train1 <- read_csv("/Users/dianaplunkett/CUNY/621 Business Analytics and Data Mining/Homework 3/crime-training-data_modified.csv", show_col_types = FALSE)
#load evaluation data
eval1 <- read_csv("/Users/dianaplunkett/CUNY/621 Business Analytics and Data Mining/Homework 3/crime-evaluation-data_modified.csv", show_col_types = FALSE)
#take a peek at the data
glimpse(train1)
# 1.Data Exploration
#Data Skimming
skim(train1)
#Binariness
#table showing count of values (0,1) for variable target
xtabs (~target, train1)
#Histograms
#Skim histos are small, so looking at some a bit closer
p1 <- train1 %>%
ggplot (aes(x=zn))+
geom_histogram(binwidth=1, fill ="#6aaec8")+
labs(x = "% Large Lots") +
theme_classic()
p2 <- train1 %>%
ggplot (aes(x=indus))+
geom_histogram(binwidth=5, fill ="#6aaec8")+
labs(x = "% NonRetail Business") +
theme_classic()
p3 <- train1 %>%
ggplot (aes(x=age))+
geom_histogram(binwidth=5, fill ="#6aaec8")+
labs(x = "% Built <1940") +
theme_classic()
p4 <- train1 %>%
ggplot (aes(x=dis))+
geom_histogram(binwidth=1, fill ="#6aaec8")+
labs(x = "Dist to Employment") +
theme_classic()
p5 <- train1 %>%
ggplot (aes(x=tax))+
geom_histogram(binwidth=25, fill ="#6aaec8")+
labs(x = "Tax Rate") +
theme_classic()
p6 <- train1 %>%
ggplot (aes(x=ptratio))+
geom_histogram(binwidth=1, fill ="#6aaec8")+
labs(x = "Pupil/Teacher") +
theme_classic()
ggarrange(p1,p2,p3,p4,p5,p6)
#A quick table to see just how many rows in zn are 0
summary(if_else(
train1$zn ==0,TRUE, FALSE))
#Box Plots
theColors <- c('#6aaec8','#ef8300')
par(mar = c(3, 4, 1, 2))
par(mfrow = c(3,4))
boxplot(zn~target, ylab="% Large Lots",
col=theColors, data = train1)
boxplot(indus~target, ylab="% NonRetail Business",
col=theColors,data = train1)
boxplot(chas~target, ylab="Borders Charles River",
col=theColors,data = train1)
boxplot(nox~target, ylab="Nitrogen Oxide",
col=theColors,data = train1)
boxplot(rm~target, ylab="Avg Rooms",
col=theColors,data = train1)
boxplot(age~target, ylab="% Built <1940",
col=theColors,data = train1)
boxplot(dis~target, ylab="Dist to Employment",
col=theColors,data = train1)
boxplot(rad~target, ylab="Index: Access to Hwys",
col=theColors,data = train1)
boxplot(tax~target, ylab="Tax Rate",
col=theColors,data = train1)
boxplot(ptratio~target, ylab="Pupil/Teacher",
col=theColors,data = train1)
boxplot(lstat~target, ylab="Lower Status",
col=theColors,data = train1)
legend(-.75,2, legend=c("Below Med", "Above Med"), col=theColors,lwd = 5, xpd = TRUE, horiz = FALSE, cex = .75, seg.len=.25, bty = 'n', title="Crime Rate")
boxplot(medv~target, ylab="Med Home Value",
col=theColors,data = train1)
#table showing count of values (0,1) for variable chas
round(rowPerc(xtabs (~chas, train1)),0)
#table showing row percentages of chas by target
round(rowPerc(xtabs (~target+chas, train1)),0)
#Jitter plots
p7 <- ggplot(train1, aes(zn, target, color=ifelse(target==0,"#6aaec8","#ef8300")))+
geom_jitter(width = 0.1, height=0.1, alpha=0.3)+
scale_color_identity()+
labs(x = "% Large Lots") +
theme_classic(base_size = 8)
p8 <- ggplot(train1, aes(indus, target, color=ifelse(target==0,"#6aaec8","#ef8300")))+
geom_jitter(width = 0.1, height=0.1, alpha=0.3)+
scale_color_identity()+
labs(x = "% NonRetail Business") +
theme_classic(base_size = 8)
p9 <- ggplot(train1, aes(chas, target, color=ifelse(target==0,"#6aaec8","#ef8300")))+
geom_jitter(width = 0.1, height=0.1, alpha=0.3)+
scale_color_identity()+
labs(x = "Borders Charles River") +
theme_classic(base_size = 8)
p10 <- ggplot(train1, aes(nox, target, color=ifelse(target==0,"#6aaec8","#ef8300")))+
geom_jitter(width = 0.1, height=0.1, alpha=0.3)+
scale_color_identity()+
labs(x = "Nitrogen Oxide") +
theme_classic(base_size = 8)
p11 <- ggplot(train1, aes(rm, target, color=ifelse(target==0,"#6aaec8","#ef8300")))+
geom_jitter(width = 0.1, height=0.1, alpha=0.3)+
scale_color_identity()+
labs(x = "Avg Rooms") +
theme_classic(base_size = 8)
p12 <- ggplot(train1, aes(age, target, color=ifelse(target==0,"#6aaec8","#ef8300")))+
geom_jitter(width = 0.1, height=0.1, alpha=0.3)+
scale_color_identity()+
labs(x = "% Built <1940") +
theme_classic(base_size = 8)
p13<- ggplot(train1, aes(dis, target, color=ifelse(target==0,"#6aaec8","#ef8300")))+
geom_jitter(width = 0.1, height=0.1, alpha=0.3)+
scale_color_identity()+
labs(x = "Dist to Employment") +
theme_classic(base_size = 8)
p14<- ggplot(train1, aes(rad, target, color=ifelse(target==0,"#6aaec8","#ef8300")))+
geom_jitter(width = 0.1, height=0.1, alpha=0.3)+
scale_color_identity()+
labs(x = "Index: Access to Hwys") +
theme_classic(base_size = 8)
p15<-ggplot(train1, aes(tax, target, color=ifelse(target==0,"#6aaec8","#ef8300")))+
geom_jitter(width = 0.1, height=0.1, alpha=0.3)+
scale_color_identity()+
labs(x = "Tax Rate") +
theme_classic(base_size = 8)
p16 <- ggplot(train1, aes(ptratio, target, color=ifelse(target==0,"#6aaec8","#ef8300")))+
geom_jitter(width = 0.1, height=0.1, alpha=0.3)+
scale_color_identity()+
labs(x = "Pupil/Teacher") +
theme_classic(base_size = 8)
p17<-ggplot(train1, aes(lstat, target, color=ifelse(target==0,"#6aaec8","#ef8300")))+
geom_jitter(width = 0.1, height=0.1, alpha=0.3)+
scale_color_identity()+
labs(x = "Lower Status") +
theme_classic(base_size = 8)
p18<-ggplot(train1, aes(medv, target, color=ifelse(target==0,"#6aaec8","#ef8300")))+
geom_jitter(width = 0.1, height=0.1, alpha=0.3)+
scale_color_identity()+
labs(x = "Med Home Value")+
theme_classic(base_size = 8)
ggarrange(p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18
#,common.legend = TRUE <- this doesn't work
)
#Outliers
#create new variable to flag potential outliers
train1$znOUT<-scores(train1$zn, type='iqr', lim=1.5)
train1$chasOUT<-scores(train1$chas, type='iqr', lim=1.5)
train1$rmOUT<-scores(train1$rm, type='iqr', lim=1.5)
train1$disOUT<-scores(train1$dis, type='iqr', lim=1.5)
train1$lstatOUT<-scores(train1$lstat, type='iqr', lim=1.5)
train1$medvOUT<-scores(train1$medv, type='iqr', lim=1.5)
#using library(qdapTools) from qdap package to show table with the values of all the OUT variables
t(mtabulate(train1[grep('OUT',names(train1))]))
# Correlation
# want to exclude the new OUTlier variables.
M <- train1[grep('OUT',names(train1), invert = TRUE)]%>%cor()
corrplot.mixed(M, tl.srt = 45, tl.cex = 0.5, na.label = "square", na.label.col = "lightgrey", tl.col = 'black', number.cex = 0.5)
#Data Prep
#make a new df so we can run models on original and transformed data
train2 <- train1
#zn1 is the boolean option for zn
train2 <- train2 %>% mutate (zn1 = if_else(
zn ==0,FALSE, TRUE))
summary(train2$zn1)
#zn2 is the binned option for zn
train2$zn2 <- cut(train2$zn, c(-1,1,25,50,100), labels=c('0','<=25%','26-50%','51-100%'))
summary(train2$zn2)
#bin % NonRetail Business
train2$indus1 <- cut(train2$indus, c(0,6,18,30), labels = c('low','med', 'high'))
summary(train2$indus1)
#bin tax rate
train2$tax1 <- cut(train2$tax, c(0,300,600,800),
labels = c('low','med', 'high')
)
summary(train2$tax1)
#group & viz for zn1
zn1 <- train2 %>% group_by(zn1, target) %>% summarise(counts=n())
p19 <- ggplot(zn1, aes(fill=as.factor(target), x = zn1, y = counts, )) +
geom_bar(position="dodge", stat='identity' ) +
scale_fill_manual(values = c("#6aaec8", "#ef8300"), labels=c('Below Med', 'Above Med'))+
labs(fill='Crime Rate', x='Zoned for large lots?')+
theme_classic(base_size = 8)+
theme(legend.position="none")
#group & viz for zn2
zn2 <- train2 %>% group_by(zn2, target) %>% summarise(counts=n())
p20 <- ggplot(zn2, aes(fill=as.factor(target), x = zn2, y = counts, )) +
geom_bar(position="dodge", stat='identity' ) +
scale_fill_manual(values = c("#6aaec8", "#ef8300"), labels=c('Below Med', 'Above Med'))+
labs(fill='Crime', x=' % Zoned for Large Lots')+
theme_classic(base_size = 8)+
theme(legend.position="none")
#group & viz for indus1
indus1 <- train2 %>% group_by(indus1, target) %>% summarise(counts=n())
p21 <- ggplot(indus1, aes(fill=as.factor(target),x= indus1, y = counts, )) +
geom_bar(position="dodge", stat='identity' ) +
scale_fill_manual(values = c("#6aaec8", "#ef8300"), labels=c('Below Med', 'Above Med'))+
labs(fill='Crime', x=' % NonRetail Business')+
theme_classic(base_size = 8)+
theme(legend.position="none")
#group & viz for tax1
tax1 <- train2 %>% group_by(tax1, target) %>% summarise(counts=n())
p22 <- ggplot(tax1, aes(fill=as.factor(target), x = tax1, y = counts, )) +
geom_bar(position="dodge", stat='identity' ) +
scale_fill_manual(values = c("#6aaec8", "#ef8300"), labels=c('Below Med', 'Above Med'))+
labs(fill='Crime', x=' Tax Rate')+
theme_classic(base_size = 8)
#plot all 4 new category variables
ggarrange(p19, p20, p21,p22)
#calc the number of outliers in each row and save value in sumOUT
train2$sumOUT <-rowSums(train2[grep('OUT',names(train2))])
xtabs (~sumOUT, train2)
#eliminate rows with outliers in 3 or more fields.
train2 <-subset(train2,sumOUT<=2)
train2<-subset(train2,select=-c(chas,chasOUT,rm,rmOUT, sumOUT))
# Correlation
# want to exclude the OUTlier variables and the factors we created, as they are not numeric
M2 <- select_if(train2, is.numeric)%>%cor()
corrplot.mixed(M2, tl.srt = 45, tl.cex = 0.5, na.label = "square", na.label.col = "lightgrey", tl.col = 'black', number.cex = 0.5)
#Note that first we have to remove the outlier flags
train1<-train1[grep('OUT',names(train1), invert = TRUE)]
#model1 on original data
model1 <- glm(target ~ ., family = binomial, data = train1)
#model1.2 using stepAIC on model1 (original data)
model1.2 <- stepAIC(model1, direction='both',trace=0)
summary(model1.2)
BIC(model1.2)
AICc(model1.2)
# Computation of the R^2 with a function
r2Log <- function(model) {
summaryLog <- summary(model)
1 - summaryLog$deviance / summaryLog$null.deviance
}
r2Log(model1.2)
#Looking at the logit of odds of the intercept (-37.415922)
format(exp(-37.415922), scientific = FALSE)
mmps(model1.2)
#Note that first we have to remove the outlier flags
train2<-train2[grep('OUT',names(train2), invert = TRUE)]
#model2 on transformed data
model2 <- glm(target ~ ., family = binomial, data = train2)
summary(model2)
#model2.2 using stepAIC on model2 (transformed data)
model2.2 <- stepAIC(model2, direction='both', trace = 0)
summary(model2.2)
BIC(model2.2)
AICc(model2.2)
r2Log(model2.2)
#Looking at the logit of odds of the intercept (-34.00)
format(exp(-34.00), scientific = FALSE)
#Marginal Model Plots
mmps(model2.2)
#coefficients and confidence intervals
exp(cbind(OR = coef(model2.2), confint(model2.2)))
#Confusion Matrix
#classify the prediction as 1 (crime above med) if the fitted value exceeds 0.5 otherwise 0 (crime below).
train2$predict2 <- ifelse(model2.2$fitted.values >0.5,1,0)
model2table <-table(train2$target, train2$predict2)
rownames(model2table)<-c('obs 0', 'obs 1')
colnames(model2table)<-c('pred 0', 'pred 1')
model2table
# Accuracy
(eff2<- sum(diag(model2table))/sum(model2table)*100)
# Sensitivity
(TPR2<- (model2table[2,2]/sum(model2table[2,]))*100)
# Specificity
(TNR2<- (model2table[1,1]/sum(model2table[1,]))*100)
# Classification Error Rate
cer2 <- 100 - eff2 #eff2=accuracy
# Precision
PREC2<- (model2table[2,2]/sum(model2table[,2]))*100
# F-Score
F_Score2 <- (2 * PREC2 * TPR2 / (PREC2 + TPR2))/100
# ROC Curve to get AUC
roc(target~model2$fitted.values, data = train2, plot = TRUE, main = "ROC CURVE", col= "#6aaec8")
#Influential values
#Cook's Distance
plot(model2, which = 4, id.n=3)
# Extract model results
cooks2 <- augment(model2) %>%
mutate(index = 1:n())
#filter for rows with an absolute standardized residuals above 3
cooks2 %>% filter(abs(.std.resid) > 3)
# Multicollinearity
vif2<-vif(model2.2)
colnames(vif2)[3]<-'result'
vif2<-as.data.frame(vif2)
(vif2)
barplot(vif2$result, names=rownames(vif2), main = "VIF Values", cex.names =.75, col = "#6aaec8")
#add horizontal line at 5
abline(h = 5, lwd = 3, lty = 2)
train3 <- subset(train2,
select=c(nox, tax,tax1, indus, indus1,rad, zn,target))
#update zn to a Boolean option based on <25
train3 <- train3 %>% mutate (zn = if_else(
zn <25,FALSE, TRUE))
model3 <- glm(target ~ ., family = binomial, data = train3)
summary(model3)
BIC(model3)
AICc(model3)
r2Log(model3)
#Looking at the logit of odds of the intercept (-13.10)
format(exp(-13.10), scientific = FALSE)
#Marginal Model Plots
mmps(model3)
#coefficients and confidence intervals
exp(cbind(OR = coef(model3), confint(model3)))
#Confusion Matrix
#classify the prediction as 1 (crime above med) if the fitted value exceeds 0.5 otherwise 0 (crime below).
train3$predict3 <- ifelse(model3$fitted.values >0.5,1,0)
model3table <-table(train3$target, train3$predict3)
rownames(model3table)<-c('obs 0', 'obs 1')
colnames(model3table)<-c('pred 0', 'pred 1')
model3table
# Accuracy
(eff3<- sum(diag(model3table))/sum(model3table)*100)
# Sensitivity
(TPR3<- (model3table[2,2]/sum(model3table[2,]))*100)
# Specificity
(TNR3<- (model3table[1,1]/sum(model3table[1,]))*100)
# Classification Error Rate
(cer3 <- 100 - eff3) #eff3=accuracy
# Precision
PREC3<- (model3table[2,2]/sum(model3table[,2]))*100
# F-Score
F_Score3 <- (2 * PREC3 * TPR3 / (PREC3 + TPR3))/100
# ROC Curve to get AUC
roc(target~model3$fitted.values, data = train3, plot = TRUE, main = "ROC CURVE", col= "#6aaec8")
#Influential values
#Cook's Distance
plot(model3, which = 4, id.n=3)
# Extract model results
cooks3 <- augment(model3) %>%
mutate(index = 1:n())
#filter for rows with an absolute standardized residuals above 3
cooks3 %>% filter(abs(.std.resid) > 3)
# Multicollinearity
vif3<-vif(model3)
colnames(vif3)[3]<-'result'
vif3<-as.data.frame(vif3)
(vif3)
barplot(vif3$result, names=rownames(vif3), main = "VIF Values", cex.names =.75, col = "#6aaec8", ylim = c(0, 5))
#add horizontal line at 5
abline(h = 5, lwd = 3, lty = 2)
#transform the data as needed for model
#bin tax rate
eval1$tax1 <- cut(eval1$tax, c(0,300,600,800),
labels = c('low','med', 'high')
)
#bin % NonRetail Business
eval1$indus1 <- cut(eval1$indus, c(0,6,18,30), labels = c('low','med', 'high'))
#update zn to a Boolean option based on <25
eval1 <- eval1 %>% mutate (zn = if_else(
zn <25,FALSE, TRUE))
#predict
eval_results <- predict(model3, eval1, type ='response')
#adjust using 0.5 threshold
eval_results <- ifelse(eval_results > 0.5, 1, 0)
#write csv for submission
write.csv(as.data.frame(eval_results), "/Users/dianaplunkett/CUNY/621 Business Analytics and Data Mining/Homework 3/HW3_GRP5.csv", row.names=TRUE)