We will explore, analyze and model a data set containing information on crime for various neighborhoods of a major city.
Each record has 12 predictor variables and a response variable indicating whether the crime rate is above the median crime rate (1) or not (0).
The model will be a binary logistic regression model on the training data set to predict whether the neighborhood will be at risk for high crime levels.
Binariness: The target variable is either 0 or 1. This is checked in Data Exploration.
Independence: Each observation is independent. Since each row is for a different, distinct neighborhood, they are independent observations.
Linearity of the logit: The independent variables are linearly related to the log odds:
\(logit(p(x)) = log \frac{p(x)}{1-p(x)} = \beta_0 + \beta_1 x_1 + \ldots + \beta_k x_k\)
The fit of the model tests this. We use scatter plots to check the linear relationship between the continuous variables and the logit of the outcome, via the function mmps() from the car package (a minimal sketch of this check appears just after this list of assumptions).
Absence of influential outliers: There are no highly influential outliers in the variables. This is checked during data exploration and preparation with various visualizations, and again in the model results using Cook’s distance.
Absence of multicollinearity: The independent
variables are not too highly correlated with each other. This can be
tested using a correlation matrix, looking for values of ~0.8 or
higher.
Test during data exploration and preparation using corrplot() for
numeric values and other visualizations for non-numeric
values.
Unlike linear models, the logistic model does NOT require homoscedasticity or normality.
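As a minimal sketch of the linearity-of-the-logit check referenced above (illustrative only; it assumes the training data frame `train1` loaded in the next section and uses a small, hypothetical subset of predictors — the report's own checks use mmps() from the car package):

```r
# Illustrative check of linearity of the logit (hypothetical predictor subset)
fit <- glm(target ~ nox + age + dis, family = binomial, data = train1)

# Logit of the fitted probabilities
p_hat <- predict(fit, type = "response")
logit_hat <- log(p_hat / (1 - p_hat))

# A roughly linear trend in this scatter plot supports the assumption
plot(train1$nox, logit_hat, xlab = "nox", ylab = "logit of fitted probability")
```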
Here we load in the data from the CSVs which we have downloaded. The data is separated into a training set for building the model and an evaluation set for testing the model.
## Rows: 466
## Columns: 13
## $ zn <dbl> 0, 0, 0, 30, 0, 0, 0, 0, 0, 80, 22, 0, 0, 22, 0, 0, 100, 20, 0…
## $ indus <dbl> 19.58, 19.58, 18.10, 4.93, 2.46, 8.56, 18.10, 18.10, 5.19, 3.6…
## $ chas <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox <dbl> 0.605, 0.871, 0.740, 0.428, 0.488, 0.520, 0.693, 0.693, 0.515,…
## $ rm <dbl> 7.929, 5.403, 6.485, 6.393, 7.155, 6.781, 5.453, 4.519, 6.316,…
## $ age <dbl> 96.2, 100.0, 100.0, 7.8, 92.2, 71.3, 100.0, 100.0, 38.1, 19.1,…
## $ dis <dbl> 2.0459, 1.3216, 1.9784, 7.0355, 2.7006, 2.8561, 1.4896, 1.6582…
## $ rad <dbl> 5, 5, 24, 6, 3, 5, 24, 24, 5, 1, 7, 5, 24, 7, 3, 3, 5, 5, 24, …
## $ tax <dbl> 403, 403, 666, 300, 193, 384, 666, 666, 224, 315, 330, 398, 66…
## $ ptratio <dbl> 14.7, 14.7, 20.2, 16.6, 17.8, 20.9, 20.2, 20.2, 20.2, 16.4, 19…
## $ lstat <dbl> 3.70, 26.82, 18.85, 5.19, 4.82, 7.67, 30.59, 36.98, 5.68, 9.25…
## $ medv <dbl> 50.0, 13.4, 15.4, 23.7, 37.9, 26.5, 5.0, 7.0, 22.2, 20.9, 24.8…
## $ target <dbl> 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0,…
Below is a short description of the variables of interest in the data set:
| Variable | Definition |
|---|---|
| zn | proportion of residential land zoned for large lots |
| indus | proportion of non-retail business acres per suburb |
| chas | a dummy var. for whether the suburb borders the Charles River (1) or not (0) |
| nox | nitrogen oxides concentration (parts per 10 million) |
| rm | average number of rooms per dwelling |
| age | proportion of owner-occupied units built prior to 1940 |
| dis | weighted mean of distances to five Boston employment centers |
| rad | index of accessibility to radial highways |
| tax | full-value property-tax rate per $10,000 |
| ptratio | pupil-teacher ratio by town |
| lstat | lower status of the population (percent) |
| medv | median value of owner-occupied homes in $1000s |
| target | whether the crime rate is above the median crime rate (1) or not (0) |
We begin our data exploration by skimming the training data for a quick view of summary statistics, data types, and a visualization of distributions with mini histograms.
We have 466 rows with 13 columns that are all numeric (12 predictors and the target).
Note here that there are no missing values.
| Name | train1 |
| Number of rows | 466 |
| Number of columns | 13 |
| _______________________ | |
| Column type frequency: | |
| numeric | 13 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| zn | 0 | 1 | 11.58 | 23.36 | 0.00 | 0.00 | 0.00 | 16.25 | 100.00 | ▇▁▁▁▁ |
| indus | 0 | 1 | 11.11 | 6.85 | 0.46 | 5.15 | 9.69 | 18.10 | 27.74 | ▇▆▁▇▁ |
| chas | 0 | 1 | 0.07 | 0.26 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ▇▁▁▁▁ |
| nox | 0 | 1 | 0.55 | 0.12 | 0.39 | 0.45 | 0.54 | 0.62 | 0.87 | ▇▇▅▃▁ |
| rm | 0 | 1 | 6.29 | 0.70 | 3.86 | 5.89 | 6.21 | 6.63 | 8.78 | ▁▂▇▂▁ |
| age | 0 | 1 | 68.37 | 28.32 | 2.90 | 43.88 | 77.15 | 94.10 | 100.00 | ▂▂▂▃▇ |
| dis | 0 | 1 | 3.80 | 2.11 | 1.13 | 2.10 | 3.19 | 5.21 | 12.13 | ▇▅▂▁▁ |
| rad | 0 | 1 | 9.53 | 8.69 | 1.00 | 4.00 | 5.00 | 24.00 | 24.00 | ▇▂▁▁▃ |
| tax | 0 | 1 | 409.50 | 167.90 | 187.00 | 281.00 | 334.50 | 666.00 | 711.00 | ▇▇▅▁▇ |
| ptratio | 0 | 1 | 18.40 | 2.20 | 12.60 | 16.90 | 18.90 | 20.20 | 22.00 | ▁▃▅▅▇ |
| lstat | 0 | 1 | 12.63 | 7.10 | 1.73 | 7.04 | 11.35 | 16.93 | 37.97 | ▇▇▅▂▁ |
| medv | 0 | 1 | 22.59 | 9.24 | 5.00 | 17.02 | 21.20 | 25.00 | 50.00 | ▂▇▅▁▁ |
| target | 0 | 1 | 0.49 | 0.50 | 0.00 | 0.00 | 0.00 | 1.00 | 1.00 | ▇▁▁▁▇ |
Confirming that our target variable has values of only 0 or 1 by showing a table of the counts of all values.
## target
## 0 1
## 237 229
The histograms in skim are small and some detail is lost, so we take a closer look at some of the continuous variables.
The most striking imbalance in the histograms is that for % Large Lots (zn). Let’s dive in further and see just how many rows have 0 here. In the table below (TRUE indicates zn = 0), notice that 339 of the rows (73%) have 0% zoned for large lots.
## Mode FALSE TRUE
## logical 127 339
We also notice in the histograms that:
- % NonRetail Business (indus) is bimodal
- % Built <1940 (age) is skewed left
- Dist to Employment (dis) is skewed right
- Tax Rate (tax) is multimodal
- Pupil/Teacher (ptratio) is skewed left
Box Plots of the predictor variables.
- Crime Rate Below Median in blue
- Crime Rate Above Median in orange
These show:
- where the variables are quite distinct for target=0 vs target=1: %
NonRetail Business (indus) and Nitrogen Oxide (nox)
- where there is little difference for target=0 vs target=1: Borders
Charles River (chas)
- there may be outliers to be addressed: Tax Rate (tax) and
Pupil/Teacher (ptratio)
At this point, we would like to take a closer look at Borders Charles River (chas). As we can see in the table below, 93% of neighborhoods do not border the Charles River.
##
## chas 0 1 Total
## 93 7 100
And if we look by target, we see 95% do not border on the Charles when the neighborhood has a crime rate below median, and 91% do not border when the crime rate is above median.
## chas
## target 0 1 Total
## 0 95 5 100
## 1 91 9 100
The jitter plots show the “1s” (above median crime rate) in orange along the top and the “0s” (below median crime rate) in blue along the bottom. It is jittered both in width and height, so it does not show exact amounts, but it supplements the box plots in showing how the target (crime rate) and each of the continuous variables relate.
In the % Large Lots (zn) jitter, one can more easily see that around 0% and 25% Large Lots, there are neighborhoods both above and below the median crime rate. Above ~25% Large Lots, however, all neighborhoods are below the median crime rate.
Similarly, in the % NonRetail Business (indus) jitter, below ~15% one can see both above and below median crime rate, then between ~15 and ~25% it is almost all above median crime rate (with 1 exception) and above ~25% the neighborhoods are all below the median crime rate.
The Borders Charles River jitter shows the four-corner pattern that results from plotting 2 binary variables, and that there is not much difference between the neighborhoods above and below the median crime rate.
The Nitrogen Oxide (nox) jitter shows that for neighborhoods where the crime rate is below the median, the concentration of nitrogen oxide stays below ~0.65, while neighborhoods where the crime rate is above the median reach higher nitrogen oxide levels.
Summarizing the other plots:
- Avg Rooms (rm): below-median-crime neighborhoods average between ~5 and ~8 rooms, while above-median-crime neighborhoods scatter from ~4 to ~9.
- % Built <1940 (age): pretty evenly scattered for neighborhoods below the median crime rate, but leaning toward a higher percentage for neighborhoods above the median crime rate.
- Dist to Employment (dis): neighborhoods with crime rates above the median cluster at ~5 or below, with two exceptions.
- Index: Access to Hwys (rad): most neighborhoods are below ~8, while a few are in the 25 range - and those are all above the median crime rate.
- Tax Rate (tax): most neighborhoods are in the ~200 to 400 range, with a few (both above and below the median crime rate) in the 700 range.
- Pupil/Teacher (ptratio): the jitters above and below the median crime rate span roughly the same range, with perhaps more “step-ness” in those above.
- Lower Status (lstat): a wider range for neighborhoods above the median crime rate, from about 0 to about 40, whereas most of those below the median crime rate fall below 20 (with a few exceptions).
- Med Home Value (medv): neighborhoods with a crime rate above the median cluster more toward the lower values, but with a long tail in the same range as those with crime rates below the median.
To find outliers, we will use the scores() function from the outliers package, with type set to “iqr” and lim = 1.5. This will identify any rows with values that are more than 1.5 IQR below Q1 or more than 1.5 IQR above Q3.
We will create flags for each variable to identify potential outliers. Deciding how these will be handled will happen in the Data Preparation Section.
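For illustration, the 1.5 × IQR rule that scores() applies can be written out manually as follows (a sketch only; `flag_iqr` is a hypothetical helper, and the analysis itself uses outliers::scores() as shown in the appendix):

```r
# Manual version of the 1.5*IQR rule (illustrative; the report uses outliers::scores())
flag_iqr <- function(x, k = 1.5) {
  q <- quantile(x, probs = c(0.25, 0.75))
  iqr <- q[2] - q[1]
  x < (q[1] - k * iqr) | x > (q[2] + k * iqr)  # TRUE flags a potential outlier
}

# Example: potential outliers in % Large Lots (zn)
table(flag_iqr(train1$zn))
```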
Using this method, we created:
- znOUT: Potential Outliers in % Large Lots (zn)
- chasOUT: Potential Outliers in Borders Charles River (chas)
- rmOUT: Potential Outliers in Avg Rooms (rm)
- disOUT: Potential Outliers in Dist to Employment (dis)
- lstatOUT: Potential Outliers in Lower Status (lstat)
- medvOUT: Potential Outliers in Med Home Value (medv)
The table below shows the resulting values (TRUE is a potential outlier).
Note that this method did not identify potential outliers for the remaining variables.
## znOUT chasOUT rmOUT disOUT lstatOUT medvOUT
## FALSE 418 433 438 461 460 428
## TRUE 48 33 28 5 6 38
Before we move to Data Preparation, we can use corrplot() to see what
might have a significant correlation with target, or with another
variable. We will look only at the original variables, not those created
to flag potential outliers.
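The correlation plot is produced roughly as follows (mirroring the appendix code; the outlier flags created above are excluded first):

```r
library(corrplot)

# Correlation matrix of the original variables only (exclude the *OUT flags)
M <- cor(train1[grep('OUT', names(train1), invert = TRUE)])
corrplot.mixed(M, tl.srt = 45, tl.cex = 0.5, tl.col = 'black', number.cex = 0.5)
```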
- Borders Charles River (chas) is not very strongly correlated to target
(or much else). As we saw above, 93% do not border the Charles.
- Avg Rooms (rm) is not highly correlated with target, but it does correlate strongly with Med Home Value (medv) and has an inverse correlation with Lower Status (lstat).
- In general, there is lots of positive and negative correlation among
the variables, so we should look for multicollinearity.
- Nitrogen Oxide (nox) has the strongest positive correlation to the
target, followed by % Built <1940 (age), Index: Access to Hwys (rad),
Dist to Employment (dis) and % NonRetail Business (indus).
- % Large Lots (zn) has the strongest negative correlation.
After exploring our data we move on to preparing it to be more useful for our goal of creating a model to predict if the crime rate is above the median or not.
Note that we have already determined that there are no missing values.
Note also that we created flags to identify potential outliers.
We will create a new dataframe, train2, for these transformations, so we can compare them to the original when we model.
From the exploration above, it looks like it might be valuable to take some of the continuous variables and make them into categorical ones.
We are interested in trying this in two different ways:
Zone Option 1 - a boolean - FALSE if 0% is zoned for large lots (zn=0),
otherwise TRUE. As the table shows, this results in 339 rows with FALSE
and 127 rows with TRUE.
## Mode FALSE TRUE
## logical 339 127
Zone Option 2 - a factor, where the values are grouped into ranges. As this table shows, most values (339) are in the ‘0’ bin, with 54, 31, and 42 in the ‘<=25%’, ‘26-50%’ and ‘51-100%’, respectively.
## 0 <=25% 26-50% 51-100%
## 339 54 31 42
We will create a variable (indus1) that groups % NonRetail Business (indus) into three categories: low, med and high, with the distribution shown in the table below.
## low med high
## 143 150 173
Similarly, we create new variable (tax1) that groups Tax Rate (tax) into three categories: low, med and high, with the distribution shown in the table below.
## low med high
## 154 186 126
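The binning itself is done with cut(); a minimal sketch for the tax variable (the breakpoints mirror the appendix code and are a judgment call rather than something dictated by the data):

```r
# Bin Tax Rate into low/med/high (breakpoints mirror the appendix code)
train2$tax1 <- cut(train2$tax, breaks = c(0, 300, 600, 800),
                   labels = c('low', 'med', 'high'))
table(train2$tax1)
```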
Below we visualize these new categories, grouped by whether the crime rate is above or below the median (target).
Looking at the set of outlier flags created above, we want to calculate how many outliers each row has.
As the table shows, there is 1 row with 4 outliers and there are 5 rows with 3 outliers (of the 12 original predictor variables, 6 have potential-outlier flags). For these 6 rows, either a third (4/12) or a quarter (3/12) of the predictor variables are outliers. We will eliminate these 6 rows.
## sumOUT
## 0 1 2 3 4
## 351 79 30 5 1
As we saw above, Borders Charles River (chas) shows that 93% of the neighborhoods do not border the Charles, and this variable shows little correlation with any of the others, including target. We will remove it, along with the outlier flag associated with it.
Avg Rooms (rm) has a strong positive correlation with Med Home Value (medv) and a negative correlation with Lower Status (lstat). We take out Avg Rooms (rm), since it has a weaker correlation with target than the other two. We will also remove its outlier flag.
While we are at it, we will also remove the sumOUT field used to identify rows with multiple outliers.
To see where we are with the numeric fields, we re-run the correlation plot. The only difference from the correlations in the previous section is that Med Home Value (medv) has a slightly lower correlation with target and Nitrogen Oxide (nox) a slightly higher one.
The first model we will tackle is a binary logistic regression model built from the untransformed data (the original data set). This serves as a baseline to determine whether our transformations have the desired effect on the regression model and strengthen it.
We will create the model, and then use stepAIC() with direction=both to refine it.
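A minimal sketch of this step, mirroring the appendix code (stepAIC() comes from the MASS package):

```r
library(MASS)

# Baseline model on the original (untransformed) training data
model1 <- glm(target ~ ., family = binomial, data = train1)

# Stepwise selection in both directions, minimizing AIC
model1.2 <- stepAIC(model1, direction = 'both', trace = 0)
summary(model1.2)
```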
##
## Call:
## glm(formula = target ~ zn + nox + age + dis + rad + tax + ptratio +
## medv, family = binomial, data = train1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -37.415922 6.035013 -6.200 5.65e-10 ***
## zn -0.068648 0.032019 -2.144 0.03203 *
## nox 42.807768 6.678692 6.410 1.46e-10 ***
## age 0.032950 0.010951 3.009 0.00262 **
## dis 0.654896 0.214050 3.060 0.00222 **
## rad 0.725109 0.149788 4.841 1.29e-06 ***
## tax -0.007756 0.002653 -2.924 0.00346 **
## ptratio 0.323628 0.111390 2.905 0.00367 **
## medv 0.110472 0.035445 3.117 0.00183 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 197.32 on 457 degrees of freedom
## AIC: 215.32
##
## Number of Fisher Scoring iterations: 9
Model 1 starts with using all predictor variables and an AIC value of 218.05.
Next, it removes Avg Rooms (rm) and results in an AIC value of 216.71.
It goes through 3 more iterations, and ends up including the following variables: zn, nox, age, dis, rad, tax, ptratio and medv. The result has an AIC value of 215.32, a BIC of 252.62, an \(AIC_c\) of 215.72 and a pseudo \(R^2\) of 0.69.
Recall that in logistic regression, \(R^2\) does not have the same interpretation as in linear regression. Here we are using McFadden’s \(R^2\). This is not the percentage of variance explained by the logistic model, but rather a ratio indicating how close the fit is to perfect versus worst; if the model is a good fit, the value will be close to 1.
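Concretely, the value reported here is the deviance ratio (this is what the r2Log() helper in the appendix computes), using the deviances from the model summary above:

\[R^2_{McFadden} = 1 - \frac{D_{residual}}{D_{null}} = 1 - \frac{197.32}{645.88} \approx 0.69\]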
We can see here that the null deviance (with only the intercept) is 645.88, whereas the residual deviance with the selected variables is 197.32. This is still high, but certainly much closer to 0 than the null deviance.
Exponentiating the intercept (-37.415922) gives a vanishingly small odds value (a decimal that starts with 16 zeros). In other words, when all predictors are zero, the odds of having a crime rate above the median are essentially zero; almost all of the odds come from the predictor variables.
Looking at the Marginal Model Plots, we see good fits between the data and the model, with some issues with % Large Lots (zn) on the left, and a potential impact of outliers on Index: Access to Hwys (rad) and Tax Rate (tax).
Model 2 will use the train2 data set with all the transformations above, except we will remove all the outlier flags. As with Model 1, we will then use stepAIC() with direction=both to refine it.
##
## Call:
## glm(formula = target ~ zn + indus + nox + dis + tax + lstat +
## medv + zn2 + indus1 + tax1, family = binomial, data = train2)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.400e+01 7.354e+00 -4.624 3.77e-06 ***
## zn 1.226e+00 2.400e-01 5.106 3.29e-07 ***
## indus -9.584e-01 1.852e-01 -5.175 2.28e-07 ***
## nox 5.755e+01 1.030e+01 5.587 2.31e-08 ***
## dis 8.573e-01 3.060e-01 2.802 0.005080 **
## tax -2.692e-02 6.682e-03 -4.029 5.61e-05 ***
## lstat 1.883e-01 6.847e-02 2.750 0.005954 **
## medv 1.881e-01 6.144e-02 3.061 0.002204 **
## zn2<=25% -2.359e+01 4.705e+00 -5.015 5.30e-07 ***
## zn226-50% -6.807e+01 3.087e+03 -0.022 0.982410
## zn251-100% -1.338e+02 1.746e+03 -0.077 0.938922
## indus1med 8.022e+00 1.486e+00 5.398 6.75e-08 ***
## indus1high 2.018e+01 3.620e+00 5.575 2.48e-08 ***
## tax1med 4.378e+00 1.051e+00 4.166 3.11e-05 ***
## tax1high 1.320e+01 3.395e+00 3.887 0.000101 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 637.62 on 459 degrees of freedom
## Residual deviance: 109.78 on 445 degrees of freedom
## AIC: 139.78
##
## Number of Fisher Scoring iterations: 20
Model 2 starts with using all predictor variables in our transformed data and an AIC value of 144.88. This is already an improvement over the untransformed data, which started at 218.05 and ended with 215.32.
Next, it removes zn1, the Boolean we created from % Large Lots (zn), and results in an AIC value of 143.04.
It goes through 2 more iterations, and ends up including the following variables: zn, indus, nox, dis, tax, lstat, medv, zn2, indus1 and tax1. The final result has an AIC value of 139.78, a BIC of 201.75, an \(AIC_c\) of 140.86 and a pseudo \(R^2\) of 0.83.
We can see here the Null deviance (with only the intercept) has a deviance of 637.62 whereas the Residual deviance with all the variables is 109.78.
All of this shows improvements over Model 1.
Exponentiating the intercept (-34.00) gives a vanishingly small odds value (a decimal that starts with 14 zeros). In other words, when all predictors are at zero or their reference levels, the odds of having a crime rate above the median are essentially zero; almost all of the odds come from the predictor variables.
In the Marginal Model Plots we see generally good fits between the data and the model, with some separation of the data and model lines on the left in % NonRetail Business (indus). We also see the same issue with % Large Lots (zn) on the left that we saw in Model 1, and a potential impact of outliers on Tax Rate (tax).
The coefficients and confidence intervals in Model 2 can be interpreted as follows. We will use % of Large Lots (zn) in our example, but the same structure applies to all other predictor variables.
A 1 unit increase in % of Large Lots (zn) increases the log-odds of the crime rate being above the median (success) vs the crime rate being below the median (failure) by 1.226, the zn coefficient. Since exp(1.226) is about 3.41, this corresponds to roughly a 241% increase in the odds.
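The arithmetic behind that interpretation, using the zn coefficient from the summary above (the full odds-ratio table further below is produced with exp(coef(...)) in the appendix):

```r
# Odds-ratio arithmetic for zn in Model 2 (coefficient taken from the summary above)
b_zn <- 1.226
exp(b_zn)               # odds ratio, roughly 3.41
(exp(b_zn) - 1) * 100   # roughly a 241% increase in the odds per unit of zn
```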
We can say with 95% confidence that the odds ratio at the intercept, \(\beta_0\), lies in an interval far below 1, which rounds to (0, 0): at the baseline (all predictors at zero or at their reference levels), target=1 is far less likely than target=0.
Note that we are rounding very, very small numbers here. The actual CI is:
(0.0000000000000000000002007277, 0.0000000009124242)
% of Large Lots (zn) has a significant effect on the probability: each unit increase in % of Large Lots (zn) multiplies the odds of target=1 by a factor between 2.255561 and 6.014934 (95% CI).
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 1.711957e-15 2.007277e-22 9.124242e-10
## zn 3.406073e+00 2.255561e+00 6.014934e+00
## indus 3.835002e-01 2.535931e-01 5.291857e-01
## nox 9.890647e+24 1.047587e+17 5.687067e+34
## dis 2.356786e+00 1.320708e+00 4.463804e+00
## tax 9.734384e-01 9.597484e-01 9.855687e-01
## lstat 1.207229e+00 1.059689e+00 1.390679e+00
## medv 1.206931e+00 1.077263e+00 1.374185e+00
## zn2<=25% 5.662022e-11 6.619600e-16 1.659356e-07
## zn226-50% 2.741229e-30 0.000000e+00 1.330340e+22
## zn251-100% 7.702101e-59 0.000000e+00 4.527671e-30
## indus1med 3.047015e+03 2.176195e+02 8.033263e+04
## indus1high 5.816216e+08 1.047436e+06 1.838360e+12
## tax1med 7.964049e+01 1.186983e+01 7.713198e+02
## tax1high 5.385323e+05 1.217929e+03 9.981182e+08
In the confusion matrix below we show:
- 223 accurately predicted 0s (crime not above median) AKA True
Negatives and 9 inaccurate 0s (AKA False Negatives - Type 2
error).
- 218 accurately predicted 1s (crime above median) AKA True Positives
and 10 inaccurate 1s (AKA False Positives - Type 1 error).
##
## pred 0 pred 1
## obs 0 223 10
## obs 1 9 218
Using the Confusion Matrix, we calculate the accuracy of our model is 95.87%. That is, the model is able to classify 95.87% of all observations.
Sensitivity (aka recall or true positive rate) indicates how often the model correctly predicts TRUE out of all actual TRUE events. For Model 2 it is 96.04%.
Specificity, or the true negative rate, indicates how often the model correctly predicts FALSE out of all actual FALSE events. For Model 2 it is 95.71%.
The classification error rate for Model 2 is 4.13%, which, as expected, is the complement of the accuracy (100% minus accuracy).
Precision, which indicates how often the predicted TRUE values are actually TRUE is 95.61% for Model 2.
The F-Score is the harmonic mean of sensitivity (recall or true positive rate) and precision. Its value lies between 0 and 1: a value of 1 represents a perfect balance between precision and recall, while 0 represents the worst case. For Model 2, the F-Score is 0.96.
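All of the metrics above can be reproduced from the confusion-matrix counts (a sketch mirroring the calculations in the appendix):

```r
# Counts from the Model 2 confusion matrix above
TN <- 223; FP <- 10; FN <- 9; TP <- 218

accuracy    <- (TP + TN) / (TP + TN + FP + FN)   # ~0.9587
sensitivity <- TP / (TP + FN)                    # ~0.9604 (recall / TPR)
specificity <- TN / (TN + FP)                    # ~0.9571 (TNR)
precision   <- TP / (TP + FP)                    # ~0.9561
f_score     <- 2 * precision * sensitivity / (precision + sensitivity)  # ~0.96
```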
The area under the ROC curve is an index of accuracy. The ROC curve plots the true positive rate against the false positive rate (1 - specificity) across classification thresholds. The higher the area under the curve (AUC), the better the predictive power of the model. An AUC of 1 indicates a perfect predictive model, and an AUC greater than 0.70 indicates a good model.
For Model 2, we have an AUC of 0.99.
##
## Call:
## roc.formula(formula = target ~ model2$fitted.values, data = train2, plot = TRUE, main = "ROC CURVE", col = "#6aaec8")
##
## Data: model2$fitted.values in 233 controls (target 0) < 227 cases (target 1).
## Area under the curve: 0.9901
We will use Cook’s distance to find influential values. Not all outliers are influential observations. To check whether the data contain potential influential observations, the standardized residuals can be inspected. Data points with an absolute standardized residual above 3 represent possible outliers and may deserve closer attention.
When we plot Cook’s Distance, we identify the top 3 distances, as shown below, observations 14, 356 and 451. But when we filter for rows with an absolute standardized residuals above 3, we get 0 rows, so there are no influential outliers.
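A minimal sketch of this check, mirroring the appendix code (broom::augment() supplies the standardized residuals):

```r
library(broom)

# Cook's distance plot with the 3 largest values labeled
plot(model2, which = 4, id.n = 3)

# Rows whose absolute standardized residual exceeds 3 (none here)
cooks2 <- augment(model2)
cooks2[abs(cooks2$.std.resid) > 3, ]
```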
We use the Variance Inflation Factor (VIF) to check for multicollinearity.
Note that in the table below we renamed the column GVIF^(1/(2*Df)) as ‘result’ to make it easier to reference when we graph it.
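The VIF table below is produced roughly as follows (mirroring the appendix; vif() comes from the car package):

```r
library(car)

# Generalized VIFs for the stepped Model 2; third column is GVIF^(1/(2*Df))
vif2 <- vif(model2.2)
colnames(vif2)[3] <- 'result'   # rename for easier plotting later
as.data.frame(vif2)
```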
% Large Lots (zn) stands out as problematic. Note that the model includes both the original field and the binned factor created from it (zn2). This will be addressed in Model 3.
## GVIF Df result
## zn 69.640547 1 8.345091
## indus 22.160892 1 4.707536
## nox 5.926183 1 2.434375
## dis 4.601422 1 2.145093
## tax 6.136864 1 2.477269
## lstat 2.351916 1 1.533596
## medv 3.532100 1 1.879388
## zn2 61.224581 3 1.985276
## indus1 65.927776 2 2.849490
## tax1 7.695479 2 1.665555
We have read some warnings about using the automated stepwise functions, so we thought we would try a model where we choose predictor variables to include by hand, using the information in the Data Exploration, Data Preparation and Model 2 Results section above. We also want to address the multicollinearity found with % Large Lots (zn) and the created field zn2 in Model 2.
After many manual iterations of adding and subtracting variables, we land on the fields nox, tax, tax1, indus, indus1, rad, zn, and of course target.
We also convert zn to a Boolean that is TRUE when the percentage zoned for large lots is 25 or greater.
##
## Call:
## glm(formula = target ~ ., family = binomial, data = train3)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.310e+01 2.915e+00 -4.494 6.97e-06 ***
## nox 3.272e+01 5.970e+00 5.480 4.26e-08 ***
## tax -2.184e-02 5.212e-03 -4.190 2.79e-05 ***
## tax1med 2.727e+00 8.253e-01 3.304 0.000954 ***
## tax1high 4.495e+00 9.872e+00 0.455 0.648871
## indus -6.129e-01 1.607e-01 -3.815 0.000136 ***
## indus1med 3.538e+00 1.138e+00 3.109 0.001879 **
## indus1high 1.242e+01 3.160e+00 3.931 8.47e-05 ***
## rad 4.951e-01 2.264e-01 2.187 0.028768 *
## znTRUE -1.668e+01 1.016e+03 -0.016 0.986910
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 637.62 on 459 degrees of freedom
## Residual deviance: 155.06 on 450 degrees of freedom
## AIC: 175.06
##
## Number of Fisher Scoring iterations: 18
Model 3 has an AIC of 175.06, a BIC of 216.37, an \(AIC_c\) of 175.55 and a pseudo \(R^2\) of 0.76.
We can see here the Null deviance (with only the intercept) has a deviance of 637.62 whereas the Residual deviance with all the variables is 155.06.
All of this shows improvements over Model 1, but not over Model 2. However, Model 2 had some problematic multicollinearity, which makes its results suspect.
Exponentiating the intercept (-13.10) gives a very small odds value (a decimal that starts with 5 zeros). In other words, when all predictors are at zero or their reference levels, the odds of having a crime rate above the median are essentially zero; almost all of the odds come from the predictor variables.
Looking at the Marginal Model Plots, we see generally good fits between the data and the model, with some separation of the data and model lines on the left in % NonRetail Business (indus). % Large Lots (zn) is now a Boolean.
## Error in names(dat) <- object$term :
## 'names' attribute [1] must be the same length as the vector [0]
## Error in names(dat) <- object$term :
## 'names' attribute [1] must be the same length as the vector [0]
The coefficients and confidence intervals in Model 3 can be interpreted as follows. We will use Index: Access to Hwys (rad) in our example, but the same structure applies to all other predictor variables.
Coefficients:
A 1 unit increase in Index: Access to Hwys (rad) increases the log-odds of the crime rate being above the median (success) vs the crime rate being below the median (failure) by 0.495, the rad coefficient. Since exp(0.495) is about 1.64, this corresponds to roughly a 64% increase in the odds.
Confidence Interval (CI):
We can say with 95% confidence that the odds ratio at the intercept, \(\beta_0\), lies in an interval far below 1, which rounds to (0, 0): at the baseline (all predictors at zero or at their reference levels), target=1 is far less likely than target=0.
Note that we are rounding very, very small numbers here. The actual CI is:
( 0.000000002864240, 0.0003353086)
Index: Access to Hwys (rad) has a significant effect on the probability: each unit increase in Index: Access to Hwys (rad) multiplies the odds of target=1 by a factor between 1.131544 and 2.789003 (95% CI).
## Waiting for profiling to be done...
## OR 2.5 % 97.5 %
## (Intercept) 2.043993e-06 2.864240e-09 3.353086e-04
## nox 1.616852e+14 6.231542e+09 1.626052e+20
## tax 9.783997e-01 9.675599e-01 9.878811e-01
## tax1med 1.528011e+01 3.263914e+00 8.585769e+01
## tax1high 8.959064e+01 2.443073e-04 7.459579e+05
## indus 5.417722e-01 3.864839e-01 7.286074e-01
## indus1med 3.438522e+01 4.177222e+00 3.748291e+02
## indus1high 2.479456e+05 7.469731e+02 1.986877e+08
## rad 1.640590e+00 1.131544e+00 2.789003e+00
## znTRUE 5.724841e-08 1.485107e-137 3.237840e+14
In the confusion matrix below we show:
- 218 accurately predicted 0s (crime not above median) AKA True
Negatives and 11 inaccurate 0s (AKA False Negatives - Type 2
error).
- 216 accurately predicted 1s (crime above median) AKA True Positives
and 15 inaccurate 1s (AKA False Positives - Type 1 error).
##
## pred 0 pred 1
## obs 0 218 15
## obs 1 11 216
From the confusion matrix, the accuracy of our model is 94.35%. That is, the model is able to classify 94.35% of all observations correctly.
Sensitivity (aka recall or true positive rate) indicates how often the model correctly predicts TRUE out of all actual TRUE events. For Model 3 it is 95.15%.
Specificity, or the true negative rate, indicates how often the model correctly predicts FALSE out of all actual FALSE events. For Model 3 it is 93.56%.
The classification error rate for Model 3 is 5.65%, which, as expected, is the complement of the accuracy (100% minus accuracy).
Precision, which indicates how often the predicted TRUE values are actually TRUE is 93.51% for Model 3.
The F-Score is the harmonic mean of sensitivity (recall or true positive rate) and precision. Its value lies between 0 and 1: a value of 1 represents a perfect balance between precision and recall, while 0 represents the worst case. For Model 3, the F-Score is 0.94.
The area under the ROC curve is an index of accuracy. The ROC curve plots the true positive rate against the false positive rate (1 - specificity) across classification thresholds. The higher the area under the curve (AUC), the better the predictive power of the model. An AUC of 1 indicates a perfect predictive model, and an AUC greater than 0.70 indicates a good model.
For Model 3, we have an AUC of 0.98.
##
## Call:
## roc.formula(formula = target ~ model3$fitted.values, data = train3, plot = TRUE, main = "ROC CURVE", col = "#6aaec8")
##
## Data: model3$fitted.values in 233 controls (target 0) < 227 cases (target 1).
## Area under the curve: 0.9798
We will use Cook’s distance to find influential values. Not all outliers are influential observations. To check whether the data contain potential influential observations, the standardized residuals can be inspected. Data points with an absolute standardized residual above 3 represent possible outliers and may deserve closer attention.
When we plot Cook’s Distance, we identify the top 3 distances, as shown below, observations 14, 56 and 257. But when we filter for rows with an absolute standardized residuals above 3, we get 0 rows, so there are no influential outliers.
We again use the Variance Inflation Factor (VIF) to check for multicollinearity.
Recall that in the table below we renamed the column GVIF^(1/(2*Df)) as ‘result’ to make it easier to reference when we graph it.
As you can see, the changes in the model have addressed the multicollinearity found in Model 2.
## GVIF Df result
## nox 2.061370 1 1.435747
## tax 3.676886 1 1.917521
## tax1 3.732575 2 1.389959
## indus 22.036334 1 4.694287
## indus1 41.603662 2 2.539703
## rad 1.928874 1 1.388839
## zn 1.000000 1 1.000000
Though Model 2 showed “better” numbers in the results, there was a multicollinearity issue with % Large Lots (zn), and the logistic model assumes an absence of multicollinearity. Model 3 addressed this issue, and its results, though somewhat weaker, still showed a valid model. Therefore, we selected Model 3.
We apply the same transformations on the evaluation data that ended up being used in our selected model:
- bins for Tax Rate (tax)
- bins for % NonRetail Business (indus)
- transform % Large Lots (zn) into a Boolean that is TRUE when the percentage is 25 or greater
We then classify the predicted probabilities using a 0.5 threshold, as sketched below.
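A minimal sketch of the scoring step (mirroring the appendix code):

```r
# Predicted probabilities from the selected model on the transformed evaluation data
eval_prob <- predict(model3, eval1, type = 'response')

# Classify using the 0.5 threshold
eval_results <- ifelse(eval_prob > 0.5, 1, 0)
```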
See attached csv for results.
library(tidyverse)
library(skimr) #skim
library(corrplot) #corrplot
library(ggpubr) #ggarrange
library(MASS) #stepAIC
library(car) #vif()
library(pROC) #for AUC
library(broom) #augment()
library(gamlr) #AICc()
library(outliers) #testing for outliers
library(qdapTools) #boolean table of outliers
library(tigerstats) #rowPerc
#load training data
train1 <- read_csv("/Users/dianaplunkett/CUNY/621 Business Analytics and Data Mining/Homework 3/crime-training-data_modified.csv", show_col_types = FALSE)
#load evaluation data
eval1 <- read_csv("/Users/dianaplunkett/CUNY/621 Business Analytics and Data Mining/Homework 3/crime-evaluation-data_modified.csv", show_col_types = FALSE)
#take a peek at the data
glimpse(train1)
# 1.Data Exploration
#Data Skimming
skim(train1)
#Binariness
#table showing count of values (0,1) for variable target
xtabs (~target, train1)
#Histograms
#Skim histos are small, so looking at some a bit closer
p1 <- train1 %>%
ggplot (aes(x=zn))+
geom_histogram(binwidth=1, fill ="#6aaec8")+
labs(x = "% Large Lots") +
theme_classic()
p2 <- train1 %>%
ggplot (aes(x=indus))+
geom_histogram(binwidth=5, fill ="#6aaec8")+
labs(x = "% NonRetail Business") +
theme_classic()
p3 <- train1 %>%
ggplot (aes(x=age))+
geom_histogram(binwidth=5, fill ="#6aaec8")+
labs(x = "% Built <1940") +
theme_classic()
p4 <- train1 %>%
ggplot (aes(x=dis))+
geom_histogram(binwidth=1, fill ="#6aaec8")+
labs(x = "Dist to Employment") +
theme_classic()
p5 <- train1 %>%
ggplot (aes(x=tax))+
geom_histogram(binwidth=25, fill ="#6aaec8")+
labs(x = "Tax Rate") +
theme_classic()
p6 <- train1 %>%
ggplot (aes(x=ptratio))+
geom_histogram(binwidth=1, fill ="#6aaec8")+
labs(x = "Pupil/Teacher") +
theme_classic()
ggarrange(p1,p2,p3,p4,p5,p6)
#A quick table to see just how many rows in zn are 0
summary(if_else(
train1$zn ==0,TRUE, FALSE))
#Box Plots
theColors <- c('#6aaec8','#ef8300')
par(mar = c(3, 4, 1, 2))
par(mfrow = c(3,4))
boxplot(zn~target, ylab="% Large Lots",
col=theColors, data = train1)
boxplot(indus~target, ylab="% NonRetail Business",
col=theColors,data = train1)
boxplot(chas~target, ylab="Borders Charles River",
col=theColors,data = train1)
boxplot(nox~target, ylab="Nitrogen Oxide",
col=theColors,data = train1)
boxplot(rm~target, ylab="Avg Rooms",
col=theColors,data = train1)
boxplot(age~target, ylab="% Built <1940",
col=theColors,data = train1)
boxplot(dis~target, ylab="Dist to Employment",
col=theColors,data = train1)
boxplot(rad~target, ylab="Index: Access to Hwys",
col=theColors,data = train1)
boxplot(tax~target, ylab="Tax Rate",
col=theColors,data = train1)
boxplot(ptratio~target, ylab="Pupil/Teacher",
col=theColors,data = train1)
boxplot(lstat~target, ylab="Lower Status",
col=theColors,data = train1)
legend(-.75,2, legend=c("Below Med", "Above Med"), col=theColors,lwd = 5, xpd = TRUE, horiz = FALSE, cex = .75, seg.len=.25, bty = 'n', title="Crime Rate")
boxplot(medv~target, ylab="Med Home Value",
col=theColors,data = train1)
#table showing count of values (0,1) for variable chas
round(rowPerc(xtabs (~chas, train1)),0)
#table showing row percentages of chas by target
round(rowPerc(xtabs (~target+chas, train1)),0)
#Jitter plots
p7 <- ggplot(train1, aes(zn, target, color=ifelse(target==0,"#6aaec8","#ef8300")))+
geom_jitter(width = 0.1, height=0.1, alpha=0.3)+
scale_color_identity()+
labs(x = "% Large Lots") +
theme_classic(base_size = 8)
p8 <- ggplot(train1, aes(indus, target, color=ifelse(target==0,"#6aaec8","#ef8300")))+
geom_jitter(width = 0.1, height=0.1, alpha=0.3)+
scale_color_identity()+
labs(x = "% NonRetail Business") +
theme_classic(base_size = 8)
p9 <- ggplot(train1, aes(chas, target, color=ifelse(target==0,"#6aaec8","#ef8300")))+
geom_jitter(width = 0.1, height=0.1, alpha=0.3)+
scale_color_identity()+
labs(x = "Borders Charles River") +
theme_classic(base_size = 8)
p10 <- ggplot(train1, aes(nox, target, color=ifelse(target==0,"#6aaec8","#ef8300")))+
geom_jitter(width = 0.1, height=0.1, alpha=0.3)+
scale_color_identity()+
labs(x = "Nitrogen Oxide") +
theme_classic(base_size = 8)
p11 <- ggplot(train1, aes(rm, target, color=ifelse(target==0,"#6aaec8","#ef8300")))+
geom_jitter(width = 0.1, height=0.1, alpha=0.3)+
scale_color_identity()+
labs(x = "Avg Rooms") +
theme_classic(base_size = 8)
p12 <- ggplot(train1, aes(age, target, color=ifelse(target==0,"#6aaec8","#ef8300")))+
geom_jitter(width = 0.1, height=0.1, alpha=0.3)+
scale_color_identity()+
labs(x = "% Built <1940") +
theme_classic(base_size = 8)
p13<- ggplot(train1, aes(dis, target, color=ifelse(target==0,"#6aaec8","#ef8300")))+
geom_jitter(width = 0.1, height=0.1, alpha=0.3)+
scale_color_identity()+
labs(x = "Dist to Employment") +
theme_classic(base_size = 8)
p14<- ggplot(train1, aes(rad, target, color=ifelse(target==0,"#6aaec8","#ef8300")))+
geom_jitter(width = 0.1, height=0.1, alpha=0.3)+
scale_color_identity()+
labs(x = "Index: Access to Hwys") +
theme_classic(base_size = 8)
p15<-ggplot(train1, aes(tax, target, color=ifelse(target==0,"#6aaec8","#ef8300")))+
geom_jitter(width = 0.1, height=0.1, alpha=0.3)+
scale_color_identity()+
labs(x = "Tax Rate") +
theme_classic(base_size = 8)
p16 <- ggplot(train1, aes(ptratio, target, color=ifelse(target==0,"#6aaec8","#ef8300")))+
geom_jitter(width = 0.1, height=0.1, alpha=0.3)+
scale_color_identity()+
labs(x = "Pupil/Teacher") +
theme_classic(base_size = 8)
p17<-ggplot(train1, aes(lstat, target, color=ifelse(target==0,"#6aaec8","#ef8300")))+
geom_jitter(width = 0.1, height=0.1, alpha=0.3)+
scale_color_identity()+
labs(x = "Lower Status") +
theme_classic(base_size = 8)
p18<-ggplot(train1, aes(medv, target, color=ifelse(target==0,"#6aaec8","#ef8300")))+
geom_jitter(width = 0.1, height=0.1, alpha=0.3)+
scale_color_identity()+
labs(x = "Med Home Value")+
theme_classic(base_size = 8)
ggarrange(p7,p8,p9,p10,p11,p12,p13,p14,p15,p16,p17,p18
#,common.legend = TRUE <- this doesn't work
)
#Outliers
#create new variable to flag potential outliers
train1$znOUT<-scores(train1$zn, type='iqr', lim=1.5)
train1$chasOUT<-scores(train1$chas, type='iqr', lim=1.5)
train1$rmOUT<-scores(train1$rm, type='iqr', lim=1.5)
train1$disOUT<-scores(train1$dis, type='iqr', lim=1.5)
train1$lstatOUT<-scores(train1$lstat, type='iqr', lim=1.5)
train1$medvOUT<-scores(train1$medv, type='iqr', lim=1.5)
#using library(qdapTools) from qdap package to show table with the values of all the OUT variables
t(mtabulate(train1[grep('OUT',names(train1))]))
# Correlation
# want to exclude the new OUTlier variables.
M <- train1[grep('OUT',names(train1), invert = TRUE)]%>%cor()
corrplot.mixed(M, tl.srt = 45, tl.cex = 0.5, na.label = "square", na.label.col = "lightgrey", tl.col = 'black', number.cex = 0.5)
#Data Prep
#make a new df so we can run models on original and transformed data
train2 <- train1
#zn1 is the boolean option for zn
train2 <- train2 %>% mutate (zn1 = if_else(
zn ==0,FALSE, TRUE))
summary(train2$zn1)
#zn2 is the binned option for zn
train2$zn2 <- cut(train2$zn, c(-1,1,25,50,100), labels=c('0','<=25%','26-50%','51-100%'))
summary(train2$zn2)
#bin % NonRetail Business
train2$indus1 <- cut(train2$indus, c(0,6,18,30), labels = c('low','med', 'high'))
summary(train2$indus1)
#bin tax rate
train2$tax1 <- cut(train2$tax, c(0,300,600,800),
labels = c('low','med', 'high')
)
summary(train2$tax1)
#group & viz for zn1
zn1 <- train2 %>% group_by(zn1, target) %>% summarise(counts=n())
p19 <- ggplot(zn1, aes(fill=as.factor(target), x = zn1, y = counts, )) +
geom_bar(position="dodge", stat='identity' ) +
scale_fill_manual(values = c("#6aaec8", "#ef8300"), labels=c('Below Med', 'Above Med'))+
labs(fill='Crime Rate', x='Zoned for large lots?')+
theme_classic(base_size = 8)+
theme(legend.position="none")
#group & viz for zn2
zn2 <- train2 %>% group_by(zn2, target) %>% summarise(counts=n())
p20 <- ggplot(zn2, aes(fill=as.factor(target), x = zn2, y = counts, )) +
geom_bar(position="dodge", stat='identity' ) +
scale_fill_manual(values = c("#6aaec8", "#ef8300"), labels=c('Below Med', 'Above Med'))+
labs(fill='Crime', x=' % Zoned for Large Lots')+
theme_classic(base_size = 8)+
theme(legend.position="none")
#group & viz for indus1
indus1 <- train2 %>% group_by(indus1, target) %>% summarise(counts=n())
p21 <- ggplot(indus1, aes(fill=as.factor(target),x= indus1, y = counts, )) +
geom_bar(position="dodge", stat='identity' ) +
scale_fill_manual(values = c("#6aaec8", "#ef8300"), labels=c('Below Med', 'Above Med'))+
labs(fill='Crime', x=' % NonRetail Business')+
theme_classic(base_size = 8)+
theme(legend.position="none")
#group & viz for tax1
tax1 <- train2 %>% group_by(tax1, target) %>% summarise(counts=n())
p22 <- ggplot(tax1, aes(fill=as.factor(target), x = tax1, y = counts, )) +
geom_bar(position="dodge", stat='identity' ) +
scale_fill_manual(values = c("#6aaec8", "#ef8300"), labels=c('Below Med', 'Above Med'))+
labs(fill='Crime', x=' Tax Rate')+
theme_classic(base_size = 8)
#plot all 4 new category variables
ggarrange(p19, p20, p21,p22)
#calc the number of outliers in each row and save value in sumOUT
train2$sumOUT <-rowSums(train2[grep('OUT',names(train2))])
xtabs (~sumOUT, train2)
#eliminate rows with outliers in 3 or more fields.
train2 <-subset(train2,sumOUT<=2)
train2<-subset(train2,select=-c(chas,chasOUT,rm,rmOUT, sumOUT))
# Correlation
# want to exclude the OUTlier variables and the factors we created, as they are not numeric
M2 <- select_if(train2, is.numeric)%>%cor()
corrplot.mixed(M2, tl.srt = 45, tl.cex = 0.5, na.label = "square", na.label.col = "lightgrey", tl.col = 'black', number.cex = 0.5)
#Note that first we have to remove the outlier flags
train1<-train1[grep('OUT',names(train1), invert = TRUE)]
#model1 on original data
model1 <- glm(target ~ ., family = binomial, data = train1)
#model1.2 using stepAIC on model1 (original data)
model1.2 <- stepAIC(model1, direction='both',trace=0)
summary(model1.2)
BIC(model1.2)
AICc(model1.2)
# Computation of the R^2 with a function
r2Log <- function(model) {
summaryLog <- summary(model)
1 - summaryLog$deviance / summaryLog$null.deviance
}
r2Log(model1.2)
#Looking at the logit of odds of the intercept (-37.415922)
format(exp(-37.415922), scientific = FALSE)
mmps(model1.2)
#Note that first we have to remove the outlier flags
train2<-train2[grep('OUT',names(train2), invert = TRUE)]
#model2 on transformed data
model2 <- glm(target ~ ., family = binomial, data = train2)
summary(model2)
#model2.2 using stepAIC on model2 (transformed data)
model2.2 <- stepAIC(model2, direction='both', trace = 0)
summary(model2.2)
BIC(model2.2)
AICc(model2.2)
r2Log(model2.2)
#Looking at the logit of odds of the intercept (-34.00)
format(exp(-34.00), scientific = FALSE)
#Marginal Model Plots
mmps(model2.2)
#coefficients and confidence intervals
exp(cbind(OR = coef(model2.2), confint(model2.2)))
#Confusion Matrix
#classify the prediction as 1 (crime above med) if the fitted value exceeds 0.5 otherwise 0 (crime below).
train2$predict2 <- ifelse(model2.2$fitted.values >0.5,1,0)
model2table <-table(train2$target, train2$predict2)
rownames(model2table)<-c('obs 0', 'obs 1')
colnames(model2table)<-c('pred 0', 'pred 1')
model2table
# Accuracy
(eff2<- sum(diag(model2table))/sum(model2table)*100)
# Sensitivity
(TPR2<- (model2table[2,2]/sum(model2table[2,]))*100)
# Specificity
(TNR2<- (model2table[1,1]/sum(model2table[1,]))*100)
# Classification Error Rate
cer2 <- 100 - eff2 #eff2=accuracy
# Precision
PREC2<- (model2table[2,2]/sum(model2table[,2]))*100
# F-Score
F_Score2 <- (2 * PREC2 * TPR2 / (PREC2 + TPR2))/100
# ROC Curve to get AUC
roc(target~model2$fitted.values, data = train2, plot = TRUE, main = "ROC CURVE", col= "#6aaec8")
#Influential values
#Cook's Distance
plot(model2, which = 4, id.n=3)
# Extract model results
cooks2 <- augment(model2) %>%
mutate(index = 1:n())
#filter for rows with an absolute standardized residuals above 3
cooks2 %>% filter(abs(.std.resid) > 3)
# Multicollinearity
vif2<-vif(model2.2)
colnames(vif2)[3]<-'result'
vif2<-as.data.frame(vif2)
(vif2)
barplot(vif2$result, names=rownames(vif2), main = "VIF Values", cex.names =.75, col = "#6aaec8")
#add horizontal line at 5
abline(h = 5, lwd = 3, lty = 2)
train3 <- subset(train2,
select=c(nox, tax,tax1, indus, indus1,rad, zn,target))
#update zn to a Boolean option based on <25
train3 <- train3 %>% mutate (zn = if_else(
zn <25,FALSE, TRUE))
model3 <- glm(target ~ ., family = binomial, data = train3)
summary(model3)
BIC(model3)
AICc(model3)
r2Log(model3)
#Looking at the logit of odds of the intercept (-13.10)
format(exp(-13.10), scientific = FALSE)
#Marginal Model Plots
mmps(model3)
#coefficients and confidence intervals
exp(cbind(OR = coef(model3), confint(model3)))
#Confusion Matrix
#classify the prediction as 1 (crime above med) if the fitted value exceeds 0.5 otherwise 0 (crime below).
train3$predict3 <- ifelse(model3$fitted.values >0.5,1,0)
model3table <-table(train3$target, train3$predict3)
rownames(model3table)<-c('obs 0', 'obs 1')
colnames(model3table)<-c('pred 0', 'pred 1')
model3table
# Accuracy
(eff3<- sum(diag(model3table))/sum(model3table)*100)
# Sensitivity
(TPR3<- (model3table[2,2]/sum(model3table[2,]))*100)
# Specificity
(TNR3<- (model3table[1,1]/sum(model3table[1,]))*100)
# Classification Error Rate
(cer3 <- 100 - eff3) #eff3=accuracy
# Precision
PREC3<- (model3table[2,2]/sum(model3table[,2]))*100
# F-Score
F_Score3 <- (2 * PREC3 * TPR3 / (PREC3 + TPR3))/100
# ROC Curve to get AUC
roc(target~model3$fitted.values, data = train3, plot = TRUE, main = "ROC CURVE", col= "#6aaec8")
#Influential values
#Cook's Distance
plot(model3, which = 4, id.n=3)
# Extract model results
cooks3 <- augment(model3) %>%
mutate(index = 1:n())
#filter for rows with an absolute standardized residuals above 3
cooks3 %>% filter(abs(.std.resid) > 3)
# Multicollinearity
vif3<-vif(model3)
colnames(vif3)[3]<-'result'
vif3<-as.data.frame(vif3)
(vif3)
barplot(vif3$result, names=rownames(vif3), main = "VIF Values", cex.names =.75, col = "#6aaec8", ylim = c(0, 5))
#add horizontal line at 5
abline(h = 5, lwd = 3, lty = 2)
#transform the data as needed for model
#bin tax rate
eval1$tax1 <- cut(eval1$tax, c(0,300,600,800),
labels = c('low','med', 'high')
)
#bin % NonRetail Business
eval1$indus1 <- cut(eval1$indus, c(0,6,18,30), labels = c('low','med', 'high'))
#update zn to a Boolean option based on <25
eval1 <- eval1 %>% mutate (zn = if_else(
zn <25,FALSE, TRUE))
#predict
eval_results <- predict(model3, eval1, type ='response')
#adjust using 0.5 threshold
eval_results <- ifelse(eval_results > 0.5, 1, 0)
#write csv for submission
write.csv(as.data.frame(eval_results), "/Users/dianaplunkett/CUNY/621 Business Analytics and Data Mining/Homework 3/HW3_GRP5.csv", row.names=TRUE)