In this homework assignment, you will explore, analyze, and model a data set containing information on crime for various neighborhoods of a major city. Each record has a response variable indicating whether the crime rate is above the median crime rate (1) or not (0).
Your objective is to build a binary logistic regression model on the training data set to predict whether a neighborhood will be at risk for high crime levels. You will provide classifications and probabilities for the evaluation data set using your binary logistic regression model. You can only use the variables given to you (or variables that you derive from the variables provided).
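For reference, the binary logistic regression model relates the predictors to the log-odds of the response:

\[
\log\!\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,
\]

where \(p\) is the probability that target \(= 1\).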
Below is a short description of the variables of interest in the data set:
| Column | Description |
|---|---|
| zn | proportion of residential land zoned for large lots (over 25,000 square feet) (predictor variable) |
| indus | proportion of non-retail business acres per suburb (predictor variable) |
| chas | a dummy var. for whether the suburb borders the Charles River (1) or not (0) (predictor variable) |
| nox | nitrogen oxides concentration (parts per 10 million) (predictor variable) |
| rm | average number of rooms per dwelling (predictor variable) |
| age | proportion of owner-occupied units built prior to 1940 (predictor variable) |
| dis | weighted mean of distances to five Boston employment centers (predictor variable) |
| rad | index of accessibility to radial highways (predictor variable) |
| tax | full-value property-tax rate per $10,000 (predictor variable) |
| ptratio | pupil-teacher ratio by town (predictor variable) |
| lstat | lower status of the population (percent) (predictor variable) |
| medv | median value of owner-occupied homes in $1000s (predictor variable) |
| target | whether the crime rate is above the median crime rate (1) or not (0) (response variable) |
## [1] 466 13
The dataset consists of 466 observations of 13 variables. There are
12 predictor variables and one response variable
(target).
## Rows: 466
## Columns: 13
## $ zn <dbl> 0, 0, 0, 30, 0, 0, 0, 0, 0, 80, 22, 0, 0, 22, 0, 0, 100, 20, 0…
## $ indus <dbl> 19.58, 19.58, 18.10, 4.93, 2.46, 8.56, 18.10, 18.10, 5.19, 3.6…
## $ chas <int> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ nox <dbl> 0.605, 0.871, 0.740, 0.428, 0.488, 0.520, 0.693, 0.693, 0.515,…
## $ rm <dbl> 7.929, 5.403, 6.485, 6.393, 7.155, 6.781, 5.453, 4.519, 6.316,…
## $ age <dbl> 96.2, 100.0, 100.0, 7.8, 92.2, 71.3, 100.0, 100.0, 38.1, 19.1,…
## $ dis <dbl> 2.0459, 1.3216, 1.9784, 7.0355, 2.7006, 2.8561, 1.4896, 1.6582…
## $ rad <int> 5, 5, 24, 6, 3, 5, 24, 24, 5, 1, 7, 5, 24, 7, 3, 3, 5, 5, 24, …
## $ tax <int> 403, 403, 666, 300, 193, 384, 666, 666, 224, 315, 330, 398, 66…
## $ ptratio <dbl> 14.7, 14.7, 20.2, 16.6, 17.8, 20.9, 20.2, 20.2, 20.2, 16.4, 19…
## $ lstat <dbl> 3.70, 26.82, 18.85, 5.19, 4.82, 7.67, 30.59, 36.98, 5.68, 9.25…
## $ medv <dbl> 50.0, 13.4, 15.4, 23.7, 37.9, 26.5, 5.0, 7.0, 22.2, 20.9, 24.8…
## $ target <int> 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 1, 0,…
All of the columns in the dataset are numeric, but the predictor
variable chas is a dummy variable, as is the response
variable target. We re-code them as factors.
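The re-coding is a single mutate call (this mirrors the code in the appendix):

train_df <- train_df |>
  mutate(chas = as.factor(chas), target = as.factor(target))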
Let’s take a look at the summary statistics for the variables in the dataset.
## zn indus chas nox rm
## Min. : 0.00 Min. : 0.460 0:433 Min. :0.3890 Min. :3.863
## 1st Qu.: 0.00 1st Qu.: 5.145 1: 33 1st Qu.:0.4480 1st Qu.:5.887
## Median : 0.00 Median : 9.690 Median :0.5380 Median :6.210
## Mean : 11.58 Mean :11.105 Mean :0.5543 Mean :6.291
## 3rd Qu.: 16.25 3rd Qu.:18.100 3rd Qu.:0.6240 3rd Qu.:6.630
## Max. :100.00 Max. :27.740 Max. :0.8710 Max. :8.780
## age dis rad tax
## Min. : 2.90 Min. : 1.130 Min. : 1.00 Min. :187.0
## 1st Qu.: 43.88 1st Qu.: 2.101 1st Qu.: 4.00 1st Qu.:281.0
## Median : 77.15 Median : 3.191 Median : 5.00 Median :334.5
## Mean : 68.37 Mean : 3.796 Mean : 9.53 Mean :409.5
## 3rd Qu.: 94.10 3rd Qu.: 5.215 3rd Qu.:24.00 3rd Qu.:666.0
## Max. :100.00 Max. :12.127 Max. :24.00 Max. :711.0
## ptratio lstat medv target
## Min. :12.6 Min. : 1.730 Min. : 5.00 0:237
## 1st Qu.:16.9 1st Qu.: 7.043 1st Qu.:17.02 1:229
## Median :18.9 Median :11.350 Median :21.20
## Mean :18.4 Mean :12.631 Mean :22.59
## 3rd Qu.:20.2 3rd Qu.:16.930 3rd Qu.:25.00
## Max. :22.0 Max. :37.970 Max. :50.00
## zn indus nox rm age dis
## 23.3646511 6.8458549 0.1166667 0.7048513 28.3213784 2.1069496
## rad tax ptratio lstat medv
## 8.6859272 167.9000887 2.1968447 7.1018907 9.2396814
We can see the mean, median, standard deviations, etc. for each of the variables in the dataset.
For the target variable, there are 229 instances where
crime level is above the median level (target = 1) and 237
instances where crime is not above the median level (target
= 0).
The minimum, first-quartile, and median values for zn
are all 0. This variable refers to the proportion of residential land
zoned for large lots. Most of the neighborhoods in this dataset do not
have land that is zoned for large lots.
For the age variable, the median is higher than the
mean. This indicates a left skew in the data: most neighborhoods have a
high proportion of owner-occupied units built prior to 1940.
There do not appear to be any missing values to address. Let’s validate this.
## [1] 0
There are in fact no missing values in the dataset.
To check whether the predictor variables are correlated to the target variable, we produce a correlation funnel that visualizes the strength of the relationships between our predictors and our response.
The correlation funnel plots the most important features towards the
top. In our dataset, the four most important features correlated with
the response variable are tax, indus,
ptratio, and nox.
Looking at the features towards the bottom, the four least important
features correlated with the response variable are medv,
lstat, rm, and chas, with
chas being the least correlated to target.
These correlations are measured by the Pearson Correlation coefficient
by default.
Since both chas and target are binary
categorical variables, the correct coefficient to use to understand the
strength of their relationship is actually the \(\phi\) coefficient. If either of these
categorical variables had more than two categories, we would need to
calculate \(\phi\) using the formula
for Cramer’s V (also called Cramer’s \(\phi\)) coefficient. However, in the
special case that both categorical variables are binary, the value of
Cramer’s V coefficient will actually be equal to the value of the
Pearson Correlation coefficient. So either formula actually results in
the same value for \(\phi\). We prove
this below.
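For reference, for a 2×2 table with cell counts \(a\), \(b\), \(c\), \(d\) (reading across the rows), the two formulas are

\[
\phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}},
\qquad
V = \sqrt{\frac{\chi^2}{n \cdot \min(r-1,\, c-1)}},
\]

and for a 2×2 table \(V = |\phi|\), so the two agree whenever \(\phi \ge 0\).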
## [1] TRUE
The value for \(\phi\) is 0.08004
regardless of the formula used to calculate it, and this very low value
confirms that the weak correlation between chas and
target shown in the correlation funnel is accurate.
Now we check for multicollinearity between predictor variables.
It is clear that the predictor variables rad and
tax are very highly correlated (more than 0.9). We will
attempt to account for this when we prepare the data.
There are some smaller, but still high correlations between other
predictor variables as well. indus is highly correlated
(more than 0.7) with nox, dis, and
tax. nox is also highly correlated with
age and dis. rm is highly
correlated with medv. lstat is highly
correlated with medv.
Let’s take a look at the distributions for the predictor variables.
The distribution for rm appears to be normal, and the
distribution for medv is nearly normal. The distributions
for zn, dis, lstat, and
nox are right-skewed. The distributions for
age and ptratio are left-skewed.
The distributions for the remaining variables are multimodal,
including the distribution for chas, which appears
degenerate at first glance. It looks like a near-zero variance
predictor, which we can confirm using the nearZeroVar
function from the caret package.
| | freqRatio | percentUnique | zeroVar | nzv |
|---|---|---|---|---|
| zn | 16.142857 | 5.5793991 | FALSE | FALSE |
| indus | 4.321429 | 15.6652361 | FALSE | FALSE |
| chas | 13.121212 | 0.4291845 | FALSE | FALSE |
| nox | 1.176471 | 16.9527897 | FALSE | FALSE |
| rm | 1.000000 | 89.9141631 | FALSE | FALSE |
| age | 10.500000 | 71.4592275 | FALSE | FALSE |
| dis | 1.000000 | 81.5450644 | FALSE | FALSE |
| rad | 1.110092 | 1.9313305 | FALSE | FALSE |
| tax | 3.457143 | 13.5193133 | FALSE | FALSE |
| ptratio | 4.000000 | 9.8712446 | FALSE | FALSE |
| lstat | 1.000000 | 90.9871245 | FALSE | FALSE |
| medv | 2.142857 | 46.7811159 | FALSE | FALSE |
The percentage of unique values, percentUnique, in the
sample for this predictor is less than the typical threshold of 10
percent, but there is a second criterion to consider: the
freqRatio. This is the ratio of the frequency of the most common
value (0 in this case) to the frequency of the second most common value
(1 in this case). The freqRatio value for this predictor is
less than the typical threshold of 19 (i.e. 95 occurrences of the most
frequent value for every 5 occurrences of the second most frequent
value). So it is not considered a near-zero variance predictor. Neither
are any of the other predictors.
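As a quick sanity check, here is a minimal sketch computing both criteria for chas by hand (the names tbl, freq_ratio, and pct_unique are illustrative, not from the code appendix):

# hand-check of the two nearZeroVar criteria for chas
tbl <- sort(table(train_df$chas), decreasing = TRUE)
freq_ratio <- as.numeric(tbl[1] / tbl[2])                           # 433 / 33, about 13.1 (< 19)
pct_unique <- 100 * length(unique(train_df$chas)) / nrow(train_df)  # 2 / 466, about 0.43 percent (< 10)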
Next we analyze boxplots to determine the spread of the numeric predictor variables. This will also reveal any outliers.
For certain predictors, the spread differs substantially between the two
categories of the response variable: age,
dis, nox, rad, and
tax.
We confirmed earlier that there are no missing values to impute and no near-zero variance variables to remove.
We check whether our predictor variables with skewed distributions would benefit from transformations.
| Skewed Variable | Ideal Lambda Proposed by Box-Cox | Reasonable Alternative Transformation |
|---|---|---|
| zn | -0.3 | log |
| dis | -0.15 | log |
| lstat | 0.25 | log |
| nox | -0.95 | inverse |
| age | 1.3 | no transformation |
| ptratio | 2 | square |
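For reference, the Box-Cox family is indexed by \(\lambda\):

\[
y^{(\lambda)} = \begin{cases} \dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\ \log y, & \lambda = 0 \end{cases}
\]

The "Reasonable Alternative Transformation" column simply rounds each fitted \(\lambda\) to a nearby interpretable power (\(\lambda \approx 0\): log, \(\lambda \approx -1\): inverse, \(\lambda \approx 2\): square, \(\lambda \approx 1\): no transformation).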
Some of the skewed variables might benefit from transformations, so we create transformed versions of our train and test data. We will incorporate these transformed predictors into one of our models in the next section: Model 2.
We now consider modifying the rad and tax
variables in an attempt to minimize the very high correlation we
identified earlier between these two predictors. First, we look at the
rad distribution more closely.
It is apparent that this variable may be more useful as a dummy
variable given the limited potential values for this predictor. We bin
the values and create the rad_less5 dummy variable
indicating 1 if the value for rad is < 5 and 0 if not.
Using the same logic, we also bin the values in the tax
variable into three levels (low is < 300, middle is 300 to 400, high
is 400+), then create two dummy variables: middle_tax and
high_tax.
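A minimal sketch of the binning (this mirrors the code in the appendix):

train_df2 <- train_df |>
  mutate(rad_less5  = as.factor(ifelse(rad < 5, 1, 0)),
         high_tax   = as.factor(ifelse(tax > 400, 1, 0)),
         middle_tax = as.factor(ifelse(tax >= 300 & tax < 400, 1, 0))) |>
  select(-rad, -tax)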
Both binned dummy columns address the correlation concerns that were previously identified. We will attempt to incorporate these binned predictors into one of our models in the next section: Model 3.
Before we build the models, we have decided that none of the models
will be trained on the variable chas, as we have deemed
this predictor irrelevant based on the earlier conclusion that its
correlation with the response variable is very low.
We build three models. Model 1 will be trained using untransformed data, Model 2 will be trained using transformed data for skewed predictors, and Model 3 will be trained using binned data for highly correlated predictors. For Models 1 and 2, stepwise model selection will be used to pick the model with the lowest AIC. For Model 3, an alternative exhaustive search method will be used to pick two sub-models based on different criteria: one with the lowest AIC and one with the lowest BIC.
We start with a full model using the untransformed data and then
perform stepwise model selection to select the model with the smallest
AIC value using the stepAIC() function from the
MASS package.
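The fit-and-select step is brief (mirroring the code in the appendix):

glm_full <- glm(target ~ ., family = "binomial", data = train_df)
model_1 <- stepAIC(glm_full, trace = 0)  # stepwise search minimizing AIC
summary(model_1)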
##
## Call:
## glm(formula = target ~ zn + nox + age + dis + rad + tax + ptratio +
## medv, family = "binomial", data = train_df)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8295 -0.1752 -0.0021 0.0032 3.4191
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -37.415922 6.035013 -6.200 5.65e-10 ***
## zn -0.068648 0.032019 -2.144 0.03203 *
## nox 42.807768 6.678692 6.410 1.46e-10 ***
## age 0.032950 0.010951 3.009 0.00262 **
## dis 0.654896 0.214050 3.060 0.00222 **
## rad 0.725109 0.149788 4.841 1.29e-06 ***
## tax -0.007756 0.002653 -2.924 0.00346 **
## ptratio 0.323628 0.111390 2.905 0.00367 **
## medv 0.110472 0.035445 3.117 0.00183 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 197.32 on 457 degrees of freedom
## AIC: 215.32
##
## Number of Fisher Scoring iterations: 9
Model 1, the stepwise untransformed model, consists of 8 predictor variables and has an AIC of 215.32.
To interpret the model coefficients other than the Intercept, we first need to exponentiate them.
| Feature | Coefficient |
|---|---|
| zn | 9.336551e-01 |
| nox | 3.901012e+18 |
| age | 1.033499e+00 |
| dis | 1.924943e+00 |
| rad | 2.064956e+00 |
| tax | 9.922739e-01 |
| ptratio | 1.382133e+00 |
| medv | 1.116805e+00 |
The coefficients are now easier to interpret. An exponentiated coefficient less than 1 indicates that the odds of the crime rate being above the median decrease as that feature increases, while a coefficient greater than 1 indicates that those odds increase as the feature increases. The percentage change in the odds per 1 unit increase in a feature is the difference between its exponentiated coefficient and 1, multiplied by 100.
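In symbols, for an estimated coefficient \(\beta_j\), the percentage change in the odds per 1 unit increase in predictor \(j\) is

\[
\%\Delta\,\text{odds} = \left(e^{\beta_j} - 1\right) \times 100.
\]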
The coefficient for nox is extremely large. If we look
back at the boxplots, the interquartile ranges of the two categories of
the target value have no overlap for this predictor:
almost all values greater than 0.5 are associated with above-median crime
rates. Because nox is measured on a scale below 1 (roughly 0.39 to 0.87 in
this sample), a full 1 unit increase lies far outside the observed range,
which is why its exponentiated coefficient is so extreme.
With that understood, let's exclude nox so that we can
look at the remaining coefficients more closely.
| Feature | Coefficient | Percentage Change in Odds Crime Rate Above Median |
|---|---|---|
| rad | 2.0649560 | 106.5 |
| dis | 1.9249425 | 92.5 |
| ptratio | 1.3821333 | 38.2 |
| medv | 1.1168050 | 11.7 |
| age | 1.0334994 | 3.3 |
| tax | 0.9922739 | -0.8 |
| zn | 0.9336551 | -6.6 |
Here, we see that increasing rad by 1 unit increases the
odds of being above the median crime rate by 106.5 percent, increasing
zn by 1 unit decreases the odds of being above the median
crime rate by 6.6 percent, and so forth.
Let’s check for possible multicollinearity within this model.
## zn nox age dis rad tax ptratio medv
## 1.789037 3.172660 1.701974 3.595939 1.697110 1.754274 1.865085 2.193689
All of the variance inflation factors are less than 5 so there are no issues of multicollinearity within this model.
We build our second model using the transformed data. We again start with a full model, then use the same stepwise lowest-AIC selection process we used for the first model to get a reduced model.
##
## Call:
## glm(formula = target ~ rm + age + rad + tax + ptratio + medv +
## log_zn + log_dis + inverse_nox + ptratio_sq, family = "binomial",
## data = train_df_trans)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7038 -0.1283 -0.0042 0.0024 3.7605
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 57.953075 16.505867 3.511 0.000446 ***
## rm -1.112460 0.691852 -1.608 0.107848
## age 0.039927 0.013075 3.054 0.002261 **
## rad 0.771059 0.160598 4.801 1.58e-06 ***
## tax -0.004776 0.003080 -1.551 0.120935
## ptratio -5.631650 1.953687 -2.883 0.003944 **
## medv 0.198305 0.072945 2.719 0.006557 **
## log_zn -0.233077 0.091993 -2.534 0.011288 *
## log_dis 3.327668 0.990327 3.360 0.000779 ***
## inverse_nox -10.240584 2.222121 -4.608 4.06e-06 ***
## ptratio_sq 0.164595 0.053176 3.095 0.001966 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 182.17 on 455 degrees of freedom
## AIC: 204.17
##
## Number of Fisher Scoring iterations: 9
Model 2, the stepwise transformed model, consists of 10 predictor variables and has an AIC of 204.17.
To interpret the model coefficients, we first need to exponentiate them again to discuss how 1 unit increases in each of these predictors would affect the odds of the crime rate being above the median.
| Feature | Coefficient | Percentage Change in Odds Crime Rate Above Median |
|---|---|---|
| log_dis | 27.8732635 | 2687.3 |
| rad | 2.1620549 | 116.2 |
| medv | 1.2193343 | 21.9 |
| ptratio_sq | 1.1789157 | 17.9 |
| age | 1.0407348 | 4.1 |
| tax | 0.9952353 | -0.5 |
| log_zn | 0.7920925 | -20.8 |
| rm | 0.3287494 | -67.1 |
| ptratio | 0.0035827 | -99.6 |
| inverse_nox | 0.0000357 | -100.0 |
A 1 unit increase in the transformed predictor log_dis
results in a 2687.3 percent increase in the odds of being above the
median crime rate, a 1 unit increase in the transformed predictor
log_zn results in a 20.8 percent decrease in the odds of
being above the median crime rate, and so forth.
Note that since Model 2 uses the transformed predictor
inverse_nox instead of the original predictor
nox, the exponentiated coefficient for the transformed
predictor is very small instead of very large, and a 1 unit increase in
inverse_nox therefore corresponds to a nearly 100 percent
decrease in the odds of being above the median crime rate.
Note also that the coefficient for the transformed predictor
ptratio_sq is positive, whereas the coefficient for its
untransformed lower-order term ptratio is negative. This is
expected behavior when adding a quadratic term: visually, the
relationship between the response and these predictors would look like
an upward-opening parabola that decreases to a minimum and then
increases.
We check for possible multicollinearity within this model.
## rm age rad tax ptratio medv
## 4.344643 2.141214 1.614811 2.046894 476.759375 7.221909
## log_zn log_dis inverse_nox ptratio_sq
## 2.195675 4.183535 3.930079 461.852610
We see very high variance inflation factors for ptratio
and ptratio_sq. This is expected since we’ve included a
higher-order term and its lower-order term in the model. We keep
both, since retaining only the higher-order term would amount to asserting
that the lower-order term has no effect, and we have no reason to assume
that.
The only other variance inflation factor higher than 5 is for
medv. We will remove this variable from Model 2.
##
## Call:
## glm(formula = target ~ rm + age + rad + tax + ptratio + log_zn +
## log_dis + inverse_nox + ptratio_sq, family = "binomial",
## data = train_df_trans)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6744 -0.1765 -0.0083 0.0024 3.6286
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 60.140668 16.253255 3.700 0.000215 ***
## rm 0.494090 0.372585 1.326 0.184803
## age 0.020514 0.010356 1.981 0.047602 *
## rad 0.767895 0.155308 4.944 7.64e-07 ***
## tax -0.006503 0.003029 -2.147 0.031770 *
## ptratio -6.270963 1.882200 -3.332 0.000863 ***
## log_zn -0.246547 0.087797 -2.808 0.004983 **
## log_dis 2.169973 0.856591 2.533 0.011301 *
## inverse_nox -8.517841 1.979060 -4.304 1.68e-05 ***
## ptratio_sq 0.176822 0.051446 3.437 0.000588 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 190.28 on 456 degrees of freedom
## AIC: 210.28
##
## Number of Fisher Scoring iterations: 9
This increased the AIC for Model 2 to 210.28, which is still lower than the AIC of Model 1 (215.32).
The stepwise lowest-AIC selection method was utilized to optimize the
previous two models’ performance. For Model 3, we will use an
alternative approach, the bestglm function from the
bestglm package, which searches for the best subset
model by evaluating every possible combination of predictors. This package
also allows us to use an alternative criterion to evaluate the model
performance. So we will evaluate using AIC in one iteration and BIC in
another. (This function has a limitation of 15 predictors given the
computational power needed for all variations of the model.)
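The two searches differ only in the IC argument (as in the code appendix):

model_3_aic <- bestglm(train_df2, IC = "AIC", family = binomial)
model_3_bic <- bestglm(train_df2, IC = "BIC", family = binomial)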
## Morgan-Tatar search since family is non-gaussian.
## Morgan-Tatar search since family is non-gaussian.
One nice feature of the bestglm output is that it includes
the best model of each size under the chosen criterion, which allows us to
evaluate whether certain combinations make intuitive sense.
##
## Call:
## glm(formula = y ~ ., family = family, data = Xi, weights = weights)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.5206 -0.2387 -0.0130 0.2454 3.3279
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -34.89257 5.19589 -6.715 1.88e-11 ***
## zn -0.06141 0.02780 -2.209 0.02717 *
## indus -0.07595 0.04544 -1.671 0.09463 .
## nox 42.22695 6.09505 6.928 4.27e-12 ***
## dis 0.48597 0.18853 2.578 0.00995 **
## ptratio 0.24496 0.10466 2.340 0.01926 *
## lstat 0.07749 0.04403 1.760 0.07838 .
## medv 0.18040 0.04026 4.480 7.45e-06 ***
## high_tax1 2.43610 0.60410 4.033 5.52e-05 ***
## middle_tax1 2.46552 0.48470 5.087 3.64e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 220.53 on 456 degrees of freedom
## AIC: 240.53
##
## Number of Fisher Scoring iterations: 8
For Model 3, the model with the lowest AIC returned by the
bestglm function has nine predictors and an AIC of 240.53.
We will refer to this model as Model 3A: Binned (Best AIC)
hereafter.
To interpret the model coefficients, we first need to exponentiate
them again to discuss how 1 unit increases in each of these predictors
would affect the odds of the crime rate being above the median. Since
the coefficient for nox is again very large, and we know
why, we omit it from this discussion.
| Feature | Coefficient | Percentage Change in Odds Crime Rate Above Median |
|---|---|---|
| middle_tax1 | 11.7695493 | 1077.0 |
| high_tax1 | 11.4284198 | 1042.8 |
| dis | 1.6257472 | 62.6 |
| ptratio | 1.2775687 | 27.8 |
| medv | 1.1977016 | 19.8 |
| lstat | 1.0805744 | 8.1 |
| zn | 0.9404404 | -6.0 |
| indus | 0.9268601 | -7.3 |
##
## Call:
## glm(formula = y ~ ., family = family, data = Xi, weights = weights)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2827 -0.2874 -0.0194 0.3163 3.1944
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -20.75803 2.42265 -8.568 < 2e-16 ***
## zn -0.04425 0.02050 -2.158 0.030917 *
## nox 30.82843 3.76920 8.179 2.86e-16 ***
## medv 0.10629 0.02742 3.877 0.000106 ***
## high_tax1 2.12259 0.54002 3.931 8.47e-05 ***
## middle_tax1 2.79677 0.48744 5.738 9.60e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 235.46 on 460 degrees of freedom
## AIC: 247.46
##
## Number of Fisher Scoring iterations: 7
The lowest-BIC model returned by the bestglm function
has only five predictors. We will refer to this model as Model 3B: Binned
(Best BIC) hereafter.
To interpret the model coefficients, we first need to exponentiate
them again to discuss how 1 unit increases in each of these predictors
would affect the odds of the crime rate being above the median. Since
the coefficient for nox is again very large, and we know
why, we omit it from this discussion.
| Feature | Coefficient | Percentage Change in Odds Crime Rate Above Median |
|---|---|---|
| middle_tax1 | 16.3915419 | 1539.2 |
| high_tax1 | 8.3527231 | 735.3 |
| medv | 1.1121435 | 11.2 |
| zn | 0.9567148 | -4.3 |
Let’s review the models for any multicollinearity concerns:
None of the models have correlation concerns that would require further revision of the underlying selected predictors.
The main criterion we will use to select the model with the best performance is the ROC curve, but we need to check for goodness of fit issues prior to that. We will not select a model with a better ROC curve if we determine there is a suggestion of lack of fit for that model. The main criteria we will use to determine lack of fit are binned residual plots, marginal model plots, and the Hosmer-Lemeshow statistic.
To check for goodness of fit, we first plot the residuals for each model against their fitted values.
The binned residual plots don't quite eliminate the patterns in the raw residuals as we expected. We refrain from drawing any definitive conclusions about each model's fit from these plots alone.
Next we create marginal model plots for the response and each
predictor in each model. (Note that the mmps function from
the car package used to generate these plots skips any
factors and interaction terms within the model intentionally. So for
Models 3A and 3B, which include dummy variables we created, fewer
predictors are plotted than are present within the models.)
There is very good agreement between the two fits for all of the
predictors in the marginal model plots for Models 1 and 3B. There is one
notable deviation of fit for the relationship between
target and rm for Model 2, and one notable
deviation of fit for the relationship between target and
indus in Model 3A.
We calculate the Hosmer-Lemeshow statistic for each model to further check for lack of fit.
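With the observations sorted into \(G\) groups by fitted probability (here \(G = 10\), consistent with the 8 degrees of freedom reported below), the statistic, as commonly defined, is

\[
\hat{C} = \sum_{g=1}^{G} \frac{\left(O_g - n_g\bar{\pi}_g\right)^2}{n_g\bar{\pi}_g\left(1 - \bar{\pi}_g\right)},
\]

where \(O_g\) is the observed number of events in group \(g\), \(n_g\) is the group size, and \(\bar{\pi}_g\) is the mean fitted probability in group \(g\); under adequate fit it is approximately \(\chi^2\) with \(G - 2\) degrees of freedom.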
| Model | HL Statistic | DoF | P Value |
|---|---|---|---|
| Model 1: Stepwise Untransformed | 12.57643 | 8 | 0.1272784 |
| Model 2: Stepwise Transformed | 52.0694 | 8 | 1.631963e-08 |
| Model 3A: Binned (Best AIC) | 13.97434 | 8 | 0.08243674 |
| Model 3B: Binned (Best BIC) | 19.66757 | 8 | 0.01166965 |
The moderate p-values here for Models 1 and 3A suggest no lack of fit, while the low p-values for Models 2 and 3B suggest lack of fit. This is a little surprising for Models 3A and 3B, given the marginal model plot issue for the former and the lack thereof for the latter. Note that this statistic cannot tell us whether any of the models overfit the data.
We produce ROC curves for each model to determine which performs better on the training data.
Model 2 performs best on the training data based on its ROC curve, but the only model with no suggestion of lack of fit in both the marginal model plots and the HL statistic was Model 1. Its ROC curve is very similar to that of Model 2 anyway, so we prefer Model 1 and select it for final evaluation.
knitr::opts_chunk$set(echo = TRUE, warning=FALSE)
library(tidyverse)
library(modelr)
library(DataExplorer)
library(correlationfunnel)
library(caret)
library(knitr)
library(confintr)
library(psych)
library(car)
library(corrplot)
library(RColorBrewer)
library(MASS)
select <- dplyr::select
library(glmtoolbox)
library(cowplot)
library(pROC)
library(bestglm)
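# Read in the training and evaluation (test) data sets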
train_df <- read.csv("https://raw.githubusercontent.com/andrewbowen19/businessAnalyticsDataMiningDATA621/main/data/crime-training-data_modified.csv")
test_df <- read.csv("https://raw.githubusercontent.com/andrewbowen19/businessAnalyticsDataMiningDATA621/main/data/crime-evaluation-data_modified.csv")
dim(train_df)
train_df |>
glimpse()
train_df <- train_df |>
mutate(chas = as.factor(chas), target = as.factor(target))
summary(train_df)
# standard deviation
sapply((train_df |> select (-chas, -target)), sd)
sum(is.na(train_df))
train_df_binarized <- train_df |>
binarize(n_bins = 5, thresh_infreq = 0.01, name_infreq = "OTHER",
one_hot = TRUE)
train_df_corr <- train_df_binarized |>
correlate(target__1)
train_df_corr |>
plot_correlation_funnel()
rm(train_df_binarized, train_df_corr)
factors <- c("chas", "target")
cramersv <- round(cramersv(train_df |> select(all_of(factors))), 5)
pearson <- round(cor(as.numeric(train_df$chas), as.numeric(train_df$target), method = "pearson"), 5)
(cramersv == pearson)
corrplot(cor(train_df |> select(-target) |> mutate(chas = as.numeric(chas))),
method="color",
diag=FALSE,
type="lower",
addCoef.col = "black",
number.cex=0.70)
par(mfrow=c(3,4))
par(mai=c(.3,.3,.3,.3))
variables <- names(train_df)
for (i in 1:(length(variables)-1)) {
if (variables[i] %in% factors){
hist(as.numeric(train_df[[variables[i]]]), main = variables[i],
col = "lightblue")
}else{
hist(train_df[[variables[i]]], main = variables[i], col = "lightblue")
}
}
nzv <- nearZeroVar(train_df |> select(-target), saveMetrics = TRUE)
knitr::kable(nzv)
train_df |>
dplyr::select(-chas) |>
gather(key, value, -target) |>
mutate(key = factor(key),
target = factor(target)) |>
ggplot(aes(x = key, y = value)) +
geom_boxplot(aes(fill = target)) +
scale_x_discrete(labels = NULL, breaks = NULL) +
facet_wrap(~ key, scales = 'free', ncol = 3) +
scale_fill_brewer(palette = "Paired") +
theme_minimal() +
theme(strip.text = element_text(face = "bold"))
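# Box-Cox: estimate ideal lambdas for the skewed predictors, then apply simple alternative transformations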
skewed <- c("zn", "dis", "lstat", "nox", "age", "ptratio")
train_df_trans <- train_df
for (i in 1:(length(skewed))){
#Add a small constant to columns with any 0 values
if (sum(train_df_trans[[skewed[i]]] == 0) > 0){
train_df_trans[[skewed[i]]] <-
train_df_trans[[skewed[i]]] + 0.001
}
}
for (i in 1:(length(skewed))){
if (i == 1){
lambdas <- c()
}
bc <- boxcox(lm(train_df_trans[[skewed[i]]] ~ 1),
lambda = seq(-2, 2, length.out = 81),
plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]
lambdas <- append(lambdas, lambda)
}
lambdas <- as.data.frame(cbind(skewed, lambdas))
adj <- c("log", "log", "log", "inverse", "no transformation", "square")
lambdas <- cbind(lambdas, adj)
cols <- c("Skewed Variable", "Ideal Lambda Proposed by Box-Cox", "Reasonable Alternative Transformation")
colnames(lambdas) <- cols
kable(lambdas, format = "simple")
remove <- c("zn", "dis", "lstat", "nox")
train_df_trans <- train_df_trans |>
mutate(log_zn = log(zn),
log_dis = log(dis),
log_lstat = log(lstat),
inverse_nox = nox^-1,
ptratio_sq = ptratio^2) |>
select(-all_of(remove))
test_df_trans <- test_df
for (i in 1:(length(skewed))){
#Add a small constant to columns with any 0 values
if (sum(test_df_trans[[skewed[i]]] == 0) > 0){
test_df_trans[[skewed[i]]] <-
test_df_trans[[skewed[i]]] + 0.001
}
}
test_df_trans <- test_df_trans |>
mutate(log_zn = log(zn),
log_dis = log(dis),
log_lstat = log(lstat),
inverse_nox = nox^-1,
ptratio_sq = ptratio^2) |>
select(-all_of(remove))
ggplot(train_df,aes(x=rad)) +
geom_bar()
remove <- c("rad", "tax")
train_df2 <- train_df |>
mutate(rad_less5=as.factor(ifelse(rad<5,1,0)),
high_tax=as.factor(ifelse(tax>400,1,0)),
middle_tax=as.factor(ifelse(tax<400 & tax>=300,1,0))) |>
select(-all_of(remove)) |>
relocate(target,.after=last_col())
test_df2 <- test_df |>
mutate(rad_less5=as.factor(ifelse(rad<5,1,0)),
high_tax=as.factor(ifelse(tax>400,1,0)),
middle_tax=as.factor(ifelse(tax<400 & tax>=300,1,0))) |>
select(-all_of(remove))
corrplot(cor(train_df2 |> select(-target) |> mutate(chas = as.numeric(chas),
rad_less5 = as.numeric(rad_less5),
high_tax = as.numeric(high_tax),
middle_tax = as.numeric(middle_tax))),
method="color",
diag=FALSE,
type="lower",
addCoef.col = "black",
number.cex=0.70)
train_df <- train_df |>
select(-chas)
train_df_trans <- train_df_trans |>
select(-chas)
train_df2 <- train_df2 |>
select(-chas)
test_df <- test_df |>
select(-chas)
test_df_trans <- test_df_trans |>
select(-chas)
test_df2 <- test_df2 |>
select(-chas)
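# Model 1: full logistic model on untransformed data, then stepwise AIC selection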
glm_full <- glm(target~., family='binomial', data=train_df)
model_1 <- stepAIC(glm_full, trace=0)
summary(model_1)
beta <- coef(model_1)
beta_exp <- as.data.frame(exp(beta)) |>
rownames_to_column()
cols <- c("Feature", "Coefficient")
colnames(beta_exp) <- cols
beta_exp <- beta_exp |>
filter(Feature != "(Intercept)")
knitr::kable(beta_exp, format = "simple")
beta_exp <- beta_exp |>
filter(Feature != "nox") |>
mutate(diff = round(Coefficient - 1, 3) * 100) |>
arrange(desc(diff))
cols <- c("Feature", "Coefficient", "Percentage Change in Odds Crime Rate Above Median")
colnames(beta_exp) <- cols
knitr::kable(beta_exp, format = "simple")
vif(model_1)
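# Model 2: full logistic model on transformed data, then stepwise AIC selection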
glm_full2 <- glm(target~., family='binomial', data=train_df_trans)
model_2 <- stepAIC(glm_full2, trace=0)
summary(model_2)
beta2 <- coef(model_2)
beta2_exp <- as.data.frame(exp(beta2)) |>
rownames_to_column()
cols <- c("Feature", "Coefficient")
colnames(beta2_exp) <- cols
beta2_exp <- beta2_exp |>
filter(Feature != "(Intercept)")
beta2_exp <- beta2_exp |>
mutate(diff = round(Coefficient - 1, 3) * 100) |>
arrange(desc(diff))
cols <- c("Feature", "Coefficient", "Percentage Change in Odds Crime Rate Above Median")
colnames(beta2_exp) <- cols
knitr::kable(beta2_exp, format = "simple")
vif(model_2)
model_2 <- update(model_2, ~ . - medv)
summary(model_2)
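# Model 3: exhaustive best-subset search with bestglm, by AIC (3A) and by BIC (3B)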
model_3_aic <- bestglm(train_df2,IC="AIC",family=binomial)
model_3_bic <- bestglm(train_df2,IC="BIC",family=binomial)
# model_3_aic$BestModels  # optionally inspect the best model of each size under the chosen criterion
summary(model_3_aic$BestModel)
beta3a <- coef(model_3_aic$BestModel)
beta3a_exp <- as.data.frame(exp(beta3a)) |>
rownames_to_column()
cols <- c("Feature", "Coefficient")
colnames(beta3a_exp) <- cols
remove <- c("nox", "(Intercept)")
beta3a_exp <- beta3a_exp |>
filter(!Feature %in% remove)
beta3a_exp <- beta3a_exp |>
mutate(diff = round(Coefficient - 1, 3) * 100) |>
arrange(desc(diff))
cols <- c("Feature", "Coefficient", "Percentage Change in Odds Crime Rate Above Median")
colnames(beta3a_exp) <- cols
knitr::kable(beta3a_exp, format = "simple")
summary(model_3_bic$BestModel)
beta3b <- coef(model_3_bic$BestModel)
beta3b_exp <- as.data.frame(exp(beta3b)) |>
rownames_to_column()
cols <- c("Feature", "Coefficient")
colnames(beta3b_exp) <- cols
remove <- c("nox", "(Intercept)")
beta3b_exp <- beta3b_exp |>
filter(!Feature %in% remove)
beta3b_exp <- beta3b_exp |>
mutate(diff = round(Coefficient - 1, 3) * 100) |>
arrange(desc(diff))
cols <- c("Feature", "Coefficient", "Percentage Change in Odds Crime Rate Above Median")
colnames(beta3b_exp) <- cols
knitr::kable(beta3b_exp, format = "simple")
viz_vif <- function(model_input, title){
barplot(vif(model_input), main = title, col = "steelblue",
ylim=c(0,7), las = 2) #create horizontal bar chart to display each VIF value
abline(h = 5, lwd = 3, lty = 2)
}
par(mfrow=c(1,2))
viz_vif(model_3_aic$BestModel, "Model 3A: Binned (Best AIC) VIF")
viz_vif(model_3_bic$BestModel, "Model 3B: Binned (Best BIC) VIF")
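# Goodness-of-fit diagnostics: raw and binned residual plots for each model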
linpred1 <- predict(model_1)
predprob1 <- predict(model_1, type = "response")
rawres1 <- residuals(model_1, type = "response")
res1 <- as.data.frame(cbind(linpred1, predprob1, rawres1))
linpred2 <- predict(model_2)
predprob2 <- predict(model_2, type = "response")
rawres2 <- residuals(model_2, type = "response")
res2 <- as.data.frame(cbind(linpred2, predprob2, rawres2))
pa <- res1 |>
ggplot() +
geom_point(aes(x = linpred1, y = rawres1)) +
labs(x = "Linear Predictor", y = "Residuals", title = "Not Binned")
train_df_mod1 <- train_df |>
mutate(residuals = residuals(model_1),
linpred = predict(model_1),
predprob = predict(model_1, type = "response"))
binned1 <- train_df_mod1 |>
group_by(cut(linpred, breaks = unique(quantile(linpred,
probs = seq(.05, 1, .05)))))
diag1 <- binned1 |>
summarize(residuals = mean(residuals), linpred = mean(linpred))
pb <- diag1 |>
ggplot() +
geom_point(aes(x = linpred, y = residuals)) +
labs(x = "Linear Predictor", y = "Residuals", title = "Binned")
pc <- res2 |>
ggplot() +
geom_point(aes(x = linpred2, y = rawres2)) +
labs(x = "Linear Predictor", y = "Residuals", title = "Not Binned")
train_df_mod2 <- train_df_trans |>
mutate(residuals = residuals(model_2),
linpred = predict(model_2),
predprob = predict(model_2, type = "response"))
binned2 <- train_df_mod2 |>
group_by(cut(linpred, breaks = unique(quantile(linpred,
probs = seq(.05, 1, .05)))))
diag2 <- binned2 |>
summarize(residuals = mean(residuals), linpred = mean(linpred))
pd <- diag2 |>
ggplot() +
geom_point(aes(x = linpred, y = residuals)) +
labs(x = "Linear Predictor", y = "Residuals", title = "Binned")
title1 <- ggdraw() +
draw_label("Model 1: Stepwise Untransformed")
p1a <- plot_grid(pa, pb, ncol = 2, align = "h", axis = "b")
p1 <- plot_grid(title1, p1a, ncol = 1, rel_heights = c(0.1, 1))
title2 <- ggdraw() +
draw_label("Model 2: Stepwise Transformed")
p2a <- plot_grid(pc, pd, ncol = 2, align = "h", axis = "b")
p2 <- plot_grid(title2, p2a, ncol = 1, rel_heights = c(0.1, 1))
p1
p2
linpred3a <- predict(model_3_aic$BestModel)
predprob3a <- predict(model_3_aic$BestModel, type = "response")
rawres3a <- residuals(model_3_aic$BestModel, type = "response")
res3a <- as.data.frame(cbind(linpred3a, predprob3a, rawres3a))
linpred3b <- predict(model_3_bic$BestModel)
predprob3b <- predict(model_3_bic$BestModel, type = "response")
rawres3b <- residuals(model_3_bic$BestModel, type = "response")
res3b <- as.data.frame(cbind(linpred3b, predprob3b, rawres3b))
pe <- res3a |>
ggplot() +
geom_point(aes(x = linpred3a, y = rawres3a)) +
labs(x = "Linear Predictor", y = "Residuals", title = "Not Binned")
train_df_mod3a <- train_df |>
mutate(residuals = residuals(model_3_aic$BestModel),
linpred = predict(model_3_aic$BestModel),
predprob = predict(model_3_aic$BestModel, type = "response"))
binned3a <- train_df_mod3a |>
group_by(cut(linpred, breaks = unique(quantile(linpred,
probs = seq(.05, 1, .05)))))
diag3a <- binned3a |>
summarize(residuals = mean(residuals), linpred = mean(linpred))
pf <- diag3a |>
ggplot() +
geom_point(aes(x = linpred, y = residuals)) +
labs(x = "Linear Predictor", y = "Residuals", title = "Binned")
pg <- res3b |>
ggplot() +
geom_point(aes(x = linpred3b, y = rawres3b)) +
labs(x = "Linear Predictor", y = "Residuals", title = "Not Binned")
train_df_mod3b <- train_df |>
mutate(residuals = residuals(model_3_bic$BestModel),
linpred = predict(model_3_bic$BestModel),
predprob = predict(model_3_bic$BestModel, type = "response"))
binned3b <- train_df_mod3b |>
group_by(cut(linpred, breaks = unique(quantile(linpred,
probs = seq(.05, 1, .05)))))
diag3b <- binned3b |>
summarize(residuals = mean(residuals), linpred = mean(linpred))
ph <- diag3b |>
ggplot() +
geom_point(aes(x = linpred, y = residuals)) +
labs(x = "Linear Predictor", y = "Residuals", title = "Binned")
title3 <- ggdraw() +
draw_label("Model 3A: Binned (Best AIC)")
p3a1 <- plot_grid(pe, pf, ncol = 2, align = "h", axis = "b")
p3a <- plot_grid(title3, p3a1, ncol = 1, rel_heights = c(0.1, 1))
title4 <- ggdraw() +
draw_label("Model 3B: Binned (Best BIC)")
p3b1 <- plot_grid(pg, ph, ncol = 2, align = "h", axis = "b")
p3b <- plot_grid(title4, p3b1, ncol = 1, rel_heights = c(0.1, 1))
p3a
p3b
palette <- brewer.pal(n = 12, name = "Paired")
mmps(model_1, layout = c(3, 3), grid = FALSE, col.line = palette[c(2,6)],
main = "Model 1: Stepwise Untransformed: Marginal Model Plots")
mmps(model_2, layout = c(3, 4), grid = FALSE, col.line = palette[c(2,6)],
main = "Model 2: Stepwise Transformed: Marginal Model Plots")
model_3_aic_for_mmps <- glm(formula(model_3_aic$BestModel), family = "binomial",
data = train_df2)
mmps(model_3_aic_for_mmps, layout = c(3, 3), grid = FALSE, col.line = palette[c(2,6)],
main = "Model 3A: Binned: Best AIC: Marginal Model Plots", ylab = "target")
model_3_bic_for_mmps <- glm(formula(model_3_bic$BestModel), family = "binomial",
data = train_df2)
mmps(model_3_bic_for_mmps, layout = c(2, 3), grid = FALSE, col.line = palette[c(2,6)],
main = "Model 3B: Binned: Best BIC: Marginal Model Plots", ylab = "target")
hlstat1 <- hltest(model_1, verbose = FALSE)
hlstat2 <- hltest(model_2, verbose = FALSE)
hlstat3a <- hltest(model_3_aic$BestModel, verbose = FALSE)
hlstat3b <- hltest(model_3_bic$BestModel, verbose = FALSE)
models <- c("Model 1: Stepwise Untransformed",
"Model 2: Stepwise Transformed",
"Model 3A: Binned (Best AIC)",
"Model 3B: Binned (Best BIC)")
hl_tbl <- as.data.frame(cbind(models, rbind(hlstat1[2:4], hlstat2[2:4],
hlstat3a[2:4], hlstat3b[2:4])))
cols <- c("Model", "HL Statistic", "DoF", "P Value")
colnames(hl_tbl) <- cols
knitr::kable(hl_tbl, format = "simple")
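# ROC curves and AUC for each model on the training data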
par(mfrow=c(2,2))
par(mai=c(.3,.3,.3,.3))
roc1 <- roc(train_df_mod1$target, train_df_mod1$predprob, plot = TRUE,
print.auc = TRUE, show.thres = TRUE)
title(main = "Model 1: ROC")
roc2 <- roc(train_df_mod2$target, train_df_mod2$predprob, plot = TRUE,
print.auc = TRUE, show.thres = TRUE)
title(main = "Model 2: ROC")
roc3a <- roc(train_df_mod3a$target, train_df_mod3a$predprob, plot = TRUE,
print.auc = TRUE, show.thres = TRUE)
title(main = "Model 3A: ROC")
roc3b <- roc(train_df_mod3b$target, train_df_mod3b$predprob, plot = TRUE,
print.auc = TRUE, show.thres = TRUE)
title(main = "Model 3B: ROC")