The purpose of this analysis is to build a logistic regression model that predicts whether a particular neighborhood in Boston is above or below the median crime level for the city. Our dataset includes information on 466 Boston neighborhoods, each with 12 potential predictor variables and 1 response variable. The response variable is "target", which is 1 if the neighborhood is above the city's median crime level and 0 otherwise. After preparing the data set, we build several binary logistic regression models and select the one that is best suited. We first load and transform the data, then build models using various libraries and functions.
First, I load the libraries that will be used for this assignment.
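The library chunk itself is not shown in the knitted output; a minimal sketch, inferred from the functions used later in this report, might look like this (the exact set is an assumption):

library(knitr)     # kable() for rendering tables
library(Hmisc)     # hist.data.frame() for histograms
library(corrplot)  # corrplot() for the correlation plot
library(pROC)      # roc() for the ROC curve and AUC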
Now I load the crime training and evaluation data sets from GitHub and display a few records to understand the data better.
crime <- read.csv("https://raw.githubusercontent.com/yathdeep/msds-data621/main/crime-training-data_modified.csv")
crime_eval <- read.csv("https://raw.githubusercontent.com/yathdeep/msds-data621/main/crime-evaluation-data_modified.csv")
kable(head(crime))

| zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | medv | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19.58 | 0 | 0.605 | 7.929 | 96.2 | 2.0459 | 5 | 403 | 14.7 | 3.70 | 50.0 | 1 |
| 0 | 19.58 | 1 | 0.871 | 5.403 | 100.0 | 1.3216 | 5 | 403 | 14.7 | 26.82 | 13.4 | 1 |
| 0 | 18.10 | 0 | 0.740 | 6.485 | 100.0 | 1.9784 | 24 | 666 | 20.2 | 18.85 | 15.4 | 1 |
| 30 | 4.93 | 0 | 0.428 | 6.393 | 7.8 | 7.0355 | 6 | 300 | 16.6 | 5.19 | 23.7 | 0 |
| 0 | 2.46 | 0 | 0.488 | 7.155 | 92.2 | 2.7006 | 3 | 193 | 17.8 | 4.82 | 37.9 | 0 |
| 0 | 8.56 | 0 | 0.520 | 6.781 | 71.3 | 2.8561 | 5 | 384 | 20.9 | 7.67 | 26.5 | 0 |
Checking for any missing values in the crime training data; if there are any, we will need to impute them.
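A minimal sketch of the check that produces the output below (the exact chunk is hidden in the knitted report):

any(is.na(crime))    # TRUE if any value in the training set is missing
table(crime$target)  # class balance of the response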
## [1] FALSE
##
## 0 1
## 237 229
The results show there is no missing data.

Of the 12 predictor variables, 11 are numeric and 1 (chas) is categorical. The categorical variable would need to be converted in order to be used appropriately in a generalized linear model. Examining the target variable, we also note that 229 neighborhoods are marked 1 (above the median) and 237 are marked 0.
Displaying a summary of each variable for a better understanding of the data:
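A sketch of the summary call; splitting the 12 predictors into two tables of six is an assumption based on the output below:

kable(summary(crime[, 1:6]))   # zn through age
kable(summary(crime[, 7:12]))  # dis through medv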
| zn | indus | chas | nox | rm | age |
|---|---|---|---|---|---|
| Min. : 0.00 | Min. : 0.460 | Min. :0.00000 | Min. :0.3890 | Min. :3.863 | Min. : 2.90 |
| 1st Qu.: 0.00 | 1st Qu.: 5.145 | 1st Qu.:0.00000 | 1st Qu.:0.4480 | 1st Qu.:5.887 | 1st Qu.: 43.88 |
| Median : 0.00 | Median : 9.690 | Median :0.00000 | Median :0.5380 | Median :6.210 | Median : 77.15 |
| Mean : 11.58 | Mean :11.105 | Mean :0.07082 | Mean :0.5543 | Mean :6.291 | Mean : 68.37 |
| 3rd Qu.: 16.25 | 3rd Qu.:18.100 | 3rd Qu.:0.00000 | 3rd Qu.:0.6240 | 3rd Qu.:6.630 | 3rd Qu.: 94.10 |
| Max. :100.00 | Max. :27.740 | Max. :1.00000 | Max. :0.8710 | Max. :8.780 | Max. :100.00 |

| dis | rad | tax | ptratio | lstat | medv |
|---|---|---|---|---|---|
| Min. : 1.130 | Min. : 1.00 | Min. :187.0 | Min. :12.6 | Min. : 1.730 | Min. : 5.00 |
| 1st Qu.: 2.101 | 1st Qu.: 4.00 | 1st Qu.:281.0 | 1st Qu.:16.9 | 1st Qu.: 7.043 | 1st Qu.:17.02 |
| Median : 3.191 | Median : 5.00 | Median :334.5 | Median :18.9 | Median :11.350 | Median :21.20 |
| Mean : 3.796 | Mean : 9.53 | Mean :409.5 | Mean :18.4 | Mean :12.631 | Mean :22.59 |
| 3rd Qu.: 5.215 | 3rd Qu.:24.00 | 3rd Qu.:666.0 | 3rd Qu.:20.2 | 3rd Qu.:16.930 | 3rd Qu.:25.00 |
| Max. :12.127 | Max. :24.00 | Max. :711.0 | Max. :22.0 | Max. :37.970 | Max. :50.00 |
Histograms of each variable, generated with the hist.data.frame() function from the Hmisc library:
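A minimal sketch of the call:

library(Hmisc)
hist.data.frame(crime)  # one histogram per column of the training data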
We can see near-normal distributions for medv and rm, with lstat, dis, rad, tax, and nox farther off; we may need to transform them. In particular, dis, lstat, and age exhibit a pronounced right skew.
Building a correlation plot to understand the relationships between the variables.
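A minimal sketch of the plot, assuming the default corrplot() display:

library(corrplot)
corrplot(cor(crime))  # correlation matrix of all variables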
We can see strong positive correlations between nox and the set [indus, age, rad, tax, lstat, target], and strong negative correlations between dis and the set [indus, nox, age, rad, tax, target].
Creating box plots of each variable against the target, as sketched below.
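A sketch using base graphics (the grid layout and plotting approach are assumptions):

# one box plot of each predictor, grouped by the binary target
par(mfrow = c(3, 4))
for (v in setdiff(names(crime), "target")) {
  boxplot(crime[[v]] ~ crime$target, main = v, xlab = "target", ylab = v)
}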
Looking at the box plots, we see that in addition to fixing the right skews of dis, lstat, and age, we should consider transforming zn, nox, tax, and rad to reduce their variation across target levels.
This data set has no missing values, so no imputation is needed.
We prepare three data sets from this data:

- dataset1: the original data, untransformed.
- dataset2: a log-transformed version of the data.
- dataset3: the original data augmented with log terms for [dis, lstat, age] and quadratic terms for [zn, nox, tax, rad].

In dataset2, the variables zn, chas, and target are left untransformed (zn contains zeros, and chas and target are binary indicators).
Transforming the data
# Log and quadratic transformations of the skewed variables
crime_N = data.frame(
  dis_n = log(crime$dis),
  lstat_n = log(crime$lstat),
  age_n = log(crime$age),
  zn_n = crime$zn^2,
  nox_n = crime$nox^2,
  tax_n = crime$tax^2,
  rad_n = crime$rad^2)

dataset1 = crime                 # original data, untransformed
dataset2 = log(crime)            # log-transformed data
dataset2$zn = crime$zn           # keep zn as-is (contains zeros)
dataset2$target = crime$target   # keep the binary response as-is
dataset2$chas = crime$chas       # keep the binary dummy as-is
dataset3 = cbind(crime, crime_N) # original plus transformed terms

In this first model, I include all variables; this will help us understand which variables are significant and allow us to build a better model. This model is based on the original, untransformed data.
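A sketch of the fit (the call is confirmed by the summary output below; the object name model1 is an assumption):

model1 <- glm(target ~ ., family = binomial, data = crime)
summary(model1)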
##
## Call:
## glm(formula = target ~ ., family = binomial, data = crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8464 -0.1445 -0.0017 0.0029 3.4665
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -40.822934 6.632913 -6.155 7.53e-10 ***
## zn -0.065946 0.034656 -1.903 0.05706 .
## indus -0.064614 0.047622 -1.357 0.17485
## chas 0.910765 0.755546 1.205 0.22803
## nox 49.122297 7.931706 6.193 5.90e-10 ***
## rm -0.587488 0.722847 -0.813 0.41637
## age 0.034189 0.013814 2.475 0.01333 *
## dis 0.738660 0.230275 3.208 0.00134 **
## rad 0.666366 0.163152 4.084 4.42e-05 ***
## tax -0.006171 0.002955 -2.089 0.03674 *
## ptratio 0.402566 0.126627 3.179 0.00148 **
## lstat 0.045869 0.054049 0.849 0.39608
## medv 0.180824 0.068294 2.648 0.00810 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 192.05 on 453 degrees of freedom
## AIC: 218.05
##
## Number of Fisher Scoring iterations: 9
This original-data model with all variables has an AIC of 218.05; with a sample size over 400, AIC is a reasonable comparison metric here relative to BIC. Residual deviance is 192.05 against a null deviance of almost 646. The most significant variables are nox, rad, dis, ptratio, and medv.
In this model we perform backward elimination using only the significant variables identified in model 1. Variables are removed one at a time, refitting the model after each removal, until the optimal model is found. The elimination trace and the final model are shown below.
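A sketch of the elimination (the starting formula and the default gaussian family are taken from the output below; the object name reducedm2 is reused later in this report):

model2 <- glm(target ~ rad + nox + dis + medv + ptratio, data = crime)
reducedm2 <- step(model2, direction = "backward")  # prints the trace below
summary(reducedm2)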
## Start: AIC=260.51
## target ~ rad + nox + dis + medv + ptratio
##
## Df Deviance AIC
## <none> 46.307 260.51
## - dis 1 46.597 261.42
## - ptratio 1 46.622 261.67
## - medv 1 47.468 270.04
## - rad 1 51.181 305.14
## - nox 1 55.560 343.39
##
## Call:
## glm(formula = target ~ rad + nox + dis + medv + ptratio, data = crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.57941 -0.19595 -0.07129 0.14505 0.89332
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.319691 0.291629 -4.525 7.69e-06 ***
## rad 0.016581 0.002383 6.958 1.19e-11 ***
## nox 2.300490 0.239958 9.587 < 2e-16 ***
## dis -0.019333 0.011387 -1.698 0.090229 .
## medv 0.007108 0.002093 3.396 0.000744 ***
## ptratio 0.015801 0.008928 1.770 0.077426 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1006676)
##
## Null deviance: 116.466 on 465 degrees of freedom
## Residual deviance: 46.307 on 460 degrees of freedom
## AIC: 260.51
##
## Number of Fisher Scoring iterations: 2
This model has a higher AIC than model 1, but its residual deviance is much lower, at 46.307. Note, however, that it was fit with the default Gaussian family (see the dispersion line above), so its AIC and deviance are not directly comparable with those of the binomial models.
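Model 3 fits all variables on the log-transformed dataset2. A sketch of the call (confirmed by the summary output below; the object name is an assumption):

model3 <- glm(target ~ ., family = binomial, data = dataset2)
summary(model3)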
##
## Call:
## glm(formula = target ~ ., family = binomial, data = dataset2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.90738 -0.16285 -0.00161 0.08746 3.13907
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -15.96595 11.72606 -1.362 0.17333
## zn -0.03216 0.02687 -1.197 0.23145
## indus 0.24931 0.52151 0.478 0.63262
## chas 0.86893 0.73997 1.174 0.24028
## nox 24.13637 3.91814 6.160 7.27e-10 ***
## rm 2.11032 3.70451 0.570 0.56891
## age 0.85132 0.63686 1.337 0.18131
## dis 2.69643 0.84395 3.195 0.00140 **
## rad 2.98142 0.68476 4.354 1.34e-05 ***
## tax -1.51088 1.12043 -1.348 0.17750
## ptratio 5.52739 2.12059 2.607 0.00915 **
## lstat 0.17708 0.66971 0.264 0.79146
## medv 2.32688 1.48654 1.565 0.11751
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 206.10 on 453 degrees of freedom
## AIC: 232.1
##
## Number of Fisher Scoring iterations: 8
Model 3 has an AIC of 232.1 and a high residual deviance of 206.10.
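Model 4 fits all variables of dataset3, including the log and quadratic terms. A sketch of the call (the object name is an assumption):

model4 <- glm(target ~ ., family = binomial, data = dataset3)
summary(model4)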
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Call:
## glm(formula = target ~ ., family = binomial, data = dataset3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1703 -0.1209 0.0000 0.0089 3.8317
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.339e+01 2.913e+01 0.460 0.645677
## zn 1.068e-01 2.030e-01 0.526 0.598755
## indus 2.002e-01 1.169e-01 1.712 0.086861 .
## chas -5.510e-01 9.694e-01 -0.568 0.569783
## nox -2.920e+02 1.182e+02 -2.469 0.013537 *
## rm -1.962e+00 1.024e+00 -1.916 0.055373 .
## age 8.780e-02 3.888e-02 2.258 0.023925 *
## dis -3.540e+00 1.268e+00 -2.791 0.005251 **
## rad 1.150e-01 4.250e-01 0.271 0.786639
## tax 2.332e-01 7.049e-02 3.308 0.000939 ***
## ptratio 7.702e-01 2.055e-01 3.748 0.000178 ***
## lstat 3.121e-01 1.723e-01 1.812 0.070047 .
## medv 3.201e-01 1.105e-01 2.898 0.003757 **
## dis_n 1.598e+01 4.899e+00 3.263 0.001104 **
## lstat_n -3.835e+00 2.219e+00 -1.728 0.083914 .
## age_n -1.873e+00 1.530e+00 -1.224 0.221133
## zn_n -4.618e-03 7.091e-03 -0.651 0.514853
## nox_n 3.243e+02 1.136e+02 2.854 0.004315 **
## tax_n -3.718e-04 1.099e-04 -3.382 0.000719 ***
## rad_n 8.662e-02 3.181e-02 2.723 0.006471 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 130.53 on 446 degrees of freedom
## AIC: 170.53
##
## Number of Fisher Scoring iterations: 12
Model 4 has an AIC of 170.53 and a residual deviance of 130.53.
We choose model 2 here because it has the lowest residual deviance and the fewest predictors.
Now we can run the anova() function on the model to analyze the table of deviance.
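A sketch of the call (the chi-squared test matches the Pr(>Chi) column in the output below; reducedm2 is the selected model):

anova(reducedm2, test = "Chisq")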
## Analysis of Deviance Table
##
## Model: gaussian, link: identity
##
## Response: target
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 465 116.466
## rad 1 45.948 464 70.518 < 2.2e-16 ***
## nox 1 22.356 463 48.162 < 2.2e-16 ***
## dis 1 0.693 462 47.469 0.008677 **
## medv 1 0.846 461 46.622 0.003736 **
## ptratio 1 0.315 460 46.307 0.076764 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
An explanation of the anova() output is in the section below.
We have selected model 2 for its lower residual deviance and fewer features. In the anova() output, the difference between the null deviance and the residual deviance shows how our model performs against the null model (a model with only the intercept); the wider this gap, the better. Analyzing the table, we can see the drop in deviance from adding each variable one at a time. Adding rad and nox reduces the residual deviance substantially, while dis, medv, and ptratio improve the model less. A large p-value here indicates that the model without the variable explains roughly the same amount of variation. Ultimately, we want to see a significant drop in both deviance and AIC.
Let's evaluate our model.
As a last step, we are going to plot the ROC curve and calculate the AUC (area under the curve) which are typical performance measurements for a binary classifier. Here is a ROC curve of our selected model and area under the curve:
crime$predict <- predict(reducedm2, crime, type='response')
myroc <- roc(crime$target, crime$predict, plot=T, asp=NA,
             legacy.axes=T, main = "ROC Curve", ret="tp", col="blue")

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## $auc
## Area under the curve: 0.9507
The ROC is a curve generated by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings while the AUC is the area under the ROC curve. As a rule of thumb, a model with good predictive ability should have an AUC closer to 1 (1 is ideal) than to 0.5.