The purpose of this analysis is to build a logistic regression model that predicts whether a particular neighborhood in Boston is above or below the median crime level for the city. Our dataset includes information on 466 Boston neighborhoods, each with 12 potential predictor variables and 1 response variable. The response variable is "target", which is 1 if the neighborhood is above the city's median crime level and 0 otherwise. After preparing the data set, we build several binary logistic regression models and select the one that is best suited. We first load and transform the data, then build models using various libraries and functions.
First, I load the libraries that will be used for this assignment.
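The library chunk itself is not shown in the knitted output; a minimal sketch, inferred from the functions used later in this report, might look like this (the exact set is an assumption):

library(knitr)     # kable() for rendering tables
library(Hmisc)     # hist.data.frame() for histograms
library(corrplot)  # corrplot() for the correlation plot
library(pROC)      # roc() for the ROC curve and AUC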
Now I load the crime training and evaluation data sets from GitHub and display a few records to understand the data better.
crime <- read.csv("https://raw.githubusercontent.com/yathdeep/msds-data621/main/crime-training-data_modified.csv")
crime_eval <- read.csv("https://raw.githubusercontent.com/yathdeep/msds-data621/main/crime-evaluation-data_modified.csv")
kable(head(crime))

| zn | indus | chas | nox | rm | age | dis | rad | tax | ptratio | lstat | medv | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19.58 | 0 | 0.605 | 7.929 | 96.2 | 2.0459 | 5 | 403 | 14.7 | 3.70 | 50.0 | 1 |
| 0 | 19.58 | 1 | 0.871 | 5.403 | 100.0 | 1.3216 | 5 | 403 | 14.7 | 26.82 | 13.4 | 1 |
| 0 | 18.10 | 0 | 0.740 | 6.485 | 100.0 | 1.9784 | 24 | 666 | 20.2 | 18.85 | 15.4 | 1 |
| 30 | 4.93 | 0 | 0.428 | 6.393 | 7.8 | 7.0355 | 6 | 300 | 16.6 | 5.19 | 23.7 | 0 |
| 0 | 2.46 | 0 | 0.488 | 7.155 | 92.2 | 2.7006 | 3 | 193 | 17.8 | 4.82 | 37.9 | 0 |
| 0 | 8.56 | 0 | 0.520 | 6.781 | 71.3 | 2.8561 | 5 | 384 | 20.9 | 7.67 | 26.5 | 0 |
Checking for any missing values in the crime training data; if there are any, we will need to impute them.
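A minimal sketch of the check that produces the output below (the exact chunk is hidden in the knitted report):

any(is.na(crime))    # TRUE if any value in the training set is missing
table(crime$target)  # class balance of the response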
## [1] FALSE
##
## 0 1
## 237 229
The results show there is no missing data.

Of the 12 predictor variables, 11 are numeric and 1 (chas) is categorical. The categorical variable would need to be converted in order to be used appropriately in a generalized linear model. Examining the target variable, we also note that 229 neighborhoods are marked 1 (above the median) and 237 are marked 0.
Displaying a summary of each variable for a better understanding of the data:
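A sketch of the summary call; splitting the 12 predictors into two tables of six is an assumption based on the output below:

kable(summary(crime[, 1:6]))   # zn through age
kable(summary(crime[, 7:12]))  # dis through medv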
| zn | indus | chas | nox | rm | age |
|---|---|---|---|---|---|
| Min. : 0.00 | Min. : 0.460 | Min. :0.00000 | Min. :0.3890 | Min. :3.863 | Min. : 2.90 |
| 1st Qu.: 0.00 | 1st Qu.: 5.145 | 1st Qu.:0.00000 | 1st Qu.:0.4480 | 1st Qu.:5.887 | 1st Qu.: 43.88 |
| Median : 0.00 | Median : 9.690 | Median :0.00000 | Median :0.5380 | Median :6.210 | Median : 77.15 |
| Mean : 11.58 | Mean :11.105 | Mean :0.07082 | Mean :0.5543 | Mean :6.291 | Mean : 68.37 |
| 3rd Qu.: 16.25 | 3rd Qu.:18.100 | 3rd Qu.:0.00000 | 3rd Qu.:0.6240 | 3rd Qu.:6.630 | 3rd Qu.: 94.10 |
| Max. :100.00 | Max. :27.740 | Max. :1.00000 | Max. :0.8710 | Max. :8.780 | Max. :100.00 |

| dis | rad | tax | ptratio | lstat | medv |
|---|---|---|---|---|---|
| Min. : 1.130 | Min. : 1.00 | Min. :187.0 | Min. :12.6 | Min. : 1.730 | Min. : 5.00 |
| 1st Qu.: 2.101 | 1st Qu.: 4.00 | 1st Qu.:281.0 | 1st Qu.:16.9 | 1st Qu.: 7.043 | 1st Qu.:17.02 |
| Median : 3.191 | Median : 5.00 | Median :334.5 | Median :18.9 | Median :11.350 | Median :21.20 |
| Mean : 3.796 | Mean : 9.53 | Mean :409.5 | Mean :18.4 | Mean :12.631 | Mean :22.59 |
| 3rd Qu.: 5.215 | 3rd Qu.:24.00 | 3rd Qu.:666.0 | 3rd Qu.:20.2 | 3rd Qu.:16.930 | 3rd Qu.:25.00 |
| Max. :12.127 | Max. :24.00 | Max. :711.0 | Max. :22.0 | Max. :37.970 | Max. :50.00 |
Histograms of each variable, generated with the hist.data.frame() function from the Hmisc library:
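A minimal sketch of the call:

library(Hmisc)
hist.data.frame(crime)  # one histogram per column of the training data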
We can see near-normal distributions for medv and rm, with lstat, dis, rad, tax, and nox farther off; we may need to transform them. In particular, dis, lstat, and age exhibit a pronounced right skew.
Building a correlation plot to understand the relationships between the variables.
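A minimal sketch of the plot, assuming the default corrplot() display:

library(corrplot)
corrplot(cor(crime))  # correlation matrix of all variables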
We can see strong positive correlations between nox and the set [indus, age, rad, tax, lstat, target], and strong negative correlations between dis and the set [indus, nox, age, rad, tax, target].
Creating box plots of each variable against the target, as sketched below.
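A sketch using base graphics (the grid layout and plotting approach are assumptions):

# one box plot of each predictor, grouped by the binary target
par(mfrow = c(3, 4))
for (v in setdiff(names(crime), "target")) {
  boxplot(crime[[v]] ~ crime$target, main = v, xlab = "target", ylab = v)
}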
Looking at the box plots, we see that in addition to fixing the right skews of dis, lstat, and age, we should consider transforming zn, nox, tax, and rad to reduce their variation across target levels.
This data set has no missing values, so no imputation is needed.
We prepare three data sets from this data:

- dataset1: the original data, untransformed.
- dataset2: a log-transformed version of the data.
- dataset3: the original data augmented with log terms for [dis, lstat, age] and quadratic terms for [zn, nox, tax, rad].

In dataset2, the variables zn, chas, and target are left untransformed (zn contains zeros, and chas and target are binary indicators).
Transforming the data
# Log and quadratic transformations of the skewed variables
crime_N = data.frame(
  dis_n = log(crime$dis),
  lstat_n = log(crime$lstat),
  age_n = log(crime$age),
  zn_n = crime$zn^2,
  nox_n = crime$nox^2,
  tax_n = crime$tax^2,
  rad_n = crime$rad^2)

dataset1 = crime                 # original data, untransformed
dataset2 = log(crime)            # log-transformed data
dataset2$zn = crime$zn           # keep zn as-is (contains zeros)
dataset2$target = crime$target   # keep the binary response as-is
dataset2$chas = crime$chas       # keep the binary dummy as-is
dataset3 = cbind(crime, crime_N) # original plus transformed terms

In this first model, I include all variables; this will help us understand which variables are significant and allow us to build a better model. This model is based on the original, untransformed data.
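A sketch of the fit (the call is confirmed by the summary output below; the object name model1 is an assumption):

model1 <- glm(target ~ ., family = binomial, data = crime)
summary(model1)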
##
## Call:
## glm(formula = target ~ ., family = binomial, data = crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8464 -0.1445 -0.0017 0.0029 3.4665
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -40.822934 6.632913 -6.155 7.53e-10 ***
## zn -0.065946 0.034656 -1.903 0.05706 .
## indus -0.064614 0.047622 -1.357 0.17485
## chas 0.910765 0.755546 1.205 0.22803
## nox 49.122297 7.931706 6.193 5.90e-10 ***
## rm -0.587488 0.722847 -0.813 0.41637
## age 0.034189 0.013814 2.475 0.01333 *
## dis 0.738660 0.230275 3.208 0.00134 **
## rad 0.666366 0.163152 4.084 4.42e-05 ***
## tax -0.006171 0.002955 -2.089 0.03674 *
## ptratio 0.402566 0.126627 3.179 0.00148 **
## lstat 0.045869 0.054049 0.849 0.39608
## medv 0.180824 0.068294 2.648 0.00810 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 192.05 on 453 degrees of freedom
## AIC: 218.05
##
## Number of Fisher Scoring iterations: 9
This original-data model with all variables has an AIC of 218.05; with a sample size over 400, AIC is a reasonable comparison metric here relative to BIC. Residual deviance is 192.05 against a null deviance of almost 646. The most significant variables are nox, rad, dis, ptratio, and medv.
In this model we perform backward elimination using only the significant variables identified in model 1. Variables are removed one at a time, refitting the model after each removal, until the optimal model is found. The elimination trace and the final model are shown below.
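A sketch of the elimination (the starting formula and the default gaussian family are taken from the output below; the object name reducedm2 is reused later in this report):

model2 <- glm(target ~ rad + nox + dis + medv + ptratio, data = crime)
reducedm2 <- step(model2, direction = "backward")  # prints the trace below
summary(reducedm2)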
## Start: AIC=260.51
## target ~ rad + nox + dis + medv + ptratio
##
## Df Deviance AIC
## <none> 46.307 260.51
## - dis 1 46.597 261.42
## - ptratio 1 46.622 261.67
## - medv 1 47.468 270.04
## - rad 1 51.181 305.14
## - nox 1 55.560 343.39
##
## Call:
## glm(formula = target ~ rad + nox + dis + medv + ptratio, data = crime)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.57941 -0.19595 -0.07129 0.14505 0.89332
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.319691 0.291629 -4.525 7.69e-06 ***
## rad 0.016581 0.002383 6.958 1.19e-11 ***
## nox 2.300490 0.239958 9.587 < 2e-16 ***
## dis -0.019333 0.011387 -1.698 0.090229 .
## medv 0.007108 0.002093 3.396 0.000744 ***
## ptratio 0.015801 0.008928 1.770 0.077426 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 0.1006676)
##
## Null deviance: 116.466 on 465 degrees of freedom
## Residual deviance: 46.307 on 460 degrees of freedom
## AIC: 260.51
##
## Number of Fisher Scoring iterations: 2
This model has a higher AIC than model 1, but its residual deviance is much lower, at 46.307. Note, however, that it was fit with the default Gaussian family (see the dispersion line above), so its AIC and deviance are not directly comparable with those of the binomial models.
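Model 3 fits all variables on the log-transformed dataset2. A sketch of the call (confirmed by the summary output below; the object name is an assumption):

model3 <- glm(target ~ ., family = binomial, data = dataset2)
summary(model3)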
##
## Call:
## glm(formula = target ~ ., family = binomial, data = dataset2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.90738 -0.16285 -0.00161 0.08746 3.13907
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -15.96595 11.72606 -1.362 0.17333
## zn -0.03216 0.02687 -1.197 0.23145
## indus 0.24931 0.52151 0.478 0.63262
## chas 0.86893 0.73997 1.174 0.24028
## nox 24.13637 3.91814 6.160 7.27e-10 ***
## rm 2.11032 3.70451 0.570 0.56891
## age 0.85132 0.63686 1.337 0.18131
## dis 2.69643 0.84395 3.195 0.00140 **
## rad 2.98142 0.68476 4.354 1.34e-05 ***
## tax -1.51088 1.12043 -1.348 0.17750
## ptratio 5.52739 2.12059 2.607 0.00915 **
## lstat 0.17708 0.66971 0.264 0.79146
## medv 2.32688 1.48654 1.565 0.11751
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 206.10 on 453 degrees of freedom
## AIC: 232.1
##
## Number of Fisher Scoring iterations: 8
Model 3 has an AIC of 232.1 and a high residual deviance of 206.10.
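Model 4 fits all variables of dataset3, including the log and quadratic terms. A sketch of the call (the object name is an assumption):

model4 <- glm(target ~ ., family = binomial, data = dataset3)
summary(model4)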
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Call:
## glm(formula = target ~ ., family = binomial, data = dataset3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1703 -0.1209 0.0000 0.0089 3.8317
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.339e+01 2.913e+01 0.460 0.645677
## zn 1.068e-01 2.030e-01 0.526 0.598755
## indus 2.002e-01 1.169e-01 1.712 0.086861 .
## chas -5.510e-01 9.694e-01 -0.568 0.569783
## nox -2.920e+02 1.182e+02 -2.469 0.013537 *
## rm -1.962e+00 1.024e+00 -1.916 0.055373 .
## age 8.780e-02 3.888e-02 2.258 0.023925 *
## dis -3.540e+00 1.268e+00 -2.791 0.005251 **
## rad 1.150e-01 4.250e-01 0.271 0.786639
## tax 2.332e-01 7.049e-02 3.308 0.000939 ***
## ptratio 7.702e-01 2.055e-01 3.748 0.000178 ***
## lstat 3.121e-01 1.723e-01 1.812 0.070047 .
## medv 3.201e-01 1.105e-01 2.898 0.003757 **
## dis_n 1.598e+01 4.899e+00 3.263 0.001104 **
## lstat_n -3.835e+00 2.219e+00 -1.728 0.083914 .
## age_n -1.873e+00 1.530e+00 -1.224 0.221133
## zn_n -4.618e-03 7.091e-03 -0.651 0.514853
## nox_n 3.243e+02 1.136e+02 2.854 0.004315 **
## tax_n -3.718e-04 1.099e-04 -3.382 0.000719 ***
## rad_n 8.662e-02 3.181e-02 2.723 0.006471 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 130.53 on 446 degrees of freedom
## AIC: 170.53
##
## Number of Fisher Scoring iterations: 12
Model 4 has an AIC of 170.53 and a residual deviance of 130.53.
We choose model 2 here because it has the lowest residual deviance and the fewest predictors.
Now we can run the anova() function on the model to analyze the table of deviance.
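A sketch of the call (the chi-squared test matches the Pr(>Chi) column in the output below; reducedm2 is the selected model):

anova(reducedm2, test = "Chisq")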
## Analysis of Deviance Table
##
## Model: gaussian, link: identity
##
## Response: target
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 465 116.466
## rad 1 45.948 464 70.518 < 2.2e-16 ***
## nox 1 22.356 463 48.162 < 2.2e-16 ***
## dis 1 0.693 462 47.469 0.008677 **
## medv 1 0.846 461 46.622 0.003736 **
## ptratio 1 0.315 460 46.307 0.076764 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
An explanation of the anova() output is in the section below.
We have selected model 2 for its lower residual deviance and fewer features. In the anova() output, the difference between the null deviance and the residual deviance shows how our model performs against the null model (a model with only the intercept); the wider this gap, the better. Analyzing the table, we can see the drop in deviance from adding each variable one at a time. Adding rad and nox reduces the residual deviance substantially, while dis, medv, and ptratio improve the model less. A large p-value here indicates that the model without the variable explains roughly the same amount of variation. Ultimately, we want to see a significant drop in both deviance and AIC.
Let's evaluate our model.
As a last step, we are going to plot the ROC curve and calculate the AUC (area under the curve) which are typical performance measurements for a binary classifier. Here is a ROC curve of our selected model and area under the curve:
crime$predict <- predict(reducedm2, crime, type='response')
myroc <- roc(crime$target, crime$predict, plot=T, asp=NA,
             legacy.axes=T, main = "ROC Curve", ret="tp", col="blue")

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## $auc
## Area under the curve: 0.9507
The ROC is a curve generated by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings while the AUC is the area under the ROC curve. As a rule of thumb, a model with good predictive ability should have an AUC closer to 1 (1 is ideal) than to 0.5.