DATA 621 Week 3

Introduction)

We are given a set of data containing several predictor variables detailing city attributes such as lot sizes and property values. The data contains a response variable that detailsif a crime rate is greater than the median crime rate. Each row of the data represents a city/neighborhood.

Our goal is to build a model which predicts a cities crime propensity as a function of predictior variables.

The data can be pulled from the following github URL’s. Training: https://raw.githubusercontent.com/vindication09/DATA-621-Week-3/master/crime-training-data.csv Evaluation: https://raw.githubusercontent.com/vindication09/DATA-621-Week-3/master/crime-evaluation-data.csv

Read in the Data

##    zn indus chas   nox    rm   age    dis rad tax ptratio  black lstat
## 1   0 19.58    0 0.605 7.929  96.2 2.0459   5 403    14.7 369.30  3.70
## 2   0 19.58    1 0.871 5.403 100.0 1.3216   5 403    14.7 396.90 26.82
## 3   0 18.10    0 0.740 6.485 100.0 1.9784  24 666    20.2 386.73 18.85
## 4  30  4.93    0 0.428 6.393   7.8 7.0355   6 300    16.6 374.71  5.19
## 5   0  2.46    0 0.488 7.155  92.2 2.7006   3 193    17.8 394.12  4.82
## 6   0  8.56    0 0.520 6.781  71.3 2.8561   5 384    20.9 395.58  7.67
## 7   0 18.10    0 0.693 5.453 100.0 1.4896  24 666    20.2 396.90 30.59
## 8   0 18.10    0 0.693 4.519 100.0 1.6582  24 666    20.2  88.27 36.98
## 9   0  5.19    0 0.515 6.316  38.1 6.4584   5 224    20.2 389.71  5.68
## 10 80  3.64    0 0.392 5.876  19.1 9.2203   1 315    16.4 395.18  9.25
##    medv target
## 1  50.0      1
## 2  13.4      1
## 3  15.4      1
## 4  23.7      0
## 5  37.9      0
## 6  26.5      0
## 7   5.0      1
## 8   7.0      1
## 9  22.2      0
## 10 20.9      0

Any transformations or alterations to any of the variables will be done to both the training and evaluation dataset.

Data Exploration How many records and columns does our data have?

## [1] 466

## [1] 14

##  [1] "zn"      "indus"   "chas"    "nox"     "rm"      "age"     "dis"    
##  [8] "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"    "target"

The data contains 466 rows with each row belonging to a city/neighborhood. There are 14 variables,some continous, discrete, and coded categorical. The variable type plays a major role when it comes to EDA and modeling. our response variable is discrete containing a 1 or a 0. 1 means the crime rate is over the median and 0 means the crime rate is below the median.

The variables are defined as follows:

zn: proportion of residential land zoned for large lots (over 25000 square feet) (predictor variable) indus: proportion of non-retail business acres per suburb (predictor variable) chas: a dummy var. for whether the suburb borders the Charles River (1) or not (0) (predictor variable) nox: nitrogen oxides concentration (parts per 10 million) (predictor variable) rm: average number of rooms per dwelling (predictor variable) age: proportion of owner-occupied units built prior to 1940 (predictor variable) dis: weighted mean of distances to five Boston employment centers (predictor variable) rad: index of accessibility to radial highways (predictor variable) tax: full-value property-tax rate per $10,000 (predictor variable) ptratio: pupil-teacher ratio by town (predictor variable) black: 1000(Bk - 0.63)2 where Bk is the proportion of blacks by town (predictor variable) lstat: lower status of the population (percent) (predictor variable) medv: median value of owner-occupied homes in $1000s (predictor variable) target: whether the crime rate is above the median crime rate (1) or not (0) (response variable)

Since we are dealing with coded categorical values in some instances, we need to think of EDA based off frequency rather than distributions on an x,yplot. What are the number of instances for each coded categorical or proportion variable?

Target(response)

## 
##   0   1 
## 237 229

There are slightly more instances where the crime rate is less than the median.

Chas

## 
##   0   1 
## 433  33

There are much fewer towns that border the Charles river.

Age

Indus

As for the other variables, we are able to plot their spread and descriptive statistics.

##       nox               rm             dis              rad       
##  Min.   :0.3890   Min.   :3.863   Min.   : 1.130   Min.   : 1.00  
##  1st Qu.:0.4480   1st Qu.:5.887   1st Qu.: 2.101   1st Qu.: 4.00  
##  Median :0.5380   Median :6.210   Median : 3.191   Median : 5.00  
##  Mean   :0.5543   Mean   :6.291   Mean   : 3.796   Mean   : 9.53  
##  3rd Qu.:0.6240   3rd Qu.:6.630   3rd Qu.: 5.215   3rd Qu.:24.00  
##  Max.   :0.8710   Max.   :8.780   Max.   :12.127   Max.   :24.00  
##       tax           ptratio         black            lstat       
##  Min.   :187.0   Min.   :12.6   Min.   :  0.32   Min.   : 1.730  
##  1st Qu.:281.0   1st Qu.:16.9   1st Qu.:375.61   1st Qu.: 7.043  
##  Median :334.5   Median :18.9   Median :391.34   Median :11.350  
##  Mean   :409.5   Mean   :18.4   Mean   :357.12   Mean   :12.631  
##  3rd Qu.:666.0   3rd Qu.:20.2   3rd Qu.:396.24   3rd Qu.:16.930  
##  Max.   :711.0   Max.   :22.0   Max.   :396.90   Max.   :37.970  
##       medv      
##  Min.   : 5.00  
##  1st Qu.:17.02  
##  Median :21.20  
##  Mean   :22.59  
##  3rd Qu.:25.00  
##  Max.   :50.00

The plot of the spreads give us information regarding what variables could contain extreme values. Of course we would need a deeper investigation to determine if the extreme values are cause of error or totally plausible. We know what to look out for furthur into EDA. Variables such as dis and ptratio have visible skews.

How does each variable correlate to the response variable?

##          zn       indus        chas         nox          rm         age 
## -0.43168176  0.60485074  0.08004187  0.72610622 -0.15255334  0.63010625 
##         dis         rad         tax     ptratio       black       lstat 
## -0.61867312  0.62810492  0.61111331  0.25084892 -0.35295680  0.46912702 
##        medv      target 
## -0.27055071  1.00000000

There are some variables that are alarming. Indus, nox,age, rad, and tax have a strong positive correlation with the response variable while zn and dis have a strong negative correlation. How do the rest of variables correlate with each other?

## 
## Attaching package: 'reshape2'

## The following object is masked from 'package:tidyr':
## 
##     smiths

From the correlation heatmap, chas stands out in the sense that it does not have much correlation with any of the other variables. It might be a good candidate for removal.

Nox has a high positive correlation with target. Nox represents nitrogen oxides concentration. This correlation is not so intuitive. On the outside what does nitrogen concentration have to do if a neighborhood/city has a crime rate above the median? Without any context such as pollution or air quality, it might be difficult to see the connection. Nox also has a very strong negative correlation with dis. Dis represents the weighted mean of distances to five Boston employment center. This correlation is more intuitive to infer. A employment center perhaps could mean the presense of industry including factories.

Rad and tax have a strong positive correlation 0.91. Rad measures the index of accessibility to radial highway while tax measures full-value property-tax rate per $10,00. This correlation is moe intuitive. Areas with high taxes usually have better infrastructure including number of roads.

The high correlation examples highlighted might imply that they could be removed from the model but of course this can only be verified through diagnostics and quality of output accuracy. Chas is the other variable that could potentially be removed due to the minimal correlation with any other variable.

Data Preperation How many entries are missing for each variable?

##      zn   indus    chas     nox      rm     age     dis     rad     tax 
##       0       0       0       0       0       0       0       0       0 
## ptratio   black   lstat    medv  target 
##       0       0       0       0       0

##      zn   indus    chas     nox      rm     age     dis     rad     tax 
##       0       0       0       0       0       0       0       0       0 
## ptratio   black   lstat    medv 
##       0       0       0       0

The data is complete in both the training and evaluation sets.

Unlike linear regression models, there are certain assumptions that do not to be satisfied. Such assumptions are the normality of residuals, independence of predictor variables, and minimum sample size. Since the response variable is binary, it does not make sense to apply a transform to the response variable. In the later sections, we will be implimenting a family of the GLM which provides more felixibility.

In our EDA, we noticed that some variables were proportions. These variables are good candidates for dummy coding if needed. The only one that stands out is the black variable. The black variable is defined as 1000(Bk - 0.63)2 where Bk is the proportion of blacks by town. We can reverse engineer this value to make the proportions look more like the other variables with proportions.

\[ BK=1000\times(bk-0.63)^2 \]

Divide both sides by 1000, take the square root and add 0.63 \[ \sqrt{\frac{1000}{BK}} + 0.63=bk \]

Lets transform this variable using the derived equation

##   zn indus chas   nox    rm  age    dis rad tax ptratio black lstat medv
## 1  0 19.58    0 0.605 7.929 96.2 2.0459   5 403    14.7 369.3   3.7   50
##   target bk_transform
## 1      1     2.275547

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.217   2.219   2.229   2.749   2.262  56.532

I do not believe that we need to transform any other variable mainly because of the type of modeling we aim to perform. During the model building phase, we will have a better idea if we should use interactive terms or drop terms all together. Since the response variable is a binary operator, it will be left unchanged.

Build Models We want to build a logitistc regression model that predicts the crime porpensity based on crime rate compared to median crime rate. For each model, we will also display potential influential points, model summary, VIF numbers, and standard residual plot. We

Our first model takes all the variables except the transformed black variable

## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──

## ✔ tibble  1.4.2     ✔ dplyr   0.7.4
## ✔ readr   1.1.1     ✔ stringr 1.3.0
## ✔ purrr   0.2.4     ✔ forcats 0.3.0

## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

## Warning: package 'broom' was built under R version 3.4.4

## Warning: package 'car' was built under R version 3.4.4

## Loading required package: carData

## Warning: package 'carData' was built under R version 3.4.4

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:purrr':
## 
##     some

## 
## Call:
## glm(formula = target ~ . - bk_transform, family = binomial, data = crime_training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2854  -0.1372  -0.0017   0.0020   3.4721  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -36.839521   7.028726  -5.241 1.59e-07 ***
## zn           -0.061720   0.034410  -1.794 0.072868 .  
## indus        -0.072580   0.048546  -1.495 0.134894    
## chas          1.032352   0.759627   1.359 0.174139    
## nox          50.159513   8.049503   6.231 4.62e-10 ***
## rm           -0.692145   0.741431  -0.934 0.350548    
## age           0.034522   0.013883   2.487 0.012895 *  
## dis           0.765795   0.234407   3.267 0.001087 ** 
## rad           0.663015   0.165135   4.015 5.94e-05 ***
## tax          -0.006593   0.003064  -2.152 0.031422 *  
## ptratio       0.442217   0.132234   3.344 0.000825 ***
## black        -0.013094   0.006680  -1.960 0.049974 *  
## lstat         0.047571   0.054508   0.873 0.382802    
## medv          0.199734   0.071022   2.812 0.004919 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 186.15  on 452  degrees of freedom
## AIC: 214.15
## 
## Number of Fisher Scoring iterations: 9

##       zn    indus     chas      nox       rm      age      dis      rad 
## 1.846737 2.699929 1.262505 4.296876 6.070541 2.556302 4.025192 1.955112 
##      tax  ptratio    black    lstat     medv 
## 2.174569 2.340600 1.091380 2.681384 8.567935

##   target zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1      1 22  5.86    0 0.431 8.259  8.4 8.9067   7 330    19.1 396.90
## 2      0  0  8.56    0 0.520 6.405 85.4 2.7147   5 384    20.9  70.80
## 3      1  0 10.59    0 0.489 5.412  9.8 3.5875   4 277    18.6 348.93
##   lstat medv bk_transform   .fitted  .se.fit    .resid      .hat    .sigma
## 1  3.54 42.8     2.217302 -1.177843 1.119924  1.700763 0.2257719 0.6359713
## 2 10.63 18.6     4.388230  2.535283 2.142840 -2.285406 0.3123801 0.6292069
## 3 29.55 23.7     2.322898 -3.118805 1.446039  2.514782 0.0847824 0.6304143
##      .cooksd .std.resid index
## 1 0.08736447   1.932900    14
## 2 0.59554988  -2.756063   159
## 3 0.16353423   2.628683   457

##   target zn indus chas   nox    rm  age   dis rad tax ptratio  black lstat
## 1      1 20  6.96    0 0.464 5.856 42.1 4.429   3 223    18.6 388.65    13
##   medv bk_transform   .fitted   .se.fit   .resid        .hat    .sigma
## 1 21.1      2.23406 -6.025248 0.9877586 3.472078 0.002346817 0.6212491
##      .cooksd .std.resid index
## 1 0.06968255   3.476159   338

Before we make an assesment on the significance of predictors, lets build a model with the transformed black variable instead of the original variable.

## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

## 
## Call:
## glm(formula = target ~ . - black, family = binomial, data = crime_training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8263  -0.1428  -0.0016   0.0021   3.4773  
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -43.094274   7.092912  -6.076 1.23e-09 ***
## zn            -0.065972   0.034845  -1.893  0.05832 .  
## indus         -0.065205   0.047752  -1.365  0.17210    
## chas           0.933563   0.755711   1.235  0.21670    
## nox           49.674892   8.001002   6.209 5.35e-10 ***
## rm            -0.642784   0.728322  -0.883  0.37748    
## age            0.034300   0.013870   2.473  0.01340 *  
## dis            0.757860   0.232801   3.255  0.00113 ** 
## rad            0.668084   0.164127   4.071 4.69e-05 ***
## tax           -0.006311   0.002988  -2.112  0.03467 *  
## ptratio        0.408886   0.127689   3.202  0.00136 ** 
## lstat          0.047660   0.054204   0.879  0.37926    
## medv           0.188734   0.069436   2.718  0.00657 ** 
## bk_transform   0.865903   0.903271   0.959  0.33775    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 191.15  on 452  degrees of freedom
## AIC: 219.15
## 
## Number of Fisher Scoring iterations: 9

##           zn        indus         chas          nox           rm 
##     1.833812     2.674167     1.247015     4.225539     5.890692 
##          age          dis          rad          tax      ptratio 
##     2.583552     3.969640     1.955796     2.150534     2.299080 
##        lstat         medv bk_transform 
##     2.650716     8.347214     1.046181

##   target zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1      1 22  5.86    0 0.431 8.259  8.4 8.9067   7 330    19.1 396.90
## 2      0  0  8.56    0 0.520 6.405 85.4 2.7147   5 384    20.9  70.80
## 3      1  0 10.59    0 0.489 5.412  9.8 3.5875   4 277    18.6 348.93
##   lstat medv bk_transform    .fitted  .se.fit    .resid       .hat
## 1  3.54 42.8     2.217302 -1.2184250 1.119548  1.718995 0.22076293
## 2 10.63 18.6     4.388230  0.3274957 1.898050 -1.319274 0.87694457
## 3 29.55 23.7     2.322898 -3.4954511 1.442867  2.655310 0.05948963
##      .sigma    .cooksd .std.resid index
## 1 0.6445302 0.08782419   1.947332    14
## 2 0.6264712 5.73947492  -3.760839   159
## 3 0.6381263 0.15835842   2.738000   457

##   target zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1      0  0  8.56    0 0.520 6.405 85.4 2.7147   5 384    20.9  70.80
## 2      1 20  6.96    0 0.464 5.856 42.1 4.4290   3 223    18.6 388.65
##   lstat medv bk_transform    .fitted   .se.fit    .resid        .hat
## 1 10.63 18.6      4.38823  0.3274957 1.8980499 -1.319274 0.876944574
## 2 13.00 21.1      2.23406 -6.0434942 0.9907719  3.477316 0.002318712
##      .sigma    .cooksd .std.resid index
## 1 0.6264712 5.73947492  -3.760839   159
## 2 0.6300430 0.07011183   3.481355   338

A shocking development arose when looking at the model using the transformed black variable. It seems that the transformed black variable causes the model to contain a variable that perfectly differentiates between 0’1 and 1s. This is caused to to a bias within the predictors. We will retain the unchanged black transform variable and use the AIC and deviance from mod1 as our benchmark .

We should consider removing the chas variable. During the EDA step, we found evidence that chas has a low correlation with any of the other variables. It also has a high p-value

## 
## Call:
## glm(formula = target ~ . - bk_transform - chas, family = binomial, 
##     data = crime_training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2099  -0.1356  -0.0013   0.0018   3.4691  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -35.393661   6.866186  -5.155 2.54e-07 ***
## zn           -0.069129   0.034604  -1.998  0.04575 *  
## indus        -0.057736   0.046446  -1.243  0.21384    
## nox          48.003143   7.781466   6.169 6.88e-10 ***
## rm           -0.703120   0.738498  -0.952  0.34105    
## age           0.035141   0.013766   2.553  0.01069 *  
## dis           0.738029   0.231466   3.189  0.00143 ** 
## rad           0.720422   0.163465   4.407 1.05e-05 ***
## tax          -0.007419   0.003002  -2.472  0.01345 *  
## ptratio       0.409562   0.128375   3.190  0.00142 ** 
## black        -0.012478   0.006486  -1.924  0.05438 .  
## lstat         0.057059   0.053145   1.074  0.28298    
## medv          0.201062   0.070966   2.833  0.00461 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 188.02  on 453  degrees of freedom
## AIC: 214.02
## 
## Number of Fisher Scoring iterations: 9

##       zn    indus      nox       rm      age      dis      rad      tax 
## 1.829739 2.522200 4.044732 5.792163 2.521459 3.974603 1.932627 2.073288 
##  ptratio    black    lstat     medv 
## 2.240897 1.078848 2.512986 8.305521

##   target zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1      1 22  5.86    0 0.431 8.259  8.4 8.9067   7 330    19.1 396.90
## 2      0  0  8.56    0 0.520 6.405 85.4 2.7147   5 384    20.9  70.80
## 3      1  0 10.59    0 0.489 5.412  9.8 3.5875   4 277    18.6 348.93
##   lstat medv bk_transform    .fitted  .se.fit    .resid       .hat
## 1  3.54 42.8     2.217302 -1.2184250 1.119548  1.718995 0.22076293
## 2 10.63 18.6     4.388230  0.3274957 1.898050 -1.319274 0.87694457
## 3 29.55 23.7     2.322898 -3.4954511 1.442867  2.655310 0.05948963
##      .sigma    .cooksd .std.resid index
## 1 0.6445302 0.08782419   1.947332    14
## 2 0.6264712 5.73947492  -3.760839   159
## 3 0.6381263 0.15835842   2.738000   457

##   target zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1      0  0  8.56    0 0.520 6.405 85.4 2.7147   5 384    20.9  70.80
## 2      1 20  6.96    0 0.464 5.856 42.1 4.4290   3 223    18.6 388.65
##   lstat medv bk_transform    .fitted   .se.fit    .resid        .hat
## 1 10.63 18.6      4.38823  0.3274957 1.8980499 -1.319274 0.876944574
## 2 13.00 21.1      2.23406 -6.0434942 0.9907719  3.477316 0.002318712
##      .sigma    .cooksd .std.resid index
## 1 0.6264712 5.73947492  -3.760839   159
## 2 0.6300430 0.07011183   3.481355   338

We can build additional models through the use of backstepping methods and use anova to differentiate the significance betwee models

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## target ~ (zn + indus + chas + nox + rm + age + dis + rad + tax + 
##     ptratio + black + lstat + medv + bk_transform) - bk_transform - 
##     chas
## 
## Final Model:
## target ~ zn + nox + age + dis + rad + tax + ptratio + black + 
##     lstat + medv
## 
## 
##      Step Df  Deviance Resid. Df Resid. Dev      AIC
## 1                            453   188.0190 214.0190
## 2    - rm  1 0.9131359       454   188.9321 212.9321
## 3 - indus  1 1.5819995       455   190.5141 212.5141

AIC backstepping removed the rm predictor and the indus predictor. Both of these predictors had high p values in the full model.

Lets formulate the model from AIC backstep.

## 
## Call:
## glm(formula = target ~ zn + nox + age + dis + rad + tax + ptratio + 
##     black + lstat + medv, family = binomial, data = crime_training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1610  -0.1556  -0.0017   0.0019   3.3921  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -34.831121   6.585833  -5.289 1.23e-07 ***
## zn           -0.070361   0.032981  -2.133 0.032891 *  
## nox          43.191700   6.752210   6.397 1.59e-10 ***
## age           0.027461   0.011238   2.444 0.014544 *  
## dis           0.671699   0.216605   3.101 0.001929 ** 
## rad           0.743199   0.154146   4.821 1.43e-06 ***
## tax          -0.008639   0.002758  -3.133 0.001732 ** 
## ptratio       0.353646   0.115275   3.068 0.002156 ** 
## black        -0.011777   0.006317  -1.864 0.062280 .  
## lstat         0.068969   0.048236   1.430 0.152769    
## medv          0.147677   0.042484   3.476 0.000509 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 190.51  on 455  degrees of freedom
## AIC: 212.51
## 
## Number of Fisher Scoring iterations: 9

##       zn      nox      age      dis      rad      tax  ptratio    black 
## 1.848232 3.198787 1.784884 3.624938 1.690236 1.822941 1.889717 1.064965 
##    lstat     medv 
## 2.121835 3.021578

##   target zn   nox  age    dis rad tax ptratio  black lstat medv   .fitted
## 1      0  0 0.520 85.4 2.7147   5 384    20.9  70.80 10.63 18.6  2.233041
## 2      1 20 0.464 42.1 4.4290   3 223    18.6 388.65 13.00 21.1 -5.750074
## 3      1  0 0.489  9.8 3.5875   4 277    18.6 348.93 29.55 23.7 -2.445420
##     .se.fit    .resid        .hat    .sigma    .cooksd .std.resid index
## 1 2.0078998 -2.160961 0.352559871 0.6354115 0.71324454  -2.685636   159
## 2 0.9463457  3.392124 0.002832146 0.6278678 0.08136015   3.396937   338
## 3 1.3230931  2.248802 0.128510226 0.6378504 0.17744046   2.408907   457

##   target zn   nox  age   dis rad tax ptratio  black lstat medv   .fitted
## 1      1 20 0.464 42.1 4.429   3 223    18.6 388.65    13 21.1 -5.750074
##     .se.fit   .resid        .hat    .sigma    .cooksd .std.resid index
## 1 0.9463457 3.392124 0.002832146 0.6278678 0.08136015   3.396937   338

We can also use the bestglm package to narrow down viable models using several goodness of fit criteria. We would need to change the data frame such that the last column contains the response variable.

## Warning: package 'bestglm' was built under R version 3.4.4

## Morgan-Tatar search since family is non-gaussian.

## AIC
## BICq equivalent for q in (0.88551381531015, 0.897513637623221)
## Best Model:
##                 Estimate  Std. Error   z value     Pr(>|z|)
## (Intercept) -34.83112133 6.585833395 -5.288795 1.231249e-07
## medv          0.14767672 0.042483810  3.476070 5.088187e-04
## lstat         0.06896872 0.048235949  1.429820 1.527687e-01
## black        -0.01177713 0.006317191 -1.864299 6.227974e-02
## ptratio       0.35364572 0.115275321  3.067835 2.156153e-03
## tax          -0.00863917 0.002757684 -3.132763 1.731693e-03
## rad           0.74319887 0.154146474  4.821381 1.425678e-06
## dis           0.67169851 0.216605206  3.101027 1.928510e-03
## age           0.02746097 0.011238231  2.443531 1.454431e-02
## nox          43.19170029 6.752210255  6.396676 1.587959e-10
## zn           -0.07036110 0.032980585 -2.133410 3.289113e-02

Formulate the model built from the best GLM package in addition to removing non significant predictors

## 
## Call:
## glm(formula = target ~ . - indus - chas - rm - bk_transform - 
##     black, family = binomial, data = crime_training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9279  -0.1640  -0.0016   0.0027   3.3989  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -38.786168   6.161170  -6.295 3.07e-10 ***
## zn           -0.072933   0.033022  -2.209 0.027199 *  
## nox          43.014846   6.689945   6.430 1.28e-10 ***
## age           0.028379   0.011401   2.489 0.012809 *  
## dis           0.658786   0.214787   3.067 0.002161 ** 
## rad           0.739415   0.152375   4.853 1.22e-06 ***
## tax          -0.008045   0.002651  -3.035 0.002408 ** 
## ptratio       0.334196   0.111780   2.990 0.002792 ** 
## lstat         0.064814   0.048427   1.338 0.180769    
## medv          0.138257   0.041630   3.321 0.000897 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 195.51  on 456  degrees of freedom
## AIC: 215.51
## 
## Number of Fisher Scoring iterations: 9

##       zn      nox      age      dis      rad      tax  ptratio    lstat 
## 1.817650 3.146723 1.840891 3.564748 1.681321 1.775775 1.871613 2.133702 
##     medv 
## 2.975434

##   target zn indus chas   nox    rm  age    dis rad tax ptratio  black
## 1      1 22  5.86    0 0.431 8.259  8.4 8.9067   7 330    19.1 396.90
## 2      1 20  6.96    0 0.464 5.856 42.1 4.4290   3 223    18.6 388.65
## 3      1  0 10.59    0 0.489 5.412  9.8 3.5875   4 277    18.6 348.93
##   lstat medv bk_transform    .fitted   .se.fit   .resid        .hat
## 1  3.54 42.8     2.217302 -0.6941428 0.9919713 1.482752 0.218595632
## 2 13.00 21.1     2.234060 -5.7732971 0.9456040 3.398942 0.002763196
## 3 29.55 23.7     2.322898 -2.9731063 1.3334338 2.458856 0.082302853
##      .sigma    .cooksd .std.resid index
## 1 0.6507833 0.07167244   1.677376    14
## 2 0.6357999 0.08935649   3.403647   338
## 3 0.6443780 0.19108200   2.566749   457

##   target zn indus chas   nox    rm  age   dis rad tax ptratio  black lstat
## 1      1 20  6.96    0 0.464 5.856 42.1 4.429   3 223    18.6 388.65    13
##   medv bk_transform   .fitted  .se.fit   .resid        .hat    .sigma
## 1 21.1      2.23406 -5.773297 0.945604 3.398942 0.002763196 0.6357999
##      .cooksd .std.resid index
## 1 0.08935649   3.403647   338

IV Model Selection) We built a total of 5 models with different combinations of predictors. This section, we define the methodology used to select the best model to use on the evaluation data based on certain key metrics that we will define later on in this study.

For each model we produced an examination of cook’s distance, standard residuals, variance inflation numbers, and model summary. Based on a combination of these diagnostics, I decided to go with model 5. Model 5 as an AIC of 212.51, which is marginally different than the AIC produced by the other models. Model 5 is less complex and does not have any predictors with high p values.

Psuedo R square value for logisitc regression

## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis

##          llh      llhNull           G2     McFadden         r2ML 
##  -97.7573458 -322.9379132  450.3611349    0.6972875    0.6195651 
##         r2CU 
##    0.8261680

McFadden’s R square value is used as the linear regression r square but for logistic regression instead. The McFadden’s statistic of 0.697 indicates that model 5 has decent predictive power on a 0 to 1 scale.

We will take this model and apply it to our testing data

Generate a confusion matrix

library(pROC)

## Warning: package 'pROC' was built under R version 3.4.4

## Type 'citation("pROC")' for a citation.

## 
## Attaching package: 'pROC'

## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var

library(caret)

names(classes.df)<-c("class", "scored.class", "scored.probability")

matrix.df2<-with(classes.df, table(scored.class, class)[2:1, 2:1])
matrix.df2

##             class
## scored.class  1  0
##            1 62  7
##            0  6 64

Use the functionality of the caret package to compute relevant statistics from the confusion matrix

## Confusion Matrix and Statistics
## 
##             class
## scored.class  1  0
##            1 62  7
##            0  6 64
##                                           
##                Accuracy : 0.9065          
##                  95% CI : (0.8454, 0.9493)
##     No Information Rate : 0.5108          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8129          
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9118          
##             Specificity : 0.9014          
##          Pos Pred Value : 0.8986          
##          Neg Pred Value : 0.9143          
##              Prevalence : 0.4892          
##          Detection Rate : 0.4460          
##    Detection Prevalence : 0.4964          
##       Balanced Accuracy : 0.9066          
##                                           
##        'Positive' Class : 1               
##

## [1] 0.9051095

## Area under the curve: 0.966

Model 5 has a high accuracy and high F1 score. The AUC is 0.9585. We have more than enough evidence to conclude that model 5 is the best model to go with.

Interpretation) Now that the best model has been picked, what does the model actually mean in the sense of crime propensity?

## 
## Call:
## glm(formula = target ~ . - indus - chas - rm - bk_transform - 
##     black, family = binomial, data = crime_training)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9279  -0.1640  -0.0016   0.0027   3.3989  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -38.786168   6.161170  -6.295 3.07e-10 ***
## zn           -0.072933   0.033022  -2.209 0.027199 *  
## nox          43.014846   6.689945   6.430 1.28e-10 ***
## age           0.028379   0.011401   2.489 0.012809 *  
## dis           0.658786   0.214787   3.067 0.002161 ** 
## rad           0.739415   0.152375   4.853 1.22e-06 ***
## tax          -0.008045   0.002651  -3.035 0.002408 ** 
## ptratio       0.334196   0.111780   2.990 0.002792 ** 
## lstat         0.064814   0.048427   1.338 0.180769    
## medv          0.138257   0.041630   3.321 0.000897 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 645.88  on 465  degrees of freedom
## Residual deviance: 195.51  on 456  degrees of freedom
## AIC: 215.51
## 
## Number of Fisher Scoring iterations: 9

We can also convert coefficients to odds ratios

##               Odds ratio        2.5 %       97.5 %
## (Intercept) 1.430149e-17 8.146750e-23 2.510606e-12
## zn          9.296628e-01 8.713995e-01 9.918218e-01
## nox         4.798551e+18 9.696589e+12 2.374659e+24
## age         1.028785e+00 1.006050e+00 1.052033e+00
## dis         1.932445e+00 1.268469e+00 2.943979e+00
## rad         2.094711e+00 1.553896e+00 2.823750e+00
## tax         9.919877e-01 9.868469e-01 9.971552e-01
## ptratio     1.396817e+00 1.121999e+00 1.738947e+00
## lstat       1.066961e+00 9.703478e-01 1.173193e+00
## medv        1.148270e+00 1.058299e+00 1.245891e+00

The increase in logit score per unit increase is high relative to the other coefficients for medv, ptratio, and nox. We see that zn and tax have a negative effect on the crime rate being greater than the median.

Appendix:

crime_training <- read.csv(url("https://raw.githubusercontent.com/vindication09/DATA-621-Week-3/master/crime-training-data.csv"), header =TRUE)

crime_evaluation <- read.csv(url("https://raw.githubusercontent.com/vindication09/DATA-621-Week-3/master/crime-evaluation-data.csv"), header =TRUE)

head(crime_training, 10)
nrow(crime_training);ncol(crime_training);names(crime_training)
table(crime_training$target);

barplot(table(crime_training$target), ylim=c(0, 500), xlab="Result", ylab="N", col="black",
        main="Distribution of Target(Response")
table(crime_training$chas);

barplot(table(crime_training$chas), ylim=c(0, 500), xlab="Result", ylab="N", col="black",
        main="Distribution of Chas")
#table(crime_training$age);

barplot(table(crime_training$age), ylim=c(0, 50), xlab="Result", ylab="N", col="black",
        main="Distribution of Age")
#table(crime_training$age);

barplot(table(crime_training$zn), ylim=c(0, 150), xlab="Result", ylab="N", col="black",
        main="Distribution of Zn")
#table(crime_training$age);

barplot(table(crime_training$indus), ylim=c(0, 150), xlab="Result", ylab="N", col="black",
        main="Distribution of Indus")
library(ggplot2)
library(tidyr)

crime_training_plot_subset <- subset(crime_training, select = -c(chas, target, age, zn,indus))

ggplot(gather(crime_training_plot_subset), aes(value)) + 
    geom_histogram(bins = 10) + 
    facet_wrap(~key, scales = 'free_x');summary(crime_training_plot_subset)
apply(crime_training,2, function(col)cor(col, crime_training$target))

#correlation matrix and visualization 
correlation_matrix <- round(cor(crime_training),2)

# Get lower triangle of the correlation matrix
  get_lower_tri<-function(correlation_matrix){
    correlation_matrix[upper.tri(correlation_matrix)] <- NA
    return(correlation_matrix)
  }
  # Get upper triangle of the correlation matrix
  get_upper_tri <- function(correlation_matrix){
    correlation_matrix[lower.tri(correlation_matrix)]<- NA
    return(correlation_matrix)
  }
  
  upper_tri <- get_upper_tri(correlation_matrix)



library(reshape2)

# Melt the correlation matrix
melted_correlation_matrix <- melt(upper_tri, na.rm = TRUE)

# Heatmap
library(ggplot2)

ggheatmap <- ggplot(data = melted_correlation_matrix, aes(Var2, Var1, fill = value))+
 geom_tile(color = "white")+
 scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
   name="Pearson\nCorrelation") +
  theme_minimal()+ 
 theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 15, hjust = 1))+
 coord_fixed()


#add nice labels 
ggheatmap + 
geom_text(aes(Var2, Var1, label = value), color = "black", size = 3) +
theme(
  axis.title.x = element_blank(),
  axis.title.y = element_blank(),
  axis.text.x=element_text(size=rel(0.8), angle=90),
  axis.text.y=element_text(size=rel(0.8)),
  panel.grid.major = element_blank(),
  panel.border = element_blank(),
  panel.background = element_blank(),
  axis.ticks = element_blank(),
  legend.justification = c(1, 0),
  legend.position = c(0.6, 0.7),
  legend.direction = "horizontal")+
  guides(fill = guide_colorbar(barwimoneyball_training2h = 7, barheight = 1,
                title.position = "top", title.hjust = 0.5))

#missing data 
colSums(is.na(crime_training));colSums(is.na(crime_evaluation))
crime_training$bk_transform=sqrt((1000/crime_training$black))+0.63
crime_evaluation$bk_transform=sqrt((1000/crime_evaluation$black))+0.63
head(crime_training, 1);summary(crime_training$bk_transform)
library(tidyverse)
library(dplyr)
library(broom)
library(car)

mod1<-glm(target~.-bk_transform,  family=binomial, data=crime_training)

summary(mod1);

plot(mod1, which = 4, id.n = 3);

mod1.data <- augment(mod1) %>% 
  mutate(index = 1:n()) ;

car::vif(mod1)

mod1.data %>% top_n(3, .cooksd);

ggplot(mod1.data, aes(index, .std.resid)) + 
  geom_point(aes(color = target), alpha = .5) +
  theme_bw();

mod1.data %>% 
  filter(abs(.std.resid) > 3)

mod2<-glm(target~.-black,  family=binomial, data=crime_training)
summary(mod2);

plot(mod2, which = 4, id.n = 3);

mod2.data <- augment(mod2) %>% 
  mutate(index = 1:n()) ;

car::vif(mod2)

mod2.data %>% top_n(3, .cooksd);

ggplot(mod2.data, aes(index, .std.resid)) + 
  geom_point(aes(color = target), alpha = .5) +
  theme_bw();

mod2.data %>% 
  filter(abs(.std.resid) > 3)

mod3<-glm(target~.-bk_transform-chas,  family=binomial, data=crime_training)
summary(mod3);

plot(mod3, which = 4, id.n = 3);

mod3.data <- augment(mod2) %>% 
  mutate(index = 1:n()) ;

car::vif(mod3)

mod3.data %>% top_n(3, .cooksd);

ggplot(mod3.data, aes(index, .std.resid)) + 
  geom_point(aes(color = target), alpha = .5) +
  theme_bw();

mod3.data %>% 
  filter(abs(.std.resid) > 3)
library(MASS)
step<-stepAIC(mod3, trace=FALSE)
step$anova
mod4<-glm(target ~ zn + nox + age + dis + rad + tax + ptratio + black + lstat + medv,  family=binomial, data=crime_training)
summary(mod4);

plot(mod4, which = 4, id.n = 3);

mod4.data <- augment(mod4) %>% 
  mutate(index = 1:n()) ;

car::vif(mod4)

mod4.data %>% top_n(3, .cooksd);

ggplot(mod4.data, aes(index, .std.resid)) + 
  geom_point(aes(color = target), alpha = .5) +
  theme_bw();

mod4.data %>% 
  filter(abs(.std.resid) > 3)
Xy<-crime_training_switch<-subset(crime_training, select=c(medv, lstat, black, ptratio, tax, rad, dis, age, rm, nox, chas, indus, zn , target))
library(leaps)
library(bestglm)
best_mod<-bestglm(Xy, IC="AIC", family=binomial)
best_mod
mod5<-glm(target~.-indus-chas-rm-bk_transform-black, family=binomial, data=crime_training)
summary(mod5);

plot(mod5, which = 4, id.n = 3);

mod5.data <- augment(mod5) %>% 
  mutate(index = 1:n()) ;

car::vif(mod5)

mod5.data %>% top_n(3, .cooksd);

ggplot(mod5.data, aes(index, .std.resid)) + 
  geom_point(aes(color = target), alpha = .5) +
  theme_bw();

mod5.data %>% 
  filter(abs(.std.resid) > 3)
library(pscl)
pR2(mod5)
#The evaluation data does not contain the target response variable. To test predictive power, we need to partition our training data into a test and evaluation 
library(caret)
Train <- createDataPartition(crime_training$target, p=0.7, list=FALSE)
train <- crime_training[Train, ]
test <- crime_training[-Train, ]


#probabilities of prediction 
pred <- predict(mod5, newdata=test, type="response", na.action = na.pass)
scored.probability<-data.frame(as.double(pred))


#probability threshold 
pred_filter <- ifelse(pred > 0.5, 1, 0)
scored.class<-data.frame(as.integer(pred_filter))

#subset actuals
class<-data.frame(subset(test, select=c(target)))
class<-as.integer(class$target)

#library(plyr)
classes.df<-data.frame(class, scored.class, scored.probability)
str(classes.df)
names(classes.df)
library(pROC)
library(caret)

names(classes.df)<-c("class", "scored.class", "scored.probability")

matrix.df2<-with(classes.df, table(scored.class, class)[2:1, 2:1])
matrix.df2

#confusion matrix
caret_matrix <- confusionMatrix(matrix.df2)
#Information from Confusion matrix
caret_matrix;

f1_reduced<- function(df)
  {
  TP <- sum(classes.df$class == 1 & classes.df$scored.class == 1) 
  Fn <- sum(classes.df$class == 1 & classes.df$scored.class == 0)
  FP <- sum(classes.df$class == 0 & classes.df$scored.class == 1)
  (2*TP)/(2*TP + Fn + FP)
}
f1_reduced(classes.df);

plot(roc(classes.df$class, classes.df$scored.probability), main="ROC Curve");

auc(roc(classes.df$class, classes.df$scored.probability))
summary(mod5)
#exp(coefficients(mod5))
exp(cbind("Odds ratio" = coef(mod5), confint.default(mod5, level = 0.95)))
crime_training <- read.csv(url("https://raw.githubusercontent.com/vindication09/DATA-621-Week-3/master/crime-training-data.csv"), header =TRUE)

crime_evaluation <- read.csv(url("https://raw.githubusercontent.com/vindication09/DATA-621-Week-3/master/crime-evaluation-data.csv"), header =TRUE)

head(crime_training, 10)


nrow(crime_training);ncol(crime_training);names(crime_training)


table(crime_training$target);

barplot(table(crime_training$target), ylim=c(0, 500), xlab="Result", ylab="N", col="black",
        main="Distribution of Target(Response")

    table(crime_training$chas);

barplot(table(crime_training$chas), ylim=c(0, 500), xlab="Result", ylab="N", col="black",
        main="Distribution of Chas")

        #table(crime_training$age);

barplot(table(crime_training$age), ylim=c(0, 50), xlab="Result", ylab="N", col="black",
        main="Distribution of Age")


#table(crime_training$age);

barplot(table(crime_training$zn), ylim=c(0, 150), xlab="Result", ylab="N", col="black",
        main="Distribution of Zn")


#table(crime_training$age);

barplot(table(crime_training$indus), ylim=c(0, 150), xlab="Result", ylab="N", col="black",
        main="Distribution of Indus")


library(ggplot2)
library(tidyr)

crime_training_plot_subset <- subset(crime_training, select = -c(chas, target, age, zn,indus))

ggplot(gather(crime_training_plot_subset), aes(value)) + 
    geom_histogram(bins = 10) + 
    facet_wrap(~key, scales = 'free_x');summary(crime_training_plot_subset)

    apply(crime_training,2, function(col)cor(col, crime_training$target))



#correlation matrix and visualization 
correlation_matrix <- round(cor(crime_training),2)

# Get lower triangle of the correlation matrix
  get_lower_tri<-function(correlation_matrix){
    correlation_matrix[upper.tri(correlation_matrix)] <- NA
    return(correlation_matrix)
  }
  # Get upper triangle of the correlation matrix
  get_upper_tri <- function(correlation_matrix){
    correlation_matrix[lower.tri(correlation_matrix)]<- NA
    return(correlation_matrix)
  }
  
  upper_tri <- get_upper_tri(correlation_matrix)



library(reshape2)

# Melt the correlation matrix
melted_correlation_matrix <- melt(upper_tri, na.rm = TRUE)

# Heatmap
library(ggplot2)

ggheatmap <- ggplot(data = melted_correlation_matrix, aes(Var2, Var1, fill = value))+
 geom_tile(color = "white")+
 scale_fill_gradient2(low = "blue", high = "red", mid = "white", 
   midpoint = 0, limit = c(-1,1), space = "Lab", 
   name="Pearson\nCorrelation") +
  theme_minimal()+ 
 theme(axis.text.x = element_text(angle = 45, vjust = 1, 
    size = 15, hjust = 1))+
 coord_fixed()


#add nice labels 
ggheatmap + 
geom_text(aes(Var2, Var1, label = value), color = "black", size = 3) +
theme(
  axis.title.x = element_blank(),
  axis.title.y = element_blank(),
  axis.text.x=element_text(size=rel(0.8), angle=90),
  axis.text.y=element_text(size=rel(0.8)),
  panel.grid.major = element_blank(),
  panel.border = element_blank(),
  panel.background = element_blank(),
  axis.ticks = element_blank(),
  legend.justification = c(1, 0),
  legend.position = c(0.6, 0.7),
  legend.direction = "horizontal")+
  guides(fill = guide_colorbar(barwimoneyball_training2h = 7, barheight = 1,
                title.position = "top", title.hjust = 0.5))



colSums(is.na(crime_training));colSums(is.na(crime_evaluation))



crime_training$bk_transform=sqrt((1000/crime_training$black))+0.63
crime_evaluation$bk_transform=sqrt((1000/crime_evaluation$black))+0.63
head(crime_training, 1);summary(crime_training$bk_transform)




library(tidyverse)
library(dplyr)
library(broom)
library(car)

mod1<-glm(target~.-bk_transform,  family=binomial, data=crime_training)

summary(mod1);

plot(mod1, which = 4, id.n = 3);

mod1.data <- augment(mod1) %>% 
  mutate(index = 1:n()) ;

car::vif(mod1)

mod1.data %>% top_n(3, .cooksd);

ggplot(mod1.data, aes(index, .std.resid)) + 
  geom_point(aes(color = target), alpha = .5) +
  theme_bw();

mod1.data %>% 
  filter(abs(.std.resid) > 3)



  mod2<-glm(target~.-black,  family=binomial, data=crime_training)
summary(mod2);

plot(mod2, which = 4, id.n = 3);

mod2.data <- augment(mod2) %>% 
  mutate(index = 1:n()) ;

car::vif(mod2)

mod2.data %>% top_n(3, .cooksd);

ggplot(mod2.data, aes(index, .std.resid)) + 
  geom_point(aes(color = target), alpha = .5) +
  theme_bw();

mod2.data %>% 
  filter(abs(.std.resid) > 3)




mod3<-glm(target~.-bk_transform-chas,  family=binomial, data=crime_training)
summary(mod3);

plot(mod3, which = 4, id.n = 3);

mod3.data <- augment(mod2) %>% 
  mutate(index = 1:n()) ;

car::vif(mod3)

mod3.data %>% top_n(3, .cooksd);

ggplot(mod3.data, aes(index, .std.resid)) + 
  geom_point(aes(color = target), alpha = .5) +
  theme_bw();

mod3.data %>% 
  filter(abs(.std.resid) > 3)




  library(MASS)
step<-stepAIC(mod3, trace=FALSE)
step$anova



mod4<-glm(target ~ zn + nox + age + dis + rad + tax + ptratio + black + lstat + medv,  family=binomial, data=crime_training)
summary(mod4);

plot(mod4, which = 4, id.n = 3);

mod4.data <- augment(mod4) %>% 
  mutate(index = 1:n()) ;

car::vif(mod4)

mod4.data %>% top_n(3, .cooksd);

ggplot(mod4.data, aes(index, .std.resid)) + 
  geom_point(aes(color = target), alpha = .5) +
  theme_bw();

mod4.data %>% 
  filter(abs(.std.resid) > 3)




  Xy<-crime_training_switch<-subset(crime_training, select=c(medv, lstat, black, ptratio, tax, rad, dis, age, rm, nox, chas, indus, zn , target))
library(leaps)
library(bestglm)
best_mod<-bestglm(Xy, IC="AIC", family=binomial)
best_mod





mod5<-glm(target~.-indus-chas-rm-bk_transform-black, family=binomial, data=crime_training)
summary(mod5);

plot(mod5, which = 4, id.n = 3);

mod5.data <- augment(mod5) %>% 
  mutate(index = 1:n()) ;

car::vif(mod5)

mod5.data %>% top_n(3, .cooksd);

ggplot(mod5.data, aes(index, .std.resid)) + 
  geom_point(aes(color = target), alpha = .5) +
  theme_bw();

mod5.data %>% 
  filter(abs(.std.resid) > 3)



  library(pscl)
pR2(mod5)



#The evaluation data does not contain the target response variable. To test predictive power, we need to partition our training data into a test and evaluation 
Train <- createDataPartition(crime_training$target, p=0.7, list=FALSE)
train <- crime_training[Train, ]
test <- crime_training[-Train, ]


#probabilities of prediction 
pred <- predict(mod5, newdata=test, type="response", na.action = na.pass)
scored.probability<-data.frame(as.double(pred))


#probability threshold 
pred_filter <- ifelse(pred > 0.5, 1, 0)
scored.class<-data.frame(as.integer(pred_filter))

#subset actuals
class<-data.frame(subset(test, select=c(target)))
class<-as.integer(class$target)

#library(plyr)
classes.df<-data.frame(class, scored.class, scored.probability)
str(classes.df)
names(classes.df)



library(pROC)
library(caret)

names(classes.df)<-c("class", "scored.class", "scored.probability")

matrix.df2<-with(classes.df, table(scored.class, class)[2:1, 2:1])
matrix.df2



#confusion matrix
caret_matrix <- confusionMatrix(matrix.df2)
#Information from Confusion matrix
caret_matrix;

f1_reduced<- function(df)
  {
  TP <- sum(classes.df$class == 1 & classes.df$scored.class == 1) 
  Fn <- sum(classes.df$class == 1 & classes.df$scored.class == 0)
  FP <- sum(classes.df$class == 0 & classes.df$scored.class == 1)
  (2*TP)/(2*TP + Fn + FP)
}
f1_reduced(classes.df);

plot(roc(classes.df$class, classes.df$scored.probability), main="ROC Curve");

auc(roc(classes.df$class, classes.df$scored.probability))



exp(cbind("Odds ratio" = coef(mod5), confint.default(mod5, level = 0.95)))

DATA 621 Week 3

Vinicio Haro

June 26, 2018