Introduction)
We are given a set of data containing several predictor variables detailing city attributes such as lot sizes and property values. The data contains a response variable that detailsif a crime rate is greater than the median crime rate. Each row of the data represents a city/neighborhood.
Our goal is to build a model which predicts a cities crime propensity as a function of predictior variables.
The data can be pulled from the following github URL’s. Training: https://raw.githubusercontent.com/vindication09/DATA-621-Week-3/master/crime-training-data.csv Evaluation: https://raw.githubusercontent.com/vindication09/DATA-621-Week-3/master/crime-evaluation-data.csv
Read in the Data
## zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 0 19.58 0 0.605 7.929 96.2 2.0459 5 403 14.7 369.30 3.70
## 2 0 19.58 1 0.871 5.403 100.0 1.3216 5 403 14.7 396.90 26.82
## 3 0 18.10 0 0.740 6.485 100.0 1.9784 24 666 20.2 386.73 18.85
## 4 30 4.93 0 0.428 6.393 7.8 7.0355 6 300 16.6 374.71 5.19
## 5 0 2.46 0 0.488 7.155 92.2 2.7006 3 193 17.8 394.12 4.82
## 6 0 8.56 0 0.520 6.781 71.3 2.8561 5 384 20.9 395.58 7.67
## 7 0 18.10 0 0.693 5.453 100.0 1.4896 24 666 20.2 396.90 30.59
## 8 0 18.10 0 0.693 4.519 100.0 1.6582 24 666 20.2 88.27 36.98
## 9 0 5.19 0 0.515 6.316 38.1 6.4584 5 224 20.2 389.71 5.68
## 10 80 3.64 0 0.392 5.876 19.1 9.2203 1 315 16.4 395.18 9.25
## medv target
## 1 50.0 1
## 2 13.4 1
## 3 15.4 1
## 4 23.7 0
## 5 37.9 0
## 6 26.5 0
## 7 5.0 1
## 8 7.0 1
## 9 22.2 0
## 10 20.9 0
Any transformations or alterations to any of the variables will be done to both the training and evaluation dataset.
## [1] 466
## [1] 14
## [1] "zn" "indus" "chas" "nox" "rm" "age" "dis"
## [8] "rad" "tax" "ptratio" "black" "lstat" "medv" "target"
The data contains 466 rows with each row belonging to a city/neighborhood. There are 14 variables,some continous, discrete, and coded categorical. The variable type plays a major role when it comes to EDA and modeling. our response variable is discrete containing a 1 or a 0. 1 means the crime rate is over the median and 0 means the crime rate is below the median.
The variables are defined as follows:
zn: proportion of residential land zoned for large lots (over 25000 square feet) (predictor variable) indus: proportion of non-retail business acres per suburb (predictor variable) chas: a dummy var. for whether the suburb borders the Charles River (1) or not (0) (predictor variable) nox: nitrogen oxides concentration (parts per 10 million) (predictor variable) rm: average number of rooms per dwelling (predictor variable) age: proportion of owner-occupied units built prior to 1940 (predictor variable) dis: weighted mean of distances to five Boston employment centers (predictor variable) rad: index of accessibility to radial highways (predictor variable) tax: full-value property-tax rate per $10,000 (predictor variable) ptratio: pupil-teacher ratio by town (predictor variable) black: 1000(Bk - 0.63)2 where Bk is the proportion of blacks by town (predictor variable) lstat: lower status of the population (percent) (predictor variable) medv: median value of owner-occupied homes in $1000s (predictor variable) target: whether the crime rate is above the median crime rate (1) or not (0) (response variable)
Since we are dealing with coded categorical values in some instances, we need to think of EDA based off frequency rather than distributions on an x,yplot. What are the number of instances for each coded categorical or proportion variable?
Target(response)
##
## 0 1
## 237 229
There are slightly more instances where the crime rate is less than the median.
Chas
##
## 0 1
## 433 33
There are much fewer towns that border the Charles river.
Age
Zn
Indus
As for the other variables, we are able to plot their spread and descriptive statistics.
## nox rm dis rad
## Min. :0.3890 Min. :3.863 Min. : 1.130 Min. : 1.00
## 1st Qu.:0.4480 1st Qu.:5.887 1st Qu.: 2.101 1st Qu.: 4.00
## Median :0.5380 Median :6.210 Median : 3.191 Median : 5.00
## Mean :0.5543 Mean :6.291 Mean : 3.796 Mean : 9.53
## 3rd Qu.:0.6240 3rd Qu.:6.630 3rd Qu.: 5.215 3rd Qu.:24.00
## Max. :0.8710 Max. :8.780 Max. :12.127 Max. :24.00
## tax ptratio black lstat
## Min. :187.0 Min. :12.6 Min. : 0.32 Min. : 1.730
## 1st Qu.:281.0 1st Qu.:16.9 1st Qu.:375.61 1st Qu.: 7.043
## Median :334.5 Median :18.9 Median :391.34 Median :11.350
## Mean :409.5 Mean :18.4 Mean :357.12 Mean :12.631
## 3rd Qu.:666.0 3rd Qu.:20.2 3rd Qu.:396.24 3rd Qu.:16.930
## Max. :711.0 Max. :22.0 Max. :396.90 Max. :37.970
## medv
## Min. : 5.00
## 1st Qu.:17.02
## Median :21.20
## Mean :22.59
## 3rd Qu.:25.00
## Max. :50.00
The plot of the spreads give us information regarding what variables could contain extreme values. Of course we would need a deeper investigation to determine if the extreme values are cause of error or totally plausible. We know what to look out for furthur into EDA. Variables such as dis and ptratio have visible skews.
How does each variable correlate to the response variable?
## zn indus chas nox rm age
## -0.43168176 0.60485074 0.08004187 0.72610622 -0.15255334 0.63010625
## dis rad tax ptratio black lstat
## -0.61867312 0.62810492 0.61111331 0.25084892 -0.35295680 0.46912702
## medv target
## -0.27055071 1.00000000
There are some variables that are alarming. Indus, nox,age, rad, and tax have a strong positive correlation with the response variable while zn and dis have a strong negative correlation. How do the rest of variables correlate with each other?
##
## Attaching package: 'reshape2'
## The following object is masked from 'package:tidyr':
##
## smiths
From the correlation heatmap, chas stands out in the sense that it does not have much correlation with any of the other variables. It might be a good candidate for removal.
Nox has a high positive correlation with target. Nox represents nitrogen oxides concentration. This correlation is not so intuitive. On the outside what does nitrogen concentration have to do if a neighborhood/city has a crime rate above the median? Without any context such as pollution or air quality, it might be difficult to see the connection. Nox also has a very strong negative correlation with dis. Dis represents the weighted mean of distances to five Boston employment center. This correlation is more intuitive to infer. A employment center perhaps could mean the presense of industry including factories.
Rad and tax have a strong positive correlation 0.91. Rad measures the index of accessibility to radial highway while tax measures full-value property-tax rate per $10,00. This correlation is moe intuitive. Areas with high taxes usually have better infrastructure including number of roads.
The high correlation examples highlighted might imply that they could be removed from the model but of course this can only be verified through diagnostics and quality of output accuracy. Chas is the other variable that could potentially be removed due to the minimal correlation with any other variable.
## zn indus chas nox rm age dis rad tax
## 0 0 0 0 0 0 0 0 0
## ptratio black lstat medv target
## 0 0 0 0 0
## zn indus chas nox rm age dis rad tax
## 0 0 0 0 0 0 0 0 0
## ptratio black lstat medv
## 0 0 0 0
The data is complete in both the training and evaluation sets.
Unlike linear regression models, there are certain assumptions that do not to be satisfied. Such assumptions are the normality of residuals, independence of predictor variables, and minimum sample size. Since the response variable is binary, it does not make sense to apply a transform to the response variable. In the later sections, we will be implimenting a family of the GLM which provides more felixibility.
In our EDA, we noticed that some variables were proportions. These variables are good candidates for dummy coding if needed. The only one that stands out is the black variable. The black variable is defined as 1000(Bk - 0.63)2 where Bk is the proportion of blacks by town. We can reverse engineer this value to make the proportions look more like the other variables with proportions.
\[ BK=1000\times(bk-0.63)^2 \]
Divide both sides by 1000, take the square root and add 0.63 \[ \sqrt{\frac{1000}{BK}} + 0.63=bk \]
Lets transform this variable using the derived equation
## zn indus chas nox rm age dis rad tax ptratio black lstat medv
## 1 0 19.58 0 0.605 7.929 96.2 2.0459 5 403 14.7 369.3 3.7 50
## target bk_transform
## 1 1 2.275547
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.217 2.219 2.229 2.749 2.262 56.532
I do not believe that we need to transform any other variable mainly because of the type of modeling we aim to perform. During the model building phase, we will have a better idea if we should use interactive terms or drop terms all together. Since the response variable is a binary operator, it will be left unchanged.
Our first model takes all the variables except the transformed black variable
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ tibble 1.4.2 ✔ dplyr 0.7.4
## ✔ readr 1.1.1 ✔ stringr 1.3.0
## ✔ purrr 0.2.4 ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## Warning: package 'broom' was built under R version 3.4.4
## Warning: package 'car' was built under R version 3.4.4
## Loading required package: carData
## Warning: package 'carData' was built under R version 3.4.4
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:purrr':
##
## some
##
## Call:
## glm(formula = target ~ . - bk_transform, family = binomial, data = crime_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2854 -0.1372 -0.0017 0.0020 3.4721
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -36.839521 7.028726 -5.241 1.59e-07 ***
## zn -0.061720 0.034410 -1.794 0.072868 .
## indus -0.072580 0.048546 -1.495 0.134894
## chas 1.032352 0.759627 1.359 0.174139
## nox 50.159513 8.049503 6.231 4.62e-10 ***
## rm -0.692145 0.741431 -0.934 0.350548
## age 0.034522 0.013883 2.487 0.012895 *
## dis 0.765795 0.234407 3.267 0.001087 **
## rad 0.663015 0.165135 4.015 5.94e-05 ***
## tax -0.006593 0.003064 -2.152 0.031422 *
## ptratio 0.442217 0.132234 3.344 0.000825 ***
## black -0.013094 0.006680 -1.960 0.049974 *
## lstat 0.047571 0.054508 0.873 0.382802
## medv 0.199734 0.071022 2.812 0.004919 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 186.15 on 452 degrees of freedom
## AIC: 214.15
##
## Number of Fisher Scoring iterations: 9
## zn indus chas nox rm age dis rad
## 1.846737 2.699929 1.262505 4.296876 6.070541 2.556302 4.025192 1.955112
## tax ptratio black lstat medv
## 2.174569 2.340600 1.091380 2.681384 8.567935
## target zn indus chas nox rm age dis rad tax ptratio black
## 1 1 22 5.86 0 0.431 8.259 8.4 8.9067 7 330 19.1 396.90
## 2 0 0 8.56 0 0.520 6.405 85.4 2.7147 5 384 20.9 70.80
## 3 1 0 10.59 0 0.489 5.412 9.8 3.5875 4 277 18.6 348.93
## lstat medv bk_transform .fitted .se.fit .resid .hat .sigma
## 1 3.54 42.8 2.217302 -1.177843 1.119924 1.700763 0.2257719 0.6359713
## 2 10.63 18.6 4.388230 2.535283 2.142840 -2.285406 0.3123801 0.6292069
## 3 29.55 23.7 2.322898 -3.118805 1.446039 2.514782 0.0847824 0.6304143
## .cooksd .std.resid index
## 1 0.08736447 1.932900 14
## 2 0.59554988 -2.756063 159
## 3 0.16353423 2.628683 457
## target zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 1 20 6.96 0 0.464 5.856 42.1 4.429 3 223 18.6 388.65 13
## medv bk_transform .fitted .se.fit .resid .hat .sigma
## 1 21.1 2.23406 -6.025248 0.9877586 3.472078 0.002346817 0.6212491
## .cooksd .std.resid index
## 1 0.06968255 3.476159 338
Before we make an assesment on the significance of predictors, lets build a model with the transformed black variable instead of the original variable.
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
##
## Call:
## glm(formula = target ~ . - black, family = binomial, data = crime_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8263 -0.1428 -0.0016 0.0021 3.4773
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -43.094274 7.092912 -6.076 1.23e-09 ***
## zn -0.065972 0.034845 -1.893 0.05832 .
## indus -0.065205 0.047752 -1.365 0.17210
## chas 0.933563 0.755711 1.235 0.21670
## nox 49.674892 8.001002 6.209 5.35e-10 ***
## rm -0.642784 0.728322 -0.883 0.37748
## age 0.034300 0.013870 2.473 0.01340 *
## dis 0.757860 0.232801 3.255 0.00113 **
## rad 0.668084 0.164127 4.071 4.69e-05 ***
## tax -0.006311 0.002988 -2.112 0.03467 *
## ptratio 0.408886 0.127689 3.202 0.00136 **
## lstat 0.047660 0.054204 0.879 0.37926
## medv 0.188734 0.069436 2.718 0.00657 **
## bk_transform 0.865903 0.903271 0.959 0.33775
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 191.15 on 452 degrees of freedom
## AIC: 219.15
##
## Number of Fisher Scoring iterations: 9
## zn indus chas nox rm
## 1.833812 2.674167 1.247015 4.225539 5.890692
## age dis rad tax ptratio
## 2.583552 3.969640 1.955796 2.150534 2.299080
## lstat medv bk_transform
## 2.650716 8.347214 1.046181
## target zn indus chas nox rm age dis rad tax ptratio black
## 1 1 22 5.86 0 0.431 8.259 8.4 8.9067 7 330 19.1 396.90
## 2 0 0 8.56 0 0.520 6.405 85.4 2.7147 5 384 20.9 70.80
## 3 1 0 10.59 0 0.489 5.412 9.8 3.5875 4 277 18.6 348.93
## lstat medv bk_transform .fitted .se.fit .resid .hat
## 1 3.54 42.8 2.217302 -1.2184250 1.119548 1.718995 0.22076293
## 2 10.63 18.6 4.388230 0.3274957 1.898050 -1.319274 0.87694457
## 3 29.55 23.7 2.322898 -3.4954511 1.442867 2.655310 0.05948963
## .sigma .cooksd .std.resid index
## 1 0.6445302 0.08782419 1.947332 14
## 2 0.6264712 5.73947492 -3.760839 159
## 3 0.6381263 0.15835842 2.738000 457
## target zn indus chas nox rm age dis rad tax ptratio black
## 1 0 0 8.56 0 0.520 6.405 85.4 2.7147 5 384 20.9 70.80
## 2 1 20 6.96 0 0.464 5.856 42.1 4.4290 3 223 18.6 388.65
## lstat medv bk_transform .fitted .se.fit .resid .hat
## 1 10.63 18.6 4.38823 0.3274957 1.8980499 -1.319274 0.876944574
## 2 13.00 21.1 2.23406 -6.0434942 0.9907719 3.477316 0.002318712
## .sigma .cooksd .std.resid index
## 1 0.6264712 5.73947492 -3.760839 159
## 2 0.6300430 0.07011183 3.481355 338
A shocking development arose when looking at the model using the transformed black variable. It seems that the transformed black variable causes the model to contain a variable that perfectly differentiates between 0’1 and 1s. This is caused to to a bias within the predictors. We will retain the unchanged black transform variable and use the AIC and deviance from mod1 as our benchmark .
We should consider removing the chas variable. During the EDA step, we found evidence that chas has a low correlation with any of the other variables. It also has a high p-value
##
## Call:
## glm(formula = target ~ . - bk_transform - chas, family = binomial,
## data = crime_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.2099 -0.1356 -0.0013 0.0018 3.4691
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -35.393661 6.866186 -5.155 2.54e-07 ***
## zn -0.069129 0.034604 -1.998 0.04575 *
## indus -0.057736 0.046446 -1.243 0.21384
## nox 48.003143 7.781466 6.169 6.88e-10 ***
## rm -0.703120 0.738498 -0.952 0.34105
## age 0.035141 0.013766 2.553 0.01069 *
## dis 0.738029 0.231466 3.189 0.00143 **
## rad 0.720422 0.163465 4.407 1.05e-05 ***
## tax -0.007419 0.003002 -2.472 0.01345 *
## ptratio 0.409562 0.128375 3.190 0.00142 **
## black -0.012478 0.006486 -1.924 0.05438 .
## lstat 0.057059 0.053145 1.074 0.28298
## medv 0.201062 0.070966 2.833 0.00461 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 188.02 on 453 degrees of freedom
## AIC: 214.02
##
## Number of Fisher Scoring iterations: 9
## zn indus nox rm age dis rad tax
## 1.829739 2.522200 4.044732 5.792163 2.521459 3.974603 1.932627 2.073288
## ptratio black lstat medv
## 2.240897 1.078848 2.512986 8.305521
## target zn indus chas nox rm age dis rad tax ptratio black
## 1 1 22 5.86 0 0.431 8.259 8.4 8.9067 7 330 19.1 396.90
## 2 0 0 8.56 0 0.520 6.405 85.4 2.7147 5 384 20.9 70.80
## 3 1 0 10.59 0 0.489 5.412 9.8 3.5875 4 277 18.6 348.93
## lstat medv bk_transform .fitted .se.fit .resid .hat
## 1 3.54 42.8 2.217302 -1.2184250 1.119548 1.718995 0.22076293
## 2 10.63 18.6 4.388230 0.3274957 1.898050 -1.319274 0.87694457
## 3 29.55 23.7 2.322898 -3.4954511 1.442867 2.655310 0.05948963
## .sigma .cooksd .std.resid index
## 1 0.6445302 0.08782419 1.947332 14
## 2 0.6264712 5.73947492 -3.760839 159
## 3 0.6381263 0.15835842 2.738000 457
## target zn indus chas nox rm age dis rad tax ptratio black
## 1 0 0 8.56 0 0.520 6.405 85.4 2.7147 5 384 20.9 70.80
## 2 1 20 6.96 0 0.464 5.856 42.1 4.4290 3 223 18.6 388.65
## lstat medv bk_transform .fitted .se.fit .resid .hat
## 1 10.63 18.6 4.38823 0.3274957 1.8980499 -1.319274 0.876944574
## 2 13.00 21.1 2.23406 -6.0434942 0.9907719 3.477316 0.002318712
## .sigma .cooksd .std.resid index
## 1 0.6264712 5.73947492 -3.760839 159
## 2 0.6300430 0.07011183 3.481355 338
We can build additional models through the use of backstepping methods and use anova to differentiate the significance betwee models
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
## Stepwise Model Path
## Analysis of Deviance Table
##
## Initial Model:
## target ~ (zn + indus + chas + nox + rm + age + dis + rad + tax +
## ptratio + black + lstat + medv + bk_transform) - bk_transform -
## chas
##
## Final Model:
## target ~ zn + nox + age + dis + rad + tax + ptratio + black +
## lstat + medv
##
##
## Step Df Deviance Resid. Df Resid. Dev AIC
## 1 453 188.0190 214.0190
## 2 - rm 1 0.9131359 454 188.9321 212.9321
## 3 - indus 1 1.5819995 455 190.5141 212.5141
AIC backstepping removed the rm predictor and the indus predictor. Both of these predictors had high p values in the full model.
Lets formulate the model from AIC backstep.
##
## Call:
## glm(formula = target ~ zn + nox + age + dis + rad + tax + ptratio +
## black + lstat + medv, family = binomial, data = crime_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1610 -0.1556 -0.0017 0.0019 3.3921
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -34.831121 6.585833 -5.289 1.23e-07 ***
## zn -0.070361 0.032981 -2.133 0.032891 *
## nox 43.191700 6.752210 6.397 1.59e-10 ***
## age 0.027461 0.011238 2.444 0.014544 *
## dis 0.671699 0.216605 3.101 0.001929 **
## rad 0.743199 0.154146 4.821 1.43e-06 ***
## tax -0.008639 0.002758 -3.133 0.001732 **
## ptratio 0.353646 0.115275 3.068 0.002156 **
## black -0.011777 0.006317 -1.864 0.062280 .
## lstat 0.068969 0.048236 1.430 0.152769
## medv 0.147677 0.042484 3.476 0.000509 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 190.51 on 455 degrees of freedom
## AIC: 212.51
##
## Number of Fisher Scoring iterations: 9
## zn nox age dis rad tax ptratio black
## 1.848232 3.198787 1.784884 3.624938 1.690236 1.822941 1.889717 1.064965
## lstat medv
## 2.121835 3.021578
## target zn nox age dis rad tax ptratio black lstat medv .fitted
## 1 0 0 0.520 85.4 2.7147 5 384 20.9 70.80 10.63 18.6 2.233041
## 2 1 20 0.464 42.1 4.4290 3 223 18.6 388.65 13.00 21.1 -5.750074
## 3 1 0 0.489 9.8 3.5875 4 277 18.6 348.93 29.55 23.7 -2.445420
## .se.fit .resid .hat .sigma .cooksd .std.resid index
## 1 2.0078998 -2.160961 0.352559871 0.6354115 0.71324454 -2.685636 159
## 2 0.9463457 3.392124 0.002832146 0.6278678 0.08136015 3.396937 338
## 3 1.3230931 2.248802 0.128510226 0.6378504 0.17744046 2.408907 457
## target zn nox age dis rad tax ptratio black lstat medv .fitted
## 1 1 20 0.464 42.1 4.429 3 223 18.6 388.65 13 21.1 -5.750074
## .se.fit .resid .hat .sigma .cooksd .std.resid index
## 1 0.9463457 3.392124 0.002832146 0.6278678 0.08136015 3.396937 338
We can also use the bestglm package to narrow down viable models using several goodness of fit criteria. We would need to change the data frame such that the last column contains the response variable.
## Warning: package 'bestglm' was built under R version 3.4.4
## Morgan-Tatar search since family is non-gaussian.
## AIC
## BICq equivalent for q in (0.88551381531015, 0.897513637623221)
## Best Model:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -34.83112133 6.585833395 -5.288795 1.231249e-07
## medv 0.14767672 0.042483810 3.476070 5.088187e-04
## lstat 0.06896872 0.048235949 1.429820 1.527687e-01
## black -0.01177713 0.006317191 -1.864299 6.227974e-02
## ptratio 0.35364572 0.115275321 3.067835 2.156153e-03
## tax -0.00863917 0.002757684 -3.132763 1.731693e-03
## rad 0.74319887 0.154146474 4.821381 1.425678e-06
## dis 0.67169851 0.216605206 3.101027 1.928510e-03
## age 0.02746097 0.011238231 2.443531 1.454431e-02
## nox 43.19170029 6.752210255 6.396676 1.587959e-10
## zn -0.07036110 0.032980585 -2.133410 3.289113e-02
Formulate the model built from the best GLM package in addition to removing non significant predictors
##
## Call:
## glm(formula = target ~ . - indus - chas - rm - bk_transform -
## black, family = binomial, data = crime_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9279 -0.1640 -0.0016 0.0027 3.3989
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -38.786168 6.161170 -6.295 3.07e-10 ***
## zn -0.072933 0.033022 -2.209 0.027199 *
## nox 43.014846 6.689945 6.430 1.28e-10 ***
## age 0.028379 0.011401 2.489 0.012809 *
## dis 0.658786 0.214787 3.067 0.002161 **
## rad 0.739415 0.152375 4.853 1.22e-06 ***
## tax -0.008045 0.002651 -3.035 0.002408 **
## ptratio 0.334196 0.111780 2.990 0.002792 **
## lstat 0.064814 0.048427 1.338 0.180769
## medv 0.138257 0.041630 3.321 0.000897 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 195.51 on 456 degrees of freedom
## AIC: 215.51
##
## Number of Fisher Scoring iterations: 9
## zn nox age dis rad tax ptratio lstat
## 1.817650 3.146723 1.840891 3.564748 1.681321 1.775775 1.871613 2.133702
## medv
## 2.975434
## target zn indus chas nox rm age dis rad tax ptratio black
## 1 1 22 5.86 0 0.431 8.259 8.4 8.9067 7 330 19.1 396.90
## 2 1 20 6.96 0 0.464 5.856 42.1 4.4290 3 223 18.6 388.65
## 3 1 0 10.59 0 0.489 5.412 9.8 3.5875 4 277 18.6 348.93
## lstat medv bk_transform .fitted .se.fit .resid .hat
## 1 3.54 42.8 2.217302 -0.6941428 0.9919713 1.482752 0.218595632
## 2 13.00 21.1 2.234060 -5.7732971 0.9456040 3.398942 0.002763196
## 3 29.55 23.7 2.322898 -2.9731063 1.3334338 2.458856 0.082302853
## .sigma .cooksd .std.resid index
## 1 0.6507833 0.07167244 1.677376 14
## 2 0.6357999 0.08935649 3.403647 338
## 3 0.6443780 0.19108200 2.566749 457
## target zn indus chas nox rm age dis rad tax ptratio black lstat
## 1 1 20 6.96 0 0.464 5.856 42.1 4.429 3 223 18.6 388.65 13
## medv bk_transform .fitted .se.fit .resid .hat .sigma
## 1 21.1 2.23406 -5.773297 0.945604 3.398942 0.002763196 0.6357999
## .cooksd .std.resid index
## 1 0.08935649 3.403647 338
IV Model Selection) We built a total of 5 models with different combinations of predictors. This section, we define the methodology used to select the best model to use on the evaluation data based on certain key metrics that we will define later on in this study.
For each model we produced an examination of cook’s distance, standard residuals, variance inflation numbers, and model summary. Based on a combination of these diagnostics, I decided to go with model 5. Model 5 as an AIC of 212.51, which is marginally different than the AIC produced by the other models. Model 5 is less complex and does not have any predictors with high p values.
Psuedo R square value for logisitc regression
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
## llh llhNull G2 McFadden r2ML
## -97.7573458 -322.9379132 450.3611349 0.6972875 0.6195651
## r2CU
## 0.8261680
McFadden’s R square value is used as the linear regression r square but for logistic regression instead. The McFadden’s statistic of 0.697 indicates that model 5 has decent predictive power on a 0 to 1 scale.
We will take this model and apply it to our testing data
Generate a confusion matrix
library(pROC)
## Warning: package 'pROC' was built under R version 3.4.4
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
library(caret)
names(classes.df)<-c("class", "scored.class", "scored.probability")
matrix.df2<-with(classes.df, table(scored.class, class)[2:1, 2:1])
matrix.df2
## class
## scored.class 1 0
## 1 62 7
## 0 6 64
Use the functionality of the caret package to compute relevant statistics from the confusion matrix
## Confusion Matrix and Statistics
##
## class
## scored.class 1 0
## 1 62 7
## 0 6 64
##
## Accuracy : 0.9065
## 95% CI : (0.8454, 0.9493)
## No Information Rate : 0.5108
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.8129
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9118
## Specificity : 0.9014
## Pos Pred Value : 0.8986
## Neg Pred Value : 0.9143
## Prevalence : 0.4892
## Detection Rate : 0.4460
## Detection Prevalence : 0.4964
## Balanced Accuracy : 0.9066
##
## 'Positive' Class : 1
##
## [1] 0.9051095
## Area under the curve: 0.966
Model 5 has a high accuracy and high F1 score. The AUC is 0.9585. We have more than enough evidence to conclude that model 5 is the best model to go with.
Interpretation) Now that the best model has been picked, what does the model actually mean in the sense of crime propensity?
##
## Call:
## glm(formula = target ~ . - indus - chas - rm - bk_transform -
## black, family = binomial, data = crime_training)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9279 -0.1640 -0.0016 0.0027 3.3989
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -38.786168 6.161170 -6.295 3.07e-10 ***
## zn -0.072933 0.033022 -2.209 0.027199 *
## nox 43.014846 6.689945 6.430 1.28e-10 ***
## age 0.028379 0.011401 2.489 0.012809 *
## dis 0.658786 0.214787 3.067 0.002161 **
## rad 0.739415 0.152375 4.853 1.22e-06 ***
## tax -0.008045 0.002651 -3.035 0.002408 **
## ptratio 0.334196 0.111780 2.990 0.002792 **
## lstat 0.064814 0.048427 1.338 0.180769
## medv 0.138257 0.041630 3.321 0.000897 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 645.88 on 465 degrees of freedom
## Residual deviance: 195.51 on 456 degrees of freedom
## AIC: 215.51
##
## Number of Fisher Scoring iterations: 9
We can also convert coefficients to odds ratios
## Odds ratio 2.5 % 97.5 %
## (Intercept) 1.430149e-17 8.146750e-23 2.510606e-12
## zn 9.296628e-01 8.713995e-01 9.918218e-01
## nox 4.798551e+18 9.696589e+12 2.374659e+24
## age 1.028785e+00 1.006050e+00 1.052033e+00
## dis 1.932445e+00 1.268469e+00 2.943979e+00
## rad 2.094711e+00 1.553896e+00 2.823750e+00
## tax 9.919877e-01 9.868469e-01 9.971552e-01
## ptratio 1.396817e+00 1.121999e+00 1.738947e+00
## lstat 1.066961e+00 9.703478e-01 1.173193e+00
## medv 1.148270e+00 1.058299e+00 1.245891e+00
The increase in logit score per unit increase is high relative to the other coefficients for medv, ptratio, and nox. We see that zn and tax have a negative effect on the crime rate being greater than the median.
Appendix:
crime_training <- read.csv(url("https://raw.githubusercontent.com/vindication09/DATA-621-Week-3/master/crime-training-data.csv"), header =TRUE)
crime_evaluation <- read.csv(url("https://raw.githubusercontent.com/vindication09/DATA-621-Week-3/master/crime-evaluation-data.csv"), header =TRUE)
head(crime_training, 10)
nrow(crime_training);ncol(crime_training);names(crime_training)
table(crime_training$target);
barplot(table(crime_training$target), ylim=c(0, 500), xlab="Result", ylab="N", col="black",
main="Distribution of Target(Response")
table(crime_training$chas);
barplot(table(crime_training$chas), ylim=c(0, 500), xlab="Result", ylab="N", col="black",
main="Distribution of Chas")
#table(crime_training$age);
barplot(table(crime_training$age), ylim=c(0, 50), xlab="Result", ylab="N", col="black",
main="Distribution of Age")
#table(crime_training$age);
barplot(table(crime_training$zn), ylim=c(0, 150), xlab="Result", ylab="N", col="black",
main="Distribution of Zn")
#table(crime_training$age);
barplot(table(crime_training$indus), ylim=c(0, 150), xlab="Result", ylab="N", col="black",
main="Distribution of Indus")
library(ggplot2)
library(tidyr)
crime_training_plot_subset <- subset(crime_training, select = -c(chas, target, age, zn,indus))
ggplot(gather(crime_training_plot_subset), aes(value)) +
geom_histogram(bins = 10) +
facet_wrap(~key, scales = 'free_x');summary(crime_training_plot_subset)
apply(crime_training,2, function(col)cor(col, crime_training$target))
#correlation matrix and visualization
correlation_matrix <- round(cor(crime_training),2)
# Get lower triangle of the correlation matrix
get_lower_tri<-function(correlation_matrix){
correlation_matrix[upper.tri(correlation_matrix)] <- NA
return(correlation_matrix)
}
# Get upper triangle of the correlation matrix
get_upper_tri <- function(correlation_matrix){
correlation_matrix[lower.tri(correlation_matrix)]<- NA
return(correlation_matrix)
}
upper_tri <- get_upper_tri(correlation_matrix)
library(reshape2)
# Melt the correlation matrix
melted_correlation_matrix <- melt(upper_tri, na.rm = TRUE)
# Heatmap
library(ggplot2)
ggheatmap <- ggplot(data = melted_correlation_matrix, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 15, hjust = 1))+
coord_fixed()
#add nice labels
ggheatmap +
geom_text(aes(Var2, Var1, label = value), color = "black", size = 3) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x=element_text(size=rel(0.8), angle=90),
axis.text.y=element_text(size=rel(0.8)),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank(),
legend.justification = c(1, 0),
legend.position = c(0.6, 0.7),
legend.direction = "horizontal")+
guides(fill = guide_colorbar(barwimoneyball_training2h = 7, barheight = 1,
title.position = "top", title.hjust = 0.5))
#missing data
colSums(is.na(crime_training));colSums(is.na(crime_evaluation))
crime_training$bk_transform=sqrt((1000/crime_training$black))+0.63
crime_evaluation$bk_transform=sqrt((1000/crime_evaluation$black))+0.63
head(crime_training, 1);summary(crime_training$bk_transform)
library(tidyverse)
library(dplyr)
library(broom)
library(car)
mod1<-glm(target~.-bk_transform, family=binomial, data=crime_training)
summary(mod1);
plot(mod1, which = 4, id.n = 3);
mod1.data <- augment(mod1) %>%
mutate(index = 1:n()) ;
car::vif(mod1)
mod1.data %>% top_n(3, .cooksd);
ggplot(mod1.data, aes(index, .std.resid)) +
geom_point(aes(color = target), alpha = .5) +
theme_bw();
mod1.data %>%
filter(abs(.std.resid) > 3)
mod2<-glm(target~.-black, family=binomial, data=crime_training)
summary(mod2);
plot(mod2, which = 4, id.n = 3);
mod2.data <- augment(mod2) %>%
mutate(index = 1:n()) ;
car::vif(mod2)
mod2.data %>% top_n(3, .cooksd);
ggplot(mod2.data, aes(index, .std.resid)) +
geom_point(aes(color = target), alpha = .5) +
theme_bw();
mod2.data %>%
filter(abs(.std.resid) > 3)
mod3<-glm(target~.-bk_transform-chas, family=binomial, data=crime_training)
summary(mod3);
plot(mod3, which = 4, id.n = 3);
mod3.data <- augment(mod2) %>%
mutate(index = 1:n()) ;
car::vif(mod3)
mod3.data %>% top_n(3, .cooksd);
ggplot(mod3.data, aes(index, .std.resid)) +
geom_point(aes(color = target), alpha = .5) +
theme_bw();
mod3.data %>%
filter(abs(.std.resid) > 3)
library(MASS)
step<-stepAIC(mod3, trace=FALSE)
step$anova
mod4<-glm(target ~ zn + nox + age + dis + rad + tax + ptratio + black + lstat + medv, family=binomial, data=crime_training)
summary(mod4);
plot(mod4, which = 4, id.n = 3);
mod4.data <- augment(mod4) %>%
mutate(index = 1:n()) ;
car::vif(mod4)
mod4.data %>% top_n(3, .cooksd);
ggplot(mod4.data, aes(index, .std.resid)) +
geom_point(aes(color = target), alpha = .5) +
theme_bw();
mod4.data %>%
filter(abs(.std.resid) > 3)
Xy<-crime_training_switch<-subset(crime_training, select=c(medv, lstat, black, ptratio, tax, rad, dis, age, rm, nox, chas, indus, zn , target))
library(leaps)
library(bestglm)
best_mod<-bestglm(Xy, IC="AIC", family=binomial)
best_mod
mod5<-glm(target~.-indus-chas-rm-bk_transform-black, family=binomial, data=crime_training)
summary(mod5);
plot(mod5, which = 4, id.n = 3);
mod5.data <- augment(mod5) %>%
mutate(index = 1:n()) ;
car::vif(mod5)
mod5.data %>% top_n(3, .cooksd);
ggplot(mod5.data, aes(index, .std.resid)) +
geom_point(aes(color = target), alpha = .5) +
theme_bw();
mod5.data %>%
filter(abs(.std.resid) > 3)
library(pscl)
pR2(mod5)
#The evaluation data does not contain the target response variable. To test predictive power, we need to partition our training data into a test and evaluation
library(caret)
Train <- createDataPartition(crime_training$target, p=0.7, list=FALSE)
train <- crime_training[Train, ]
test <- crime_training[-Train, ]
#probabilities of prediction
pred <- predict(mod5, newdata=test, type="response", na.action = na.pass)
scored.probability<-data.frame(as.double(pred))
#probability threshold
pred_filter <- ifelse(pred > 0.5, 1, 0)
scored.class<-data.frame(as.integer(pred_filter))
#subset actuals
class<-data.frame(subset(test, select=c(target)))
class<-as.integer(class$target)
#library(plyr)
classes.df<-data.frame(class, scored.class, scored.probability)
str(classes.df)
names(classes.df)
library(pROC)
library(caret)
names(classes.df)<-c("class", "scored.class", "scored.probability")
matrix.df2<-with(classes.df, table(scored.class, class)[2:1, 2:1])
matrix.df2
#confusion matrix
caret_matrix <- confusionMatrix(matrix.df2)
#Information from Confusion matrix
caret_matrix;
f1_reduced<- function(df)
{
TP <- sum(classes.df$class == 1 & classes.df$scored.class == 1)
Fn <- sum(classes.df$class == 1 & classes.df$scored.class == 0)
FP <- sum(classes.df$class == 0 & classes.df$scored.class == 1)
(2*TP)/(2*TP + Fn + FP)
}
f1_reduced(classes.df);
plot(roc(classes.df$class, classes.df$scored.probability), main="ROC Curve");
auc(roc(classes.df$class, classes.df$scored.probability))
summary(mod5)
#exp(coefficients(mod5))
exp(cbind("Odds ratio" = coef(mod5), confint.default(mod5, level = 0.95)))
crime_training <- read.csv(url("https://raw.githubusercontent.com/vindication09/DATA-621-Week-3/master/crime-training-data.csv"), header =TRUE)
crime_evaluation <- read.csv(url("https://raw.githubusercontent.com/vindication09/DATA-621-Week-3/master/crime-evaluation-data.csv"), header =TRUE)
head(crime_training, 10)
nrow(crime_training);ncol(crime_training);names(crime_training)
table(crime_training$target);
barplot(table(crime_training$target), ylim=c(0, 500), xlab="Result", ylab="N", col="black",
main="Distribution of Target(Response")
table(crime_training$chas);
barplot(table(crime_training$chas), ylim=c(0, 500), xlab="Result", ylab="N", col="black",
main="Distribution of Chas")
#table(crime_training$age);
barplot(table(crime_training$age), ylim=c(0, 50), xlab="Result", ylab="N", col="black",
main="Distribution of Age")
#table(crime_training$age);
barplot(table(crime_training$zn), ylim=c(0, 150), xlab="Result", ylab="N", col="black",
main="Distribution of Zn")
#table(crime_training$age);
barplot(table(crime_training$indus), ylim=c(0, 150), xlab="Result", ylab="N", col="black",
main="Distribution of Indus")
library(ggplot2)
library(tidyr)
crime_training_plot_subset <- subset(crime_training, select = -c(chas, target, age, zn,indus))
ggplot(gather(crime_training_plot_subset), aes(value)) +
geom_histogram(bins = 10) +
facet_wrap(~key, scales = 'free_x');summary(crime_training_plot_subset)
apply(crime_training,2, function(col)cor(col, crime_training$target))
#correlation matrix and visualization
correlation_matrix <- round(cor(crime_training),2)
# Get lower triangle of the correlation matrix
get_lower_tri<-function(correlation_matrix){
correlation_matrix[upper.tri(correlation_matrix)] <- NA
return(correlation_matrix)
}
# Get upper triangle of the correlation matrix
get_upper_tri <- function(correlation_matrix){
correlation_matrix[lower.tri(correlation_matrix)]<- NA
return(correlation_matrix)
}
upper_tri <- get_upper_tri(correlation_matrix)
library(reshape2)
# Melt the correlation matrix
melted_correlation_matrix <- melt(upper_tri, na.rm = TRUE)
# Heatmap
library(ggplot2)
ggheatmap <- ggplot(data = melted_correlation_matrix, aes(Var2, Var1, fill = value))+
geom_tile(color = "white")+
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0, limit = c(-1,1), space = "Lab",
name="Pearson\nCorrelation") +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, vjust = 1,
size = 15, hjust = 1))+
coord_fixed()
#add nice labels
ggheatmap +
geom_text(aes(Var2, Var1, label = value), color = "black", size = 3) +
theme(
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.x=element_text(size=rel(0.8), angle=90),
axis.text.y=element_text(size=rel(0.8)),
panel.grid.major = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
axis.ticks = element_blank(),
legend.justification = c(1, 0),
legend.position = c(0.6, 0.7),
legend.direction = "horizontal")+
guides(fill = guide_colorbar(barwimoneyball_training2h = 7, barheight = 1,
title.position = "top", title.hjust = 0.5))
colSums(is.na(crime_training));colSums(is.na(crime_evaluation))
crime_training$bk_transform=sqrt((1000/crime_training$black))+0.63
crime_evaluation$bk_transform=sqrt((1000/crime_evaluation$black))+0.63
head(crime_training, 1);summary(crime_training$bk_transform)
library(tidyverse)
library(dplyr)
library(broom)
library(car)
mod1<-glm(target~.-bk_transform, family=binomial, data=crime_training)
summary(mod1);
plot(mod1, which = 4, id.n = 3);
mod1.data <- augment(mod1) %>%
mutate(index = 1:n()) ;
car::vif(mod1)
mod1.data %>% top_n(3, .cooksd);
ggplot(mod1.data, aes(index, .std.resid)) +
geom_point(aes(color = target), alpha = .5) +
theme_bw();
mod1.data %>%
filter(abs(.std.resid) > 3)
mod2<-glm(target~.-black, family=binomial, data=crime_training)
summary(mod2);
plot(mod2, which = 4, id.n = 3);
mod2.data <- augment(mod2) %>%
mutate(index = 1:n()) ;
car::vif(mod2)
mod2.data %>% top_n(3, .cooksd);
ggplot(mod2.data, aes(index, .std.resid)) +
geom_point(aes(color = target), alpha = .5) +
theme_bw();
mod2.data %>%
filter(abs(.std.resid) > 3)
mod3<-glm(target~.-bk_transform-chas, family=binomial, data=crime_training)
summary(mod3);
plot(mod3, which = 4, id.n = 3);
mod3.data <- augment(mod2) %>%
mutate(index = 1:n()) ;
car::vif(mod3)
mod3.data %>% top_n(3, .cooksd);
ggplot(mod3.data, aes(index, .std.resid)) +
geom_point(aes(color = target), alpha = .5) +
theme_bw();
mod3.data %>%
filter(abs(.std.resid) > 3)
library(MASS)
step<-stepAIC(mod3, trace=FALSE)
step$anova
mod4<-glm(target ~ zn + nox + age + dis + rad + tax + ptratio + black + lstat + medv, family=binomial, data=crime_training)
summary(mod4);
plot(mod4, which = 4, id.n = 3);
mod4.data <- augment(mod4) %>%
mutate(index = 1:n()) ;
car::vif(mod4)
mod4.data %>% top_n(3, .cooksd);
ggplot(mod4.data, aes(index, .std.resid)) +
geom_point(aes(color = target), alpha = .5) +
theme_bw();
mod4.data %>%
filter(abs(.std.resid) > 3)
Xy<-crime_training_switch<-subset(crime_training, select=c(medv, lstat, black, ptratio, tax, rad, dis, age, rm, nox, chas, indus, zn , target))
library(leaps)
library(bestglm)
best_mod<-bestglm(Xy, IC="AIC", family=binomial)
best_mod
mod5<-glm(target~.-indus-chas-rm-bk_transform-black, family=binomial, data=crime_training)
summary(mod5);
plot(mod5, which = 4, id.n = 3);
mod5.data <- augment(mod5) %>%
mutate(index = 1:n()) ;
car::vif(mod5)
mod5.data %>% top_n(3, .cooksd);
ggplot(mod5.data, aes(index, .std.resid)) +
geom_point(aes(color = target), alpha = .5) +
theme_bw();
mod5.data %>%
filter(abs(.std.resid) > 3)
library(pscl)
pR2(mod5)
#The evaluation data does not contain the target response variable. To test predictive power, we need to partition our training data into a test and evaluation
Train <- createDataPartition(crime_training$target, p=0.7, list=FALSE)
train <- crime_training[Train, ]
test <- crime_training[-Train, ]
#probabilities of prediction
pred <- predict(mod5, newdata=test, type="response", na.action = na.pass)
scored.probability<-data.frame(as.double(pred))
#probability threshold
pred_filter <- ifelse(pred > 0.5, 1, 0)
scored.class<-data.frame(as.integer(pred_filter))
#subset actuals
class<-data.frame(subset(test, select=c(target)))
class<-as.integer(class$target)
#library(plyr)
classes.df<-data.frame(class, scored.class, scored.probability)
str(classes.df)
names(classes.df)
library(pROC)
library(caret)
names(classes.df)<-c("class", "scored.class", "scored.probability")
matrix.df2<-with(classes.df, table(scored.class, class)[2:1, 2:1])
matrix.df2
#confusion matrix
caret_matrix <- confusionMatrix(matrix.df2)
#Information from Confusion matrix
caret_matrix;
f1_reduced<- function(df)
{
TP <- sum(classes.df$class == 1 & classes.df$scored.class == 1)
Fn <- sum(classes.df$class == 1 & classes.df$scored.class == 0)
FP <- sum(classes.df$class == 0 & classes.df$scored.class == 1)
(2*TP)/(2*TP + Fn + FP)
}
f1_reduced(classes.df);
plot(roc(classes.df$class, classes.df$scored.probability), main="ROC Curve");
auc(roc(classes.df$class, classes.df$scored.probability))
exp(cbind("Odds ratio" = coef(mod5), confint.default(mod5, level = 0.95)))