WOE (Weight of Evidence) and IV (Information Value)

WOE and IV are used for select categorical variables (and sometimes non-categorical too) in logistical regression. It is not the most common method, nevertheless the method could be useful, if we have many categorical variables and we want to decide which ones are the better fit for our logistical model.

How WOE and IV are calculated. Theory.

My source is http://support.sas.com/documentation/cdl/en/prochp/66704/HTML/default/viewer.htm#prochp_hpbin_details02.htm.

If \(X\) is a categorical independent variable, \(Y\) is a binary dependent variable, where \(Y=1\) is an event and \(Y=0\) is a non-event, then:

\(WOE_x= ln(\frac {p^{non-vent}_{attribute}} {p^{event}_{attribute}})= ln \left (\frac { \frac { {N^{non-event}_{attribute}}} {N^{total}_{attribute} }} {\frac{ {N^{event}_{attribute}}} {N^{total}_{attribute}} } \right)\),

where \(p\) - probability of \(Y\). And \(N\) number of \(Y=1\), \(Y=0\), or total.

Sometimes \(p\) will be 0, and we do not want it, as \(ln\) of 0 is undefined. To address it, we can add \(x\) (commonly \(x=0.5\)) to our calulations:

\(WOE_x= ln \left (\frac { \frac { {N^{non-event}_{attribute}+x}} {N^{total}_{attribute} }} {\frac{ {N^{event}_{attribute}+x}} {N^{total}_{attribute}} } \right)\)

IV is our final results, that should help us to decide if we should use \(X\) in our logistical model or not.

\(IV= \sum_{i=1}^m \left (\frac {N^{non-event}_{attribute}+x} {N^{total}_{attribute}} - \frac {N^{event}_{attribute}+x} {N^{total}_{attribute}} \right)*WOE_i\)

Generally speaking the higher IV indicates a better predictive power of \(X\).

Practical Example. HW4.

I will use HW4 data to demonstrate how to apply WOE and IV.

library(kableExtra)

library(dplyr)

library(tidyr)

library(stringr)
library(purrr)

library(glmnet)

HW4<-read.csv("https://raw.githubusercontent.com/simplymathematics/621/master/HW4/insurance_training_data.csv")

head(HW4)

##   INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ    INCOME PARENT1
## 1     1           0          0        0  60        0  11  $67,349       No
## 2     2           0          0        0  43        0  11  $91,449       No
## 3     4           0          0        0  35        1  10  $16,039       No
## 4     5           0          0        0  51        0  14                No
## 5     6           0          0        0  50        0  NA $114,986       No
## 6     7           1       2946        0  34        1  12 $125,301      Yes
##    HOME_VAL MSTATUS SEX     EDUCATION           JOB TRAVTIME    CAR_USE
## 1       $0     z_No   M           PhD  Professional       14    Private
## 2 $257,252     z_No   M z_High School z_Blue Collar       22 Commercial
## 3 $124,191      Yes z_F z_High School      Clerical        5    Private
## 4 $306,251      Yes   M  <High School z_Blue Collar       32    Private
## 5 $243,925      Yes z_F           PhD        Doctor       36    Private
## 6       $0     z_No z_F     Bachelors z_Blue Collar       46 Commercial
##   BLUEBOOK TIF   CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## 1 $14,230   11    Minivan     yes  $4,461         2      No       3
## 2 $14,940    1    Minivan     yes      $0         0      No       0
## 3  $4,010    4      z_SUV      no $38,690         2      No       3
## 4 $15,440    7    Minivan     yes      $0         0      No       0
## 5 $18,000    1      z_SUV      no $19,217         2     Yes       3
## 6 $17,430    1 Sports Car      no      $0         0      No       0
##   CAR_AGE          URBANICITY
## 1      18 Highly Urban/ Urban
## 2       1 Highly Urban/ Urban
## 3      10 Highly Urban/ Urban
## 4       6 Highly Urban/ Urban
## 5      17 Highly Urban/ Urban
## 6       7 Highly Urban/ Urban

We have the following variables of interest:

Education
Job
Car_Type

Let’s create tables for them:

HW4$JOB<-as.character(HW4$JOB)

HW4$JOB[HW4$JOB=='']<-'Missing'

HW4$OLDCLAIM <- as.numeric(gsub('[$,]', '', HW4$OLDCLAIM))

HW4$BLUEBOOK <- as.numeric(gsub('[$,]', '', HW4$BLUEBOOK))

HW4$INCOME <- as.numeric(gsub('[$,]', '', HW4$INCOME))

HW4$INCOME[is.na(HW4$INCOME)]<-mean(HW4$INCOME,na.rm=TRUE)

HW4$HOME_VAL <- as.numeric(gsub('[$,]', '', HW4$HOME_VAL))

HW4$HOME_VAL[is.na(HW4$HOME_VAL)]<-mean(HW4$HOME_VAL,na.rm=TRUE)
#cor(HW4$TARGET_FLAG,HW4$HOME_VAL)
HW4$AGE[is.na(HW4$AGE)]<-mean(HW4$AGE,na.rm=TRUE)

#HW4%>%filter(is.na(YOJ))%>%arrange(JOB)

HW4$YOJ[is.na(HW4$YOJ)]<-mean(HW4$YOJ,na.rm=TRUE)

HW4$CAR_AGE[is.na(HW4$CAR_AGE)]<-mean(HW4$CAR_AGE,na.rm=TRUE)

kable(round(prop.table(table(HW4$TARGET_FLAG,HW4$EDUCATION),1),2))

	<High School	Bachelors	Masters	PhD	z_High School
0	0.14	0.29	0.22	0.10	0.26
1	0.18	0.24	0.15	0.06	0.37

kable(round(prop.table(table(HW4$TARGET_FLAG, HW4$JOB),1),2))

	Clerical	Doctor	Home Maker	Lawyer	Manager	Missing	Professional	Student	z_Blue Collar
0	0.15	0.04	0.08	0.11	0.14	0.06	0.14	0.07	0.20
1	0.17	0.01	0.08	0.07	0.06	0.06	0.11	0.12	0.29

kable(round(prop.table(table(HW4$TARGET_FLAG,HW4$CAR_TYPE),1),2))

	Minivan	Panel Truck	Pickup	Sports Car	Van	z_SUV
0	0.30	0.08	0.16	0.10	0.09	0.27
1	0.16	0.08	0.21	0.14	0.09	0.31

#kable(summary(HW4))

#HW4%>%select_if(is.numeric)%>%cor()

From tables we can see that Education seems to have predictive power. Even though, less than High School and High School could be combined as well as Masters and PhD. Job and Car Type variables seem to have predictive power too. We might be able to combine Van And Truck.

Let’s do education

woe_ed<-as.data.frame(prop.table(table(HW4$TARGET_FLAG,HW4$EDUCATION),1))

colnames(woe_ed)<-c("flag","educ","value")

twoe_ed<-woe_ed%>%spread(flag,value)

#twoe_ed

colnames(twoe_ed)<-c("educ","ne","e")

twoe_ed$woe<-log(twoe_ed$ne/twoe_ed$e)

twoe_ed$woe1<-(twoe_ed$ne-twoe_ed$e)*twoe_ed$woe

#twoe_ed

paste0('IV=',round(sum(twoe_ed$woe1),2))

## [1] "IV=0.11"

Actually Education has only medium predictive power, IV is only 0.11.

Information Value Predictive Power < 0.02 useless for prediction 0.02 to 0.1 Weak predictor 0.1 to 0.3 Medium predictor 0.3 to 0.5 Strong predictor >0.5 Suspicious or too good to be true

Source: http://ucanalytics.com/blogs/information-value-and-weight-of-evidencebanking-case/

woe_j<-as.data.frame(prop.table(table(HW4$TARGET_FLAG,HW4$JOB),1))

colnames(woe_j)<-c("flag","j","value")

twoe_j<-woe_j%>%spread(flag,value)

#twoe_j

colnames(twoe_j)<-c("j","ne","e")

twoe_j$woe<-log(twoe_j$ne/twoe_j$e)

twoe_j$woe1<-(twoe_j$ne-twoe_j$e)*twoe_j$woe

#twoe_j

paste0('IV=',round(sum(twoe_j$woe1),2))

## [1] "IV=0.18"

Jobs have medium predictive power too.

woe_c<-as.data.frame(prop.table(table(HW4$TARGET_FLAG,HW4$CAR_TYPE),1))

colnames(woe_c)<-c("flag","c","value")

twoe_c<-woe_c%>%spread(flag,value)

colnames(twoe_c)<-c("c","ne","e")

twoe_c$woe<-log(twoe_c$ne/twoe_c$e)

twoe_c$woe1<-(twoe_c$ne-twoe_c$e)*twoe_c$woe

#twoe_c

paste0('IV=',round(sum(twoe_c$woe1),2))

## [1] "IV=0.12"

Car Type has a medium predictive power.

None of our variables has strong predictive power.

We can also convert some of numerical variables to a factor and see if they are useful as predictors.

Such as KIDSDRIV.

#cor(HW4$TARGET_FLAG,HW4$KIDSDRIV)

HW4$KIDSDRIVf<-as.factor(HW4$KIDSDRIV)

library(tidyr)

woe_k<-as.data.frame(prop.table(table(HW4$TARGET_FLAG,HW4$KIDSDRIVf),1))

colnames(woe_k)<-c("flag","k","value")

twoe_k<-woe_k%>%spread(flag,value)

colnames(twoe_k)<-c("k","ne","e")

twoe_k$woe<-log(twoe_k$ne/twoe_k$e)

twoe_k$woe1<-(twoe_k$ne-twoe_k$e)*twoe_k$woe

paste0('IV=',round(sum(twoe_k$woe1),2))

## [1] "IV=0.05"

Number of kids driving does not have much predictive power.

Let’s look at HOMEKIDSf.

#cor(HW4$TARGET_FLAG,HW4$HOMEKIDS)

HW4$HOMEKIDSf<-as.factor(HW4$HOMEKIDS)

woe_h<-as.data.frame(prop.table(table(HW4$TARGET_FLAG,HW4$HOMEKIDSf),1))

colnames(woe_h)<-c("flag","h","value")

twoe_h<-woe_h%>%spread(flag,value)

colnames(twoe_h)<-c("h","ne","e")

twoe_h$woe<-log(twoe_h$ne/twoe_h$e)

twoe_h$woe1<-(twoe_h$ne-twoe_h$e)*twoe_h$woe

#twoe_h

paste0('IV=',round(sum(twoe_h$woe1),2))

## [1] "IV=0.08"

Children at home only have a weak predictive power.

R Function to Calculate WOE.

Function “woe”.

R function “woe” calculates IV for us. Let’s run it.

#install.packages("klaR")

library(klaR)

new<-data.frame(HW4)
new$JOB<-as.factor(new$JOB)
new$YOJ<-as.factor(new$YOJ)

new$TIF<-as.factor(new$TIF)
new$INCOME<-as.factor(new$INCOME)
new$BLUE_INCOME<-as.factor(new$BLUEBOOK)
new$HOME_VAL<-as.factor(new$HOME_VAL)
new$TRAVTIME<-as.factor(new$TRAVTIME)
new$AGEf<-as.factor(new$AGE)
new$OLDCLAIMf<-as.factor(new$OLDCLAIM)
new$CLM_FREQ<-as.factor(new$CLM_FREQ)
new$MVR_PTS<-as.factor(new$MVR_PTS)
new$CAR_AGE<-as.factor(new$CAR_AGE)
new$TARGET_FLAG<-as.factor(new$TARGET_FLAG)
woemodel <- woe(TARGET_FLAG~., data = new, zeroadj=0.5, applyontrain = TRUE)

## At least one empty cell (class x level) does exists. Zero adjustment applied!
## At least one empty cell (class x level) does exists. Zero adjustment applied!
## At least one empty cell (class x level) does exists. Zero adjustment applied!
## At least one empty cell (class x level) does exists. Zero adjustment applied!
## At least one empty cell (class x level) does exists. Zero adjustment applied!
## At least one empty cell (class x level) does exists. Zero adjustment applied!
## At least one empty cell (class x level) does exists. Zero adjustment applied!
## At least one empty cell (class x level) does exists. Zero adjustment applied!
## At least one empty cell (class x level) does exists. Zero adjustment applied!
## At least one empty cell (class x level) does exists. Zero adjustment applied!

woemodel

## Information values of transformed variables: 
## 
##                       IV
## OLDCLAIMf   0.7175059273
## BLUE_INCOME 0.7097344647
## INCOME      0.4096604413
## URBANICITY  0.3739130414
## HOME_VAL    0.3050147636
## CLM_FREQ    0.2967500330
## MVR_PTS     0.2436927720
## AGEf        0.1908587640
## JOB         0.1786496970
## CAR_TYPE    0.1178921685
## PARENT1     0.1134890318
## EDUCATION   0.1091581941
## REVOKED     0.1051712794
## CAR_USE     0.1022645836
## MSTATUS     0.0925496268
## HOMEKIDSf   0.0841196372
## CAR_AGE     0.0683451826
## TRAVTIME    0.0655595072
## TIF         0.0530542121
## YOJ         0.0529557779
## KIDSDRIVf   0.0527743301
## SEX         0.0022921661
## RED_CAR     0.0002493732

The function shows that if we convert all variables as factors, then OLDCLAIM has the most predictive power.

Function “rbin”.

Let’s look at another R functionv- “rbin”. This function allows us to allocate variables into bins.

library(rbin)

bins <- rbin_manual(new, TARGET_FLAG, AGE, c(34, 38, 41, 43, 45, 47, 49, 52,56))

bins

## Binning Summary
## ---------------------------
## Method               Manual 
## Response             TARGET_FLAG 
## Predictor            AGE 
## Bins                 10 
## Count                8161 
## Goods                2153 
## Bads                 6008 
## Entropy              0.82 
## Information Value    0.12 
## 
## 
## # A tibble: 10 x 7
##    cut_point bin_count  good   bad     woe       iv entropy
##    <chr>         <int> <int> <int>   <dbl>    <dbl>   <dbl>
##  1 < 34            813   328   485 -0.635  0.0455     0.973
##  2 < 38            829   269   560 -0.293  0.00930    0.909
##  3 < 41            904   267   637 -0.157  0.00282    0.876
##  4 < 43            669   185   484 -0.0645 0.000346   0.851
##  5 < 45            693   170   523  0.0976 0.000789   0.804
##  6 < 47            777   173   604  0.224  0.00452    0.765
##  7 < 49            718   147   571  0.331  0.00885    0.731
##  8 < 52            958   168   790  0.522  0.0279     0.670
##  9 < 56            942   183   759  0.396  0.0164     0.710
## 10 >= 56           858   263   595 -0.210  0.00485    0.889

#rbin_create(new, AGE, bins)

bins1 <- rbin_manual(new, TARGET_FLAG, OLDCLAIM, c(1, 2000, 3500, 4500, 5800, 7000, 8500, 12000, 30000))

bins1

## Binning Summary
## ---------------------------
## Method               Manual 
## Response             TARGET_FLAG 
## Predictor            OLDCLAIM 
## Bins                 10 
## Count                8161 
## Goods                2153 
## Bads                 6008 
## Entropy              0.79 
## Information Value    0.3 
## 
## 
## # A tibble: 10 x 7
##    cut_point bin_count  good   bad    woe     iv entropy
##    <chr>         <int> <int> <int>  <dbl>  <dbl>   <dbl>
##  1 < 1            5009   898  4111  0.495 0.132    0.678
##  2 < 2000          365   138   227 -0.529 0.0139   0.957
##  3 < 3500          370   153   217 -0.677 0.0236   0.978
##  4 < 4500          322   139   183 -0.751 0.0256   0.986
##  5 < 5800          438   163   275 -0.503 0.0151   0.952
##  6 < 7000          364   138   226 -0.533 0.0141   0.957
##  7 < 8500          338   137   201 -0.643 0.0194   0.974
##  8 < 12000         302   131   171 -0.760 0.0246   0.987
##  9 < 30000         311   122   189 -0.589 0.0148   0.966
## 10 >= 30000        342   134   208 -0.587 0.0162   0.966

#rbin_create(new, OLDCLAIM, bins1)

Using LASSO for Comparison.

Let’s use LASSO to compare LASSO’s results to the variables selected by WOE/IV.

ldata<-HW4
ldata$JOB<-as.factor(ldata$JOB)
ldata<-rbin_create(ldata, AGE, bins)
ldata<-rbin_create(ldata, OLDCLAIM, bins1)
vars_nameh<-ldata%>%select_if(is.factor)%>%colnames()%>%str_c(collapse = "+")
vars_nameh1<-ldata[,-c(1:3)]%>%select_if(negate(is.factor))

model_stringh <- paste("TARGET_FLAG  ~",vars_nameh )
x_trainh <- model.matrix(as.formula(model_stringh), HW4)
xh<- as.matrix(x_trainh)

x1h       <- as.matrix(vars_nameh1)
x2h<-cbind(xh,x1h)

#head(x2h)

#dim(x2h)

dXh<- apply(x2h, 2, function(y) (y - mean(y)) / sd(y) ^ as.logical(sd(y)))

#dim(dXh)

#head(dXh)

dX1h<-cbind(dXh,ldata[,c(2)])
#head(dX1h)
#typeof(dX1h)
trainh <- as.data.frame(dX1h)
#dim(trainh)
#colnames(trainh)
colnames(trainh)[66] <- "TARGET_FLAG"

dYh<-trainh[,66]

#typeof(trainh)

#class(trainh$TARGET_FLAG)
#trainh$KIDSDRIV<-NULL
xh <- model.matrix(TARGET_FLAG ~ .+poly(INCOME,3)+poly(OLDCLAIM,3)+poly(HOME_VAL,3)+poly(BLUEBOOK,3), data = trainh)
#head(trainh)


cv.lasso<- cv.glmnet(x=xh, y=dYh, family = "binomial", alpha = 1)

plot(cv.lasso)

#cv.lasso$lambda.min

cv.lasso$lambda.1se

## [1] 0.007304067

#coef(cv.lasso, s=cv.lasso$lambda.min)

coef(cv.lasso, s=cv.lasso$lambda.1se)

## 79 x 1 sparse Matrix of class "dgCMatrix"
##                                               1
## (Intercept)                       -1.3104377786
## (Intercept)                        .           
## `(Intercept)`                      .           
## PARENT1Yes                         0.1019857547
## MSTATUSz_No                        0.1862842126
## SEXz_F                             .           
## EDUCATIONBachelors                -0.0255323682
## EDUCATIONMasters                  -0.0012823853
## EDUCATIONPhD                       .           
## `EDUCATIONz_High School`           0.0635828701
## JOBDoctor                         -0.0383850784
## `JOBHome Maker`                    .           
## JOBLawyer                          .           
## JOBManager                        -0.1840282732
## JOBMissing                         .           
## JOBProfessional                    .           
## JOBStudent                         .           
## `JOBz_Blue Collar`                 0.0310322427
## CAR_USEPrivate                    -0.3322171711
## `CAR_TYPEPanel Truck`              .           
## CAR_TYPEPickup                     0.0400098915
## `CAR_TYPESports Car`               0.1282677628
## CAR_TYPEVan                        0.0290287571
## CAR_TYPEz_SUV                      0.1194391451
## RED_CARyes                         .           
## REVOKEDYes                         0.2318882048
## `URBANICITYz_Highly Rural/ Rural` -0.7765100430
## KIDSDRIVf1                         0.0340810468
## KIDSDRIVf2                         .           
## KIDSDRIVf3                         .           
## KIDSDRIVf4                         .           
## HOMEKIDSf1                         0.0108753957
## HOMEKIDSf2                         .           
## HOMEKIDSf3                         .           
## HOMEKIDSf4                         .           
## HOMEKIDSf5                         .           
## KIDSDRIV                           0.1548089487
## AGE                               -0.1104398475
## HOMEKIDS                           .           
## YOJ                                .           
## INCOME                            -0.1725923945
## HOME_VAL                          -0.1569225832
## TRAVTIME                           0.1613351379
## BLUEBOOK                          -0.1158974439
## TIF                               -0.1587347161
## OLDCLAIM                           .           
## CLM_FREQ                           0.1057836448
## MVR_PTS                            0.2002015822
## CAR_AGE                           -0.0539490387
## `AGE_<_38`                         .           
## `AGE_<_41`                         .           
## `AGE_<_43`                         .           
## `AGE_<_45`                         .           
## `AGE_<_47`                        -0.0032083445
## `AGE_<_49`                         .           
## `AGE_<_52`                        -0.0346096519
## `AGE_<_56`                         .           
## `AGE_>=_56`                        0.1706288966
## `OLDCLAIM_<_2000`                  .           
## `OLDCLAIM_<_3500`                  0.0008731253
## `OLDCLAIM_<_4500`                 -0.0153720195
## `OLDCLAIM_<_5800`                  0.0098188586
## `OLDCLAIM_<_7000`                  0.0148232776
## `OLDCLAIM_<_8500`                  .           
## `OLDCLAIM_<_12000`                 .           
## `OLDCLAIM_<_30000`                 .           
## `OLDCLAIM_>=_30000`                .           
## poly(INCOME, 3)1                  -1.0211554255
## poly(INCOME, 3)2                   4.7347222698
## poly(INCOME, 3)3                   .           
## poly(OLDCLAIM, 3)1                 .           
## poly(OLDCLAIM, 3)2                -7.7594224933
## poly(OLDCLAIM, 3)3                 3.9617917508
## poly(HOME_VAL, 3)1                -0.0011403323
## poly(HOME_VAL, 3)2                 0.1789111559
## poly(HOME_VAL, 3)3                 .           
## poly(BLUEBOOK, 3)1                -0.9498716995
## poly(BLUEBOOK, 3)2                 5.9877334624
## poly(BLUEBOOK, 3)3                 .

LASSO choose OLDCLAIM polynomial as the strongest predictor, which is possibly better fit than binnig OLDCLAIM.

Creating a Model

Let’s try to fit our variables into model and test the model.

mdata<-HW4

mdata$linc<-log(HW4$INCOME+0.01)

bins <- rbin_manual(mdata, TARGET_FLAG, AGE, c(34, 38, 41, 43, 45, 47, 49, 52,56))

#bins
mdata<-rbin_create(mdata, AGE, bins)
#mdata
bins1 <- rbin_manual(mdata, TARGET_FLAG, OLDCLAIM, c(1, 2000, 3500, 4500, 5800, 7000, 8500, 12000, 30000))


mdata<-rbin_create(mdata, OLDCLAIM, bins1)
mdata$MVR_PTSf<-as.factor(mdata$MVR_PTS)

smp_size <- floor(0.75 * nrow(mdata))

## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(mdata)), size = smp_size)

train <- mdata[train_ind, ]
test <- mdata[-train_ind, ]

#train
logitMod <- glm(TARGET_FLAG ~ URBANICITY+MVR_PTS+REVOKED+CAR_USE+JOB+poly(INCOME,2)+I((INCOME+0.01)^(-1))+TRAVTIME+I(TRAVTIME^2)+MSTATUS+TIF+CAR_TYPE+HOME_VAL+PARENT1+BLUEBOOK+I(log(BLUEBOOK))+KIDSDRIV+EDUCATION+I(AGE^0.5)+I(AGE^(-1))+YOJ+`AGE_<_52`+`OLDCLAIM_<_2000`+`OLDCLAIM_<_3500`+`OLDCLAIM_<_5800`+`OLDCLAIM_<_7000`+`OLDCLAIM_<_12000`+`OLDCLAIM_<_30000`+poly(OLDCLAIM,3),data=train,family=binomial(link="logit"))

summary(logitMod)

## 
## Call:
## glm(formula = TARGET_FLAG ~ URBANICITY + MVR_PTS + REVOKED + 
##     CAR_USE + JOB + poly(INCOME, 2) + I((INCOME + 0.01)^(-1)) + 
##     TRAVTIME + I(TRAVTIME^2) + MSTATUS + TIF + CAR_TYPE + HOME_VAL + 
##     PARENT1 + BLUEBOOK + I(log(BLUEBOOK)) + KIDSDRIV + EDUCATION + 
##     I(AGE^0.5) + I(AGE^(-1)) + YOJ + `AGE_<_52` + `OLDCLAIM_<_2000` + 
##     `OLDCLAIM_<_3500` + `OLDCLAIM_<_5800` + `OLDCLAIM_<_7000` + 
##     `OLDCLAIM_<_12000` + `OLDCLAIM_<_30000` + poly(OLDCLAIM, 
##     3), family = binomial(link = "logit"), data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.3157  -0.6968  -0.3833   0.5754   3.0636  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                     -5.971e+00  2.386e+00  -2.502 0.012349 *  
## URBANICITYz_Highly Rural/ Rural -2.312e+00  1.292e-01 -17.899  < 2e-16 ***
## MVR_PTS                          9.472e-02  1.645e-02   5.759 8.47e-09 ***
## REVOKEDYes                       9.660e-01  1.106e-01   8.737  < 2e-16 ***
## CAR_USEPrivate                  -7.525e-01  1.069e-01  -7.040 1.92e-12 ***
## JOBDoctor                       -7.761e-01  3.336e-01  -2.326 0.019999 *  
## JOBHome Maker                   -4.950e-01  1.810e-01  -2.734 0.006256 ** 
## JOBLawyer                       -2.863e-01  2.202e-01  -1.300 0.193572    
## JOBManager                      -9.749e-01  1.695e-01  -5.750 8.93e-09 ***
## JOBMissing                      -3.426e-01  2.303e-01  -1.488 0.136826    
## JOBProfessional                 -1.829e-01  1.475e-01  -1.239 0.215163    
## JOBStudent                      -3.753e-01  1.658e-01  -2.264 0.023562 *  
## JOBz_Blue Collar                 3.382e-03  1.278e-01   0.026 0.978894    
## poly(INCOME, 2)1                -1.333e+01  4.713e+00  -2.829 0.004675 ** 
## poly(INCOME, 2)2                 7.255e+00  3.723e+00   1.949 0.051332 .  
## I((INCOME + 0.01)^(-1))          7.036e-03  2.173e-03   3.237 0.001207 ** 
## TRAVTIME                         3.467e-02  7.948e-03   4.362 1.29e-05 ***
## I(TRAVTIME^2)                   -2.600e-04  1.053e-04  -2.469 0.013556 *  
## MSTATUSz_No                      5.294e-01  9.579e-02   5.526 3.27e-08 ***
## TIF                             -6.111e-02  8.656e-03  -7.059 1.67e-12 ***
## CAR_TYPEPanel Truck              5.242e-01  1.786e-01   2.935 0.003334 ** 
## CAR_TYPEPickup                   6.259e-01  1.183e-01   5.292 1.21e-07 ***
## CAR_TYPESports Car               9.279e-01  1.268e-01   7.319 2.50e-13 ***
## CAR_TYPEVan                      7.399e-01  1.430e-01   5.174 2.30e-07 ***
## CAR_TYPEz_SUV                    7.663e-01  1.019e-01   7.518 5.56e-14 ***
## HOME_VAL                        -1.275e-06  3.943e-07  -3.233 0.001224 ** 
## PARENT1Yes                       3.574e-01  1.190e-01   3.003 0.002671 ** 
## BLUEBOOK                         1.039e-05  1.206e-05   0.861 0.388983    
## I(log(BLUEBOOK))                -4.605e-01  1.447e-01  -3.182 0.001460 ** 
## KIDSDRIV                         4.592e-01  6.656e-02   6.898 5.26e-12 ***
## EDUCATIONBachelors              -3.116e-01  1.322e-01  -2.358 0.018384 *  
## EDUCATIONMasters                -2.337e-01  1.928e-01  -1.213 0.225312    
## EDUCATIONPhD                    -2.033e-01  2.373e-01  -0.857 0.391514    
## EDUCATIONz_High School           1.245e-01  1.128e-01   1.103 0.269956    
## I(AGE^0.5)                       8.291e-01  2.149e-01   3.859 0.000114 ***
## I(AGE^(-1))                      1.163e+02  2.698e+01   4.311 1.62e-05 ***
## YOJ                              3.300e-02  1.293e-02   2.551 0.010741 *  
## `AGE_<_52`                      -3.810e-01  1.198e-01  -3.181 0.001466 ** 
## `OLDCLAIM_<_2000`                1.894e-01  2.060e-01   0.919 0.357840    
## `OLDCLAIM_<_3500`                2.579e-01  1.517e-01   1.700 0.089103 .  
## `OLDCLAIM_<_5800`                3.206e-01  1.562e-01   2.052 0.040175 *  
## `OLDCLAIM_<_7000`                3.966e-01  1.698e-01   2.336 0.019498 *  
## `OLDCLAIM_<_12000`               1.025e-01  1.794e-01   0.571 0.567835    
## `OLDCLAIM_<_30000`               2.395e-01  1.918e-01   1.249 0.211683    
## poly(OLDCLAIM, 3)1               1.369e+00  2.930e+00   0.467 0.640324    
## poly(OLDCLAIM, 3)2              -1.062e+01  3.536e+00  -3.002 0.002681 ** 
## poly(OLDCLAIM, 3)3               8.134e+00  3.269e+00   2.488 0.012838 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7053.2  on 6119  degrees of freedom
## Residual deviance: 5337.1  on 6073  degrees of freedom
## AIC: 5431.1
## 
## Number of Fisher Scoring iterations: 5

Testing Our Model

We do testing and our model produces 0.75 accuracy.

library(pscl)

pR2(logitMod)

##           llh       llhNull            G2      McFadden          r2ML 
## -2668.5727829 -3526.6228402  1716.1001145     0.2433064     0.2445249 
##          r2CU 
##     0.3574137

test$p<-predict(logitMod, test, type ="response")

test$p1<-ifelse(test$p<0.31,0,1)
auc(test$TARGET_FLAG,test$p1)

## [1] 0.7437326

table(test$p1,test$TARGET_FLAG)

##    
##        0    1
##   0 1133  146
##   1  365  397

library(caret)

confusionMatrix(table(test$p1,test$TARGET_FLAG)[2:1,2:1])

## Confusion Matrix and Statistics
## 
##    
##        1    0
##   1  397  365
##   0  146 1133
##                                           
##                Accuracy : 0.7496          
##                  95% CI : (0.7302, 0.7683)
##     No Information Rate : 0.734           
##     P-Value [Acc > NIR] : 0.0566          
##                                           
##                   Kappa : 0.4319          
##  Mcnemar's Test P-Value : <2e-16          
##                                           
##             Sensitivity : 0.7311          
##             Specificity : 0.7563          
##          Pos Pred Value : 0.5210          
##          Neg Pred Value : 0.8858          
##              Prevalence : 0.2660          
##          Detection Rate : 0.1945          
##    Detection Prevalence : 0.3733          
##       Balanced Accuracy : 0.7437          
##                                           
##        'Positive' Class : 1               
##

library(ROCR)
ROCRpred <- prediction(test$p, test$TARGET_FLAG)
ROCRperf <- performance(ROCRpred, 'tpr','fpr')
plot(ROCRperf, colorize = TRUE, text.adj = c(-0.2,1.7))

#for (i in seq(0, 1, by=0.01)) {
#test$p1<-ifelse(test$p<i,0,1)
#print(i)
#print(auc(test$TARGET_FLAG,test$p1))
#}

Data 621. Blog Post 3. WOE (Weight of Evidence) and IV (Information Value). 05/18/2019.

Mikhail Groysman

May 19, 2019