WOE and IV are used for select categorical variables (and sometimes non-categorical too) in logistical regression. It is not the most common method, nevertheless the method could be useful, if we have many categorical variables and we want to decide which ones are the better fit for our logistical model.
My source is http://support.sas.com/documentation/cdl/en/prochp/66704/HTML/default/viewer.htm#prochp_hpbin_details02.htm.
If \(X\) is a categorical independent variable, \(Y\) is a binary dependent variable, where \(Y=1\) is an event and \(Y=0\) is a non-event, then:
\(WOE_x= ln(\frac {p^{non-vent}_{attribute}} {p^{event}_{attribute}})= ln \left (\frac { \frac { {N^{non-event}_{attribute}}} {N^{total}_{attribute} }} {\frac{ {N^{event}_{attribute}}} {N^{total}_{attribute}} } \right)\),
where \(p\) - probability of \(Y\). And \(N\) number of \(Y=1\), \(Y=0\), or total.
Sometimes \(p\) will be 0, and we do not want it, as \(ln\) of 0 is undefined. To address it, we can add \(x\) (commonly \(x=0.5\)) to our calulations:
\(WOE_x= ln \left (\frac { \frac { {N^{non-event}_{attribute}+x}} {N^{total}_{attribute} }} {\frac{ {N^{event}_{attribute}+x}} {N^{total}_{attribute}} } \right)\)
IV is our final results, that should help us to decide if we should use \(X\) in our logistical model or not.
\(IV= \sum_{i=1}^m \left (\frac {N^{non-event}_{attribute}+x} {N^{total}_{attribute}} - \frac {N^{event}_{attribute}+x} {N^{total}_{attribute}} \right)*WOE_i\)
Generally speaking the higher IV indicates a better predictive power of \(X\).
I will use HW4 data to demonstrate how to apply WOE and IV.
library(kableExtra)
library(dplyr)
library(tidyr)
library(stringr)
library(purrr)
library(glmnet)
HW4<-read.csv("https://raw.githubusercontent.com/simplymathematics/621/master/HW4/insurance_training_data.csv")
head(HW4)
## INDEX TARGET_FLAG TARGET_AMT KIDSDRIV AGE HOMEKIDS YOJ INCOME PARENT1
## 1 1 0 0 0 60 0 11 $67,349 No
## 2 2 0 0 0 43 0 11 $91,449 No
## 3 4 0 0 0 35 1 10 $16,039 No
## 4 5 0 0 0 51 0 14 No
## 5 6 0 0 0 50 0 NA $114,986 No
## 6 7 1 2946 0 34 1 12 $125,301 Yes
## HOME_VAL MSTATUS SEX EDUCATION JOB TRAVTIME CAR_USE
## 1 $0 z_No M PhD Professional 14 Private
## 2 $257,252 z_No M z_High School z_Blue Collar 22 Commercial
## 3 $124,191 Yes z_F z_High School Clerical 5 Private
## 4 $306,251 Yes M <High School z_Blue Collar 32 Private
## 5 $243,925 Yes z_F PhD Doctor 36 Private
## 6 $0 z_No z_F Bachelors z_Blue Collar 46 Commercial
## BLUEBOOK TIF CAR_TYPE RED_CAR OLDCLAIM CLM_FREQ REVOKED MVR_PTS
## 1 $14,230 11 Minivan yes $4,461 2 No 3
## 2 $14,940 1 Minivan yes $0 0 No 0
## 3 $4,010 4 z_SUV no $38,690 2 No 3
## 4 $15,440 7 Minivan yes $0 0 No 0
## 5 $18,000 1 z_SUV no $19,217 2 Yes 3
## 6 $17,430 1 Sports Car no $0 0 No 0
## CAR_AGE URBANICITY
## 1 18 Highly Urban/ Urban
## 2 1 Highly Urban/ Urban
## 3 10 Highly Urban/ Urban
## 4 6 Highly Urban/ Urban
## 5 17 Highly Urban/ Urban
## 6 7 Highly Urban/ Urban
We have the following variables of interest:
Let’s create tables for them:
HW4$JOB<-as.character(HW4$JOB)
HW4$JOB[HW4$JOB=='']<-'Missing'
HW4$OLDCLAIM <- as.numeric(gsub('[$,]', '', HW4$OLDCLAIM))
HW4$BLUEBOOK <- as.numeric(gsub('[$,]', '', HW4$BLUEBOOK))
HW4$INCOME <- as.numeric(gsub('[$,]', '', HW4$INCOME))
HW4$INCOME[is.na(HW4$INCOME)]<-mean(HW4$INCOME,na.rm=TRUE)
HW4$HOME_VAL <- as.numeric(gsub('[$,]', '', HW4$HOME_VAL))
HW4$HOME_VAL[is.na(HW4$HOME_VAL)]<-mean(HW4$HOME_VAL,na.rm=TRUE)
#cor(HW4$TARGET_FLAG,HW4$HOME_VAL)
HW4$AGE[is.na(HW4$AGE)]<-mean(HW4$AGE,na.rm=TRUE)
#HW4%>%filter(is.na(YOJ))%>%arrange(JOB)
HW4$YOJ[is.na(HW4$YOJ)]<-mean(HW4$YOJ,na.rm=TRUE)
HW4$CAR_AGE[is.na(HW4$CAR_AGE)]<-mean(HW4$CAR_AGE,na.rm=TRUE)
kable(round(prop.table(table(HW4$TARGET_FLAG,HW4$EDUCATION),1),2))
| <High School | Bachelors | Masters | PhD | z_High School | |
|---|---|---|---|---|---|
| 0 | 0.14 | 0.29 | 0.22 | 0.10 | 0.26 |
| 1 | 0.18 | 0.24 | 0.15 | 0.06 | 0.37 |
kable(round(prop.table(table(HW4$TARGET_FLAG, HW4$JOB),1),2))
| Clerical | Doctor | Home Maker | Lawyer | Manager | Missing | Professional | Student | z_Blue Collar | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.15 | 0.04 | 0.08 | 0.11 | 0.14 | 0.06 | 0.14 | 0.07 | 0.20 |
| 1 | 0.17 | 0.01 | 0.08 | 0.07 | 0.06 | 0.06 | 0.11 | 0.12 | 0.29 |
kable(round(prop.table(table(HW4$TARGET_FLAG,HW4$CAR_TYPE),1),2))
| Minivan | Panel Truck | Pickup | Sports Car | Van | z_SUV | |
|---|---|---|---|---|---|---|
| 0 | 0.30 | 0.08 | 0.16 | 0.10 | 0.09 | 0.27 |
| 1 | 0.16 | 0.08 | 0.21 | 0.14 | 0.09 | 0.31 |
#kable(summary(HW4))
#HW4%>%select_if(is.numeric)%>%cor()
From tables we can see that Education seems to have predictive power. Even though, less than High School and High School could be combined as well as Masters and PhD. Job and Car Type variables seem to have predictive power too. We might be able to combine Van And Truck.
Let’s do education
woe_ed<-as.data.frame(prop.table(table(HW4$TARGET_FLAG,HW4$EDUCATION),1))
colnames(woe_ed)<-c("flag","educ","value")
twoe_ed<-woe_ed%>%spread(flag,value)
#twoe_ed
colnames(twoe_ed)<-c("educ","ne","e")
twoe_ed$woe<-log(twoe_ed$ne/twoe_ed$e)
twoe_ed$woe1<-(twoe_ed$ne-twoe_ed$e)*twoe_ed$woe
#twoe_ed
paste0('IV=',round(sum(twoe_ed$woe1),2))
## [1] "IV=0.11"
Actually Education has only medium predictive power, IV is only 0.11.
Information Value Predictive Power < 0.02 useless for prediction 0.02 to 0.1 Weak predictor 0.1 to 0.3 Medium predictor 0.3 to 0.5 Strong predictor >0.5 Suspicious or too good to be true
Source: http://ucanalytics.com/blogs/information-value-and-weight-of-evidencebanking-case/
woe_j<-as.data.frame(prop.table(table(HW4$TARGET_FLAG,HW4$JOB),1))
colnames(woe_j)<-c("flag","j","value")
twoe_j<-woe_j%>%spread(flag,value)
#twoe_j
colnames(twoe_j)<-c("j","ne","e")
twoe_j$woe<-log(twoe_j$ne/twoe_j$e)
twoe_j$woe1<-(twoe_j$ne-twoe_j$e)*twoe_j$woe
#twoe_j
paste0('IV=',round(sum(twoe_j$woe1),2))
## [1] "IV=0.18"
Jobs have medium predictive power too.
woe_c<-as.data.frame(prop.table(table(HW4$TARGET_FLAG,HW4$CAR_TYPE),1))
colnames(woe_c)<-c("flag","c","value")
twoe_c<-woe_c%>%spread(flag,value)
colnames(twoe_c)<-c("c","ne","e")
twoe_c$woe<-log(twoe_c$ne/twoe_c$e)
twoe_c$woe1<-(twoe_c$ne-twoe_c$e)*twoe_c$woe
#twoe_c
paste0('IV=',round(sum(twoe_c$woe1),2))
## [1] "IV=0.12"
Car Type has a medium predictive power.
None of our variables has strong predictive power.
We can also convert some of numerical variables to a factor and see if they are useful as predictors.
Such as KIDSDRIV.
#cor(HW4$TARGET_FLAG,HW4$KIDSDRIV)
HW4$KIDSDRIVf<-as.factor(HW4$KIDSDRIV)
library(tidyr)
woe_k<-as.data.frame(prop.table(table(HW4$TARGET_FLAG,HW4$KIDSDRIVf),1))
colnames(woe_k)<-c("flag","k","value")
twoe_k<-woe_k%>%spread(flag,value)
colnames(twoe_k)<-c("k","ne","e")
twoe_k$woe<-log(twoe_k$ne/twoe_k$e)
twoe_k$woe1<-(twoe_k$ne-twoe_k$e)*twoe_k$woe
paste0('IV=',round(sum(twoe_k$woe1),2))
## [1] "IV=0.05"
Number of kids driving does not have much predictive power.
Let’s look at HOMEKIDSf.
#cor(HW4$TARGET_FLAG,HW4$HOMEKIDS)
HW4$HOMEKIDSf<-as.factor(HW4$HOMEKIDS)
woe_h<-as.data.frame(prop.table(table(HW4$TARGET_FLAG,HW4$HOMEKIDSf),1))
colnames(woe_h)<-c("flag","h","value")
twoe_h<-woe_h%>%spread(flag,value)
colnames(twoe_h)<-c("h","ne","e")
twoe_h$woe<-log(twoe_h$ne/twoe_h$e)
twoe_h$woe1<-(twoe_h$ne-twoe_h$e)*twoe_h$woe
#twoe_h
paste0('IV=',round(sum(twoe_h$woe1),2))
## [1] "IV=0.08"
Children at home only have a weak predictive power.
R function “woe” calculates IV for us. Let’s run it.
#install.packages("klaR")
library(klaR)
new<-data.frame(HW4)
new$JOB<-as.factor(new$JOB)
new$YOJ<-as.factor(new$YOJ)
new$TIF<-as.factor(new$TIF)
new$INCOME<-as.factor(new$INCOME)
new$BLUE_INCOME<-as.factor(new$BLUEBOOK)
new$HOME_VAL<-as.factor(new$HOME_VAL)
new$TRAVTIME<-as.factor(new$TRAVTIME)
new$AGEf<-as.factor(new$AGE)
new$OLDCLAIMf<-as.factor(new$OLDCLAIM)
new$CLM_FREQ<-as.factor(new$CLM_FREQ)
new$MVR_PTS<-as.factor(new$MVR_PTS)
new$CAR_AGE<-as.factor(new$CAR_AGE)
new$TARGET_FLAG<-as.factor(new$TARGET_FLAG)
woemodel <- woe(TARGET_FLAG~., data = new, zeroadj=0.5, applyontrain = TRUE)
## At least one empty cell (class x level) does exists. Zero adjustment applied!
## At least one empty cell (class x level) does exists. Zero adjustment applied!
## At least one empty cell (class x level) does exists. Zero adjustment applied!
## At least one empty cell (class x level) does exists. Zero adjustment applied!
## At least one empty cell (class x level) does exists. Zero adjustment applied!
## At least one empty cell (class x level) does exists. Zero adjustment applied!
## At least one empty cell (class x level) does exists. Zero adjustment applied!
## At least one empty cell (class x level) does exists. Zero adjustment applied!
## At least one empty cell (class x level) does exists. Zero adjustment applied!
## At least one empty cell (class x level) does exists. Zero adjustment applied!
woemodel
## Information values of transformed variables:
##
## IV
## OLDCLAIMf 0.7175059273
## BLUE_INCOME 0.7097344647
## INCOME 0.4096604413
## URBANICITY 0.3739130414
## HOME_VAL 0.3050147636
## CLM_FREQ 0.2967500330
## MVR_PTS 0.2436927720
## AGEf 0.1908587640
## JOB 0.1786496970
## CAR_TYPE 0.1178921685
## PARENT1 0.1134890318
## EDUCATION 0.1091581941
## REVOKED 0.1051712794
## CAR_USE 0.1022645836
## MSTATUS 0.0925496268
## HOMEKIDSf 0.0841196372
## CAR_AGE 0.0683451826
## TRAVTIME 0.0655595072
## TIF 0.0530542121
## YOJ 0.0529557779
## KIDSDRIVf 0.0527743301
## SEX 0.0022921661
## RED_CAR 0.0002493732
The function shows that if we convert all variables as factors, then OLDCLAIM has the most predictive power.
Let’s look at another R functionv- “rbin”. This function allows us to allocate variables into bins.
library(rbin)
bins <- rbin_manual(new, TARGET_FLAG, AGE, c(34, 38, 41, 43, 45, 47, 49, 52,56))
bins
## Binning Summary
## ---------------------------
## Method Manual
## Response TARGET_FLAG
## Predictor AGE
## Bins 10
## Count 8161
## Goods 2153
## Bads 6008
## Entropy 0.82
## Information Value 0.12
##
##
## # A tibble: 10 x 7
## cut_point bin_count good bad woe iv entropy
## <chr> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 < 34 813 328 485 -0.635 0.0455 0.973
## 2 < 38 829 269 560 -0.293 0.00930 0.909
## 3 < 41 904 267 637 -0.157 0.00282 0.876
## 4 < 43 669 185 484 -0.0645 0.000346 0.851
## 5 < 45 693 170 523 0.0976 0.000789 0.804
## 6 < 47 777 173 604 0.224 0.00452 0.765
## 7 < 49 718 147 571 0.331 0.00885 0.731
## 8 < 52 958 168 790 0.522 0.0279 0.670
## 9 < 56 942 183 759 0.396 0.0164 0.710
## 10 >= 56 858 263 595 -0.210 0.00485 0.889
#rbin_create(new, AGE, bins)
bins1 <- rbin_manual(new, TARGET_FLAG, OLDCLAIM, c(1, 2000, 3500, 4500, 5800, 7000, 8500, 12000, 30000))
bins1
## Binning Summary
## ---------------------------
## Method Manual
## Response TARGET_FLAG
## Predictor OLDCLAIM
## Bins 10
## Count 8161
## Goods 2153
## Bads 6008
## Entropy 0.79
## Information Value 0.3
##
##
## # A tibble: 10 x 7
## cut_point bin_count good bad woe iv entropy
## <chr> <int> <int> <int> <dbl> <dbl> <dbl>
## 1 < 1 5009 898 4111 0.495 0.132 0.678
## 2 < 2000 365 138 227 -0.529 0.0139 0.957
## 3 < 3500 370 153 217 -0.677 0.0236 0.978
## 4 < 4500 322 139 183 -0.751 0.0256 0.986
## 5 < 5800 438 163 275 -0.503 0.0151 0.952
## 6 < 7000 364 138 226 -0.533 0.0141 0.957
## 7 < 8500 338 137 201 -0.643 0.0194 0.974
## 8 < 12000 302 131 171 -0.760 0.0246 0.987
## 9 < 30000 311 122 189 -0.589 0.0148 0.966
## 10 >= 30000 342 134 208 -0.587 0.0162 0.966
#rbin_create(new, OLDCLAIM, bins1)
Let’s use LASSO to compare LASSO’s results to the variables selected by WOE/IV.
ldata<-HW4
ldata$JOB<-as.factor(ldata$JOB)
ldata<-rbin_create(ldata, AGE, bins)
ldata<-rbin_create(ldata, OLDCLAIM, bins1)
vars_nameh<-ldata%>%select_if(is.factor)%>%colnames()%>%str_c(collapse = "+")
vars_nameh1<-ldata[,-c(1:3)]%>%select_if(negate(is.factor))
model_stringh <- paste("TARGET_FLAG ~",vars_nameh )
x_trainh <- model.matrix(as.formula(model_stringh), HW4)
xh<- as.matrix(x_trainh)
x1h <- as.matrix(vars_nameh1)
x2h<-cbind(xh,x1h)
#head(x2h)
#dim(x2h)
dXh<- apply(x2h, 2, function(y) (y - mean(y)) / sd(y) ^ as.logical(sd(y)))
#dim(dXh)
#head(dXh)
dX1h<-cbind(dXh,ldata[,c(2)])
#head(dX1h)
#typeof(dX1h)
trainh <- as.data.frame(dX1h)
#dim(trainh)
#colnames(trainh)
colnames(trainh)[66] <- "TARGET_FLAG"
dYh<-trainh[,66]
#typeof(trainh)
#class(trainh$TARGET_FLAG)
#trainh$KIDSDRIV<-NULL
xh <- model.matrix(TARGET_FLAG ~ .+poly(INCOME,3)+poly(OLDCLAIM,3)+poly(HOME_VAL,3)+poly(BLUEBOOK,3), data = trainh)
#head(trainh)
cv.lasso<- cv.glmnet(x=xh, y=dYh, family = "binomial", alpha = 1)
plot(cv.lasso)
#cv.lasso$lambda.min
cv.lasso$lambda.1se
## [1] 0.007304067
#coef(cv.lasso, s=cv.lasso$lambda.min)
coef(cv.lasso, s=cv.lasso$lambda.1se)
## 79 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) -1.3104377786
## (Intercept) .
## `(Intercept)` .
## PARENT1Yes 0.1019857547
## MSTATUSz_No 0.1862842126
## SEXz_F .
## EDUCATIONBachelors -0.0255323682
## EDUCATIONMasters -0.0012823853
## EDUCATIONPhD .
## `EDUCATIONz_High School` 0.0635828701
## JOBDoctor -0.0383850784
## `JOBHome Maker` .
## JOBLawyer .
## JOBManager -0.1840282732
## JOBMissing .
## JOBProfessional .
## JOBStudent .
## `JOBz_Blue Collar` 0.0310322427
## CAR_USEPrivate -0.3322171711
## `CAR_TYPEPanel Truck` .
## CAR_TYPEPickup 0.0400098915
## `CAR_TYPESports Car` 0.1282677628
## CAR_TYPEVan 0.0290287571
## CAR_TYPEz_SUV 0.1194391451
## RED_CARyes .
## REVOKEDYes 0.2318882048
## `URBANICITYz_Highly Rural/ Rural` -0.7765100430
## KIDSDRIVf1 0.0340810468
## KIDSDRIVf2 .
## KIDSDRIVf3 .
## KIDSDRIVf4 .
## HOMEKIDSf1 0.0108753957
## HOMEKIDSf2 .
## HOMEKIDSf3 .
## HOMEKIDSf4 .
## HOMEKIDSf5 .
## KIDSDRIV 0.1548089487
## AGE -0.1104398475
## HOMEKIDS .
## YOJ .
## INCOME -0.1725923945
## HOME_VAL -0.1569225832
## TRAVTIME 0.1613351379
## BLUEBOOK -0.1158974439
## TIF -0.1587347161
## OLDCLAIM .
## CLM_FREQ 0.1057836448
## MVR_PTS 0.2002015822
## CAR_AGE -0.0539490387
## `AGE_<_38` .
## `AGE_<_41` .
## `AGE_<_43` .
## `AGE_<_45` .
## `AGE_<_47` -0.0032083445
## `AGE_<_49` .
## `AGE_<_52` -0.0346096519
## `AGE_<_56` .
## `AGE_>=_56` 0.1706288966
## `OLDCLAIM_<_2000` .
## `OLDCLAIM_<_3500` 0.0008731253
## `OLDCLAIM_<_4500` -0.0153720195
## `OLDCLAIM_<_5800` 0.0098188586
## `OLDCLAIM_<_7000` 0.0148232776
## `OLDCLAIM_<_8500` .
## `OLDCLAIM_<_12000` .
## `OLDCLAIM_<_30000` .
## `OLDCLAIM_>=_30000` .
## poly(INCOME, 3)1 -1.0211554255
## poly(INCOME, 3)2 4.7347222698
## poly(INCOME, 3)3 .
## poly(OLDCLAIM, 3)1 .
## poly(OLDCLAIM, 3)2 -7.7594224933
## poly(OLDCLAIM, 3)3 3.9617917508
## poly(HOME_VAL, 3)1 -0.0011403323
## poly(HOME_VAL, 3)2 0.1789111559
## poly(HOME_VAL, 3)3 .
## poly(BLUEBOOK, 3)1 -0.9498716995
## poly(BLUEBOOK, 3)2 5.9877334624
## poly(BLUEBOOK, 3)3 .
LASSO choose OLDCLAIM polynomial as the strongest predictor, which is possibly better fit than binnig OLDCLAIM.
Let’s try to fit our variables into model and test the model.
mdata<-HW4
mdata$linc<-log(HW4$INCOME+0.01)
bins <- rbin_manual(mdata, TARGET_FLAG, AGE, c(34, 38, 41, 43, 45, 47, 49, 52,56))
#bins
mdata<-rbin_create(mdata, AGE, bins)
#mdata
bins1 <- rbin_manual(mdata, TARGET_FLAG, OLDCLAIM, c(1, 2000, 3500, 4500, 5800, 7000, 8500, 12000, 30000))
mdata<-rbin_create(mdata, OLDCLAIM, bins1)
mdata$MVR_PTSf<-as.factor(mdata$MVR_PTS)
smp_size <- floor(0.75 * nrow(mdata))
## set the seed to make your partition reproducible
set.seed(123)
train_ind <- sample(seq_len(nrow(mdata)), size = smp_size)
train <- mdata[train_ind, ]
test <- mdata[-train_ind, ]
#train
logitMod <- glm(TARGET_FLAG ~ URBANICITY+MVR_PTS+REVOKED+CAR_USE+JOB+poly(INCOME,2)+I((INCOME+0.01)^(-1))+TRAVTIME+I(TRAVTIME^2)+MSTATUS+TIF+CAR_TYPE+HOME_VAL+PARENT1+BLUEBOOK+I(log(BLUEBOOK))+KIDSDRIV+EDUCATION+I(AGE^0.5)+I(AGE^(-1))+YOJ+`AGE_<_52`+`OLDCLAIM_<_2000`+`OLDCLAIM_<_3500`+`OLDCLAIM_<_5800`+`OLDCLAIM_<_7000`+`OLDCLAIM_<_12000`+`OLDCLAIM_<_30000`+poly(OLDCLAIM,3),data=train,family=binomial(link="logit"))
summary(logitMod)
##
## Call:
## glm(formula = TARGET_FLAG ~ URBANICITY + MVR_PTS + REVOKED +
## CAR_USE + JOB + poly(INCOME, 2) + I((INCOME + 0.01)^(-1)) +
## TRAVTIME + I(TRAVTIME^2) + MSTATUS + TIF + CAR_TYPE + HOME_VAL +
## PARENT1 + BLUEBOOK + I(log(BLUEBOOK)) + KIDSDRIV + EDUCATION +
## I(AGE^0.5) + I(AGE^(-1)) + YOJ + `AGE_<_52` + `OLDCLAIM_<_2000` +
## `OLDCLAIM_<_3500` + `OLDCLAIM_<_5800` + `OLDCLAIM_<_7000` +
## `OLDCLAIM_<_12000` + `OLDCLAIM_<_30000` + poly(OLDCLAIM,
## 3), family = binomial(link = "logit"), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3157 -0.6968 -0.3833 0.5754 3.0636
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.971e+00 2.386e+00 -2.502 0.012349 *
## URBANICITYz_Highly Rural/ Rural -2.312e+00 1.292e-01 -17.899 < 2e-16 ***
## MVR_PTS 9.472e-02 1.645e-02 5.759 8.47e-09 ***
## REVOKEDYes 9.660e-01 1.106e-01 8.737 < 2e-16 ***
## CAR_USEPrivate -7.525e-01 1.069e-01 -7.040 1.92e-12 ***
## JOBDoctor -7.761e-01 3.336e-01 -2.326 0.019999 *
## JOBHome Maker -4.950e-01 1.810e-01 -2.734 0.006256 **
## JOBLawyer -2.863e-01 2.202e-01 -1.300 0.193572
## JOBManager -9.749e-01 1.695e-01 -5.750 8.93e-09 ***
## JOBMissing -3.426e-01 2.303e-01 -1.488 0.136826
## JOBProfessional -1.829e-01 1.475e-01 -1.239 0.215163
## JOBStudent -3.753e-01 1.658e-01 -2.264 0.023562 *
## JOBz_Blue Collar 3.382e-03 1.278e-01 0.026 0.978894
## poly(INCOME, 2)1 -1.333e+01 4.713e+00 -2.829 0.004675 **
## poly(INCOME, 2)2 7.255e+00 3.723e+00 1.949 0.051332 .
## I((INCOME + 0.01)^(-1)) 7.036e-03 2.173e-03 3.237 0.001207 **
## TRAVTIME 3.467e-02 7.948e-03 4.362 1.29e-05 ***
## I(TRAVTIME^2) -2.600e-04 1.053e-04 -2.469 0.013556 *
## MSTATUSz_No 5.294e-01 9.579e-02 5.526 3.27e-08 ***
## TIF -6.111e-02 8.656e-03 -7.059 1.67e-12 ***
## CAR_TYPEPanel Truck 5.242e-01 1.786e-01 2.935 0.003334 **
## CAR_TYPEPickup 6.259e-01 1.183e-01 5.292 1.21e-07 ***
## CAR_TYPESports Car 9.279e-01 1.268e-01 7.319 2.50e-13 ***
## CAR_TYPEVan 7.399e-01 1.430e-01 5.174 2.30e-07 ***
## CAR_TYPEz_SUV 7.663e-01 1.019e-01 7.518 5.56e-14 ***
## HOME_VAL -1.275e-06 3.943e-07 -3.233 0.001224 **
## PARENT1Yes 3.574e-01 1.190e-01 3.003 0.002671 **
## BLUEBOOK 1.039e-05 1.206e-05 0.861 0.388983
## I(log(BLUEBOOK)) -4.605e-01 1.447e-01 -3.182 0.001460 **
## KIDSDRIV 4.592e-01 6.656e-02 6.898 5.26e-12 ***
## EDUCATIONBachelors -3.116e-01 1.322e-01 -2.358 0.018384 *
## EDUCATIONMasters -2.337e-01 1.928e-01 -1.213 0.225312
## EDUCATIONPhD -2.033e-01 2.373e-01 -0.857 0.391514
## EDUCATIONz_High School 1.245e-01 1.128e-01 1.103 0.269956
## I(AGE^0.5) 8.291e-01 2.149e-01 3.859 0.000114 ***
## I(AGE^(-1)) 1.163e+02 2.698e+01 4.311 1.62e-05 ***
## YOJ 3.300e-02 1.293e-02 2.551 0.010741 *
## `AGE_<_52` -3.810e-01 1.198e-01 -3.181 0.001466 **
## `OLDCLAIM_<_2000` 1.894e-01 2.060e-01 0.919 0.357840
## `OLDCLAIM_<_3500` 2.579e-01 1.517e-01 1.700 0.089103 .
## `OLDCLAIM_<_5800` 3.206e-01 1.562e-01 2.052 0.040175 *
## `OLDCLAIM_<_7000` 3.966e-01 1.698e-01 2.336 0.019498 *
## `OLDCLAIM_<_12000` 1.025e-01 1.794e-01 0.571 0.567835
## `OLDCLAIM_<_30000` 2.395e-01 1.918e-01 1.249 0.211683
## poly(OLDCLAIM, 3)1 1.369e+00 2.930e+00 0.467 0.640324
## poly(OLDCLAIM, 3)2 -1.062e+01 3.536e+00 -3.002 0.002681 **
## poly(OLDCLAIM, 3)3 8.134e+00 3.269e+00 2.488 0.012838 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 7053.2 on 6119 degrees of freedom
## Residual deviance: 5337.1 on 6073 degrees of freedom
## AIC: 5431.1
##
## Number of Fisher Scoring iterations: 5
We do testing and our model produces 0.75 accuracy.
library(pscl)
pR2(logitMod)
## llh llhNull G2 McFadden r2ML
## -2668.5727829 -3526.6228402 1716.1001145 0.2433064 0.2445249
## r2CU
## 0.3574137
test$p<-predict(logitMod, test, type ="response")
test$p1<-ifelse(test$p<0.31,0,1)
auc(test$TARGET_FLAG,test$p1)
## [1] 0.7437326
table(test$p1,test$TARGET_FLAG)
##
## 0 1
## 0 1133 146
## 1 365 397
library(caret)
confusionMatrix(table(test$p1,test$TARGET_FLAG)[2:1,2:1])
## Confusion Matrix and Statistics
##
##
## 1 0
## 1 397 365
## 0 146 1133
##
## Accuracy : 0.7496
## 95% CI : (0.7302, 0.7683)
## No Information Rate : 0.734
## P-Value [Acc > NIR] : 0.0566
##
## Kappa : 0.4319
## Mcnemar's Test P-Value : <2e-16
##
## Sensitivity : 0.7311
## Specificity : 0.7563
## Pos Pred Value : 0.5210
## Neg Pred Value : 0.8858
## Prevalence : 0.2660
## Detection Rate : 0.1945
## Detection Prevalence : 0.3733
## Balanced Accuracy : 0.7437
##
## 'Positive' Class : 1
##
library(ROCR)
ROCRpred <- prediction(test$p, test$TARGET_FLAG)
ROCRperf <- performance(ROCRpred, 'tpr','fpr')
plot(ROCRperf, colorize = TRUE, text.adj = c(-0.2,1.7))
#for (i in seq(0, 1, by=0.01)) {
#test$p1<-ifelse(test$p<i,0,1)
#print(i)
#print(auc(test$TARGET_FLAG,test$p1))
#}