The data is from a historical marketing campaign in the insurance industry and is automatically downloaded when you install the Information package. The data is stored in two .RDA files, one for the training dataset and one for the validation dataset. Each file has 68 predictive variables and 10k records.

The datasets contain two key indicators:

- PURCHASE: a binary purchase indicator (1 if the client purchased, 0 otherwise). This is the dependent variable passed to create_infotables() below.
- TREATMENT: a binary indicator of whether the client was in the test group (1) or the control group (0). The control group is excluded below.
```r
require(Information)
require(knitr)
options(scipen=10)

### Loading the data
data(train, package="Information")
data(valid, package="Information")

### Exclude the control group
train <- subset(train, TREATMENT==1)
valid <- subset(valid, TREATMENT==1)

### Ranking variables using penalized IV
IV <- create_infotables(data=train,
                        valid=valid,
                        y="PURCHASE")
```

```
## [1] "Variable TREATMENT was removed because it has only 1 unique level"
```

```r
# The strongest six variables:
kable(head(IV$Summary), row.names=FALSE)
```
| Variable | IV | PENALTY | AdjIV |
|---|---|---|---|
| N_OPEN_REV_ACTS | 1.0107695 | 0.0838569 | 0.9269126 |
| TOT_HI_CRDT_CRDT_LMT | 0.9345902 | 0.1026974 | 0.8318929 |
| RATIO_BAL_TO_HI_CRDT | 0.8232539 | 0.0654436 | 0.7578104 |
| D_NA_M_SNC_MST_RCNT_ACT_OPN | 0.6355466 | 0.0766748 | 0.5588718 |
| M_SNC_OLDST_RETAIL_ACT_OPN | 0.5573438 | 0.0784011 | 0.4789427 |
| M_SNC_MST_RCNT_ACT_OPN | 0.5026402 | 0.0604470 | 0.4421932 |
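In the summary, AdjIV is the penalty-adjusted information value, i.e., IV minus PENALTY (for N_OPEN_REV_ACTS, 1.0107695 - 0.0838569 = 0.9269126). A natural screening step is to keep only variables whose adjusted IV clears some threshold; the snippet below is a sketch, and the 0.1 cutoff is an arbitrary illustration, not a recommendation.

```r
# Example screening step (illustrative): keep variables whose penalty-adjusted
# IV exceeds an arbitrary cutoff of 0.1.
strong_vars <- as.character(subset(IV$Summary, AdjIV > 0.1)$Variable)
strong_vars
```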
The IV$Tables object returned by Information is simply a named list of data frames that contain the WOE tables for all variables in the input dataset. Note that the IV and PENALTY columns are cumulative.
```r
kable(IV$Tables$N_OPEN_REV_ACTS)
```
| N_OPEN_REV_ACTS | N | Percent | WOE | IV | PENALTY |
|---|---|---|---|---|---|
| [0,0] | 1469 | 0.2954545 | -2.0465968 | 0.6401443 | 0.0570308 |
| [1,2] | 958 | 0.1926790 | -0.5900120 | 0.6958705 | 0.0622626 |
| [3,3] | 310 | 0.0623492 | 0.2033085 | 0.6986029 | 0.0651455 |
| [4,5] | 583 | 0.1172566 | 0.4419768 | 0.7244762 | 0.0676744 |
| [6,8] | 632 | 0.1271118 | 0.6148243 | 0.7810611 | 0.0715927 |
| [9,11] | 453 | 0.0911102 | 0.8815772 | 0.8692672 | 0.0768324 |
| [12,48] | 567 | 0.1140386 | 0.9883818 | 1.0107695 | 0.0838569 |
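Because the IV and PENALTY columns are cumulative, the last row of a variable's WOE table holds the totals reported in IV$Summary (1.0107695 and 0.0838569 for N_OPEN_REV_ACTS):

```r
# The final row of the cumulative columns matches the summary for this variable
tail(IV$Tables$N_OPEN_REV_ACTS[, c("IV", "PENALTY")], 1)
```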
The table shows that the odds of PURCHASE=1 increase as N_OPEN_REV_ACTS increases (the WOE rises from about -2.05 in the [0,0] bin to 0.99 in the [12,48] bin), although the relationship is not linear.
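For reference, the WOE of a bin is the log of the ratio between the share of events (PURCHASE=1) and the share of non-events (PURCHASE=0) that fall in that bin, and the IV is the WOE-weighted sum of the differences between those shares. The sketch below recomputes these quantities by hand for N_OPEN_REV_ACTS, using the cut points from the table above; the manual_woe() helper and the hard-coded breaks are illustrative assumptions, not part of the Information API.

```r
# Hand-computed WOE and cumulative IV for a binned variable (illustrative only;
# create_infotables() handles binning, NAs, and penalties internally).
manual_woe <- function(x_bin, y) {
  tab <- table(x_bin, y)
  p <- tab[, "1"] / sum(tab[, "1"])   # share of events (y=1) in each bin
  q <- tab[, "0"] / sum(tab[, "0"])   # share of non-events (y=0) in each bin
  data.frame(bin=rownames(tab), WOE=log(p / q), IV=cumsum((p - q) * log(p / q)))
}

# Breaks taken from the WOE table above: [0,0], [1,2], [3,3], [4,5], [6,8], [9,11], [12,48]
bins <- cut(train$N_OPEN_REV_ACTS, breaks=c(-Inf, 0, 2, 3, 5, 8, 11, Inf))
manual_woe(bins, train$PURCHASE)
```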
Note that the Information package attempts to create evenly sized bins in terms of the number of subjects in each group. However, this is not always possible due to ties in the data, as with N_OPEN_REV_ACTS, which has ties at 0. If the variable under consideration is categorical, its distinct categories will show up as rows in the WOE table. Moreover, if the variable has missing values, the WOE table will contain a separate “NA” row which can be used to gauge the impact of missing values. Thus, the framework seamlessly handles missing values and categorical variables without any dummy-coding or imputation.
We can also plot this pattern for better visualization:
```r
plot_infotables(IV, "N_OPEN_REV_ACTS")
```
To compare WOE patterns across variables, we can combine several plots into a grid. Here we plot the first nine variables on one page. Note that we can plot as many variables as we want; plot_infotables will simply spread the plots over multiple pages (at most nine plots per page).

```r
plot_infotables(IV, IV$Summary$Variable[1:9])
```
To compute IVs without the external validation penalty, simply omit the validation dataset:

```r
IV <- create_infotables(data=train, y="PURCHASE")
```

```
## [1] "Variable TREATMENT was removed because it has only 1 unique level"
```
The default number of bins is 10, but we can choose a different number if we want more granularity. Note that the IV formula is fairly invariant to the number of bins, as the quick comparison after the table below illustrates. Also note that the bins are chosen to be evenly sized, to the extent that ties in the data allow.
```r
IV <- create_infotables(data=train,
                        valid=valid,
                        y="PURCHASE",
                        bins=20)
```

```
## [1] "Variable TREATMENT was removed because it has only 1 unique level"
```

```r
kable(IV$Tables$N_OPEN_REV_ACTS)
```
| N_OPEN_REV_ACTS | N | Percent | WOE | IV | PENALTY |
|---|---|---|---|---|---|
| [0,0] | 1469 | 0.2954545 | -2.0465968 | 0.6401443 | 0.0570308 |
| [1,1] | 619 | 0.1244972 | -0.6965280 | 0.6886015 | 0.0717553 |
| [2,2] | 339 | 0.0681818 | -0.4149854 | 0.6989262 | 0.0804254 |
| [3,3] | 310 | 0.0623492 | 0.2033085 | 0.7016586 | 0.0833083 |
| [4,4] | 321 | 0.0645615 | 0.3578370 | 0.7107970 | 0.0838951 |
| [5,5] | 262 | 0.0526951 | 0.5410952 | 0.7286543 | 0.0862807 |
| [6,6] | 241 | 0.0484714 | 0.6636975 | 0.7540782 | 0.0905195 |
| [7,8] | 391 | 0.0786404 | 0.5842726 | 0.7854698 | 0.0906233 |
| [9,9] | 176 | 0.0353982 | 1.0188057 | 0.8323823 | 0.1032686 |
| [10,11] | 277 | 0.0557120 | 0.7920957 | 0.8751621 | 0.1081056 |
| [12,15] | 308 | 0.0619469 | 0.7752886 | 0.9205752 | 0.1082536 |
| [16,48] | 259 | 0.0520917 | 1.2316121 | 1.0247571 | 0.1174581 |
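As a quick check of the claim that IV is fairly invariant to the number of bins, we can compare the total IV for N_OPEN_REV_ACTS under the two binnings. The sketch assumes the two results were stored under separate names (say, IV10 and IV20) instead of overwriting IV as above.

```r
# Hypothetical comparison, assuming the default-bin and 20-bin runs were saved
# as IV10 and IV20, respectively.
tail(IV10$Tables$N_OPEN_REV_ACTS$IV, 1)  # 1.0107695 with the default binning
tail(IV20$Tables$N_OPEN_REV_ACTS$IV, 1)  # 1.0247571 with bins=20
```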
The purpose of exploratory analysis and variable screening is to get to know the data and assess “univariate” predictive strength, before we deploy more sophisticated variable selection approaches.
The weight of evidence (WOE) and information value (IV) provide a great framework for performing exploratory analysis and variable screening prior to building a binary classifier (e.g., logistic regression). The framework seamlessly handles missing values and character variables, and the output is easy to interpret.
The information value originates from information theory and is closely related to the concept of mutual information.
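Concretely, if $p_i$ and $q_i$ denote the shares of events ($Y=1$) and non-events ($Y=0$) that fall in bin $i$ of a predictor (the same quantities used in the WOE sketch above), the information value is the symmetrized Kullback-Leibler divergence between the two distributions:

$$
\mathrm{IV} = \sum_i (p_i - q_i)\,\ln\frac{p_i}{q_i} = D_{\mathrm{KL}}(p \,\|\, q) + D_{\mathrm{KL}}(q \,\|\, p)
$$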
The Information package is specifically written to perform this type of analysis using parallel processing. It also supports exploratory analysis for uplift models, a growing area within marketing analytics. The package is not designed to transform data into WOE vectors for Naive Bayes models, although this feature could be added later.
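For completeness, here is a rough sketch of how one could hand-roll such a WOE transformation from the create_infotables() output; the woe_recode() helper and its parsing of the bin labels are assumptions for illustration, not part of the package.

```r
# Illustrative only: map a numeric variable to the WOE of its bin, using the
# bin labels from one element of IV$Tables. Assumes the variable has no NA bin
# and that all values fall within the range used to build the table.
woe_recode <- function(x, woe_tbl) {
  labels <- woe_tbl[[1]]                                       # e.g. "[0,0]", "[1,2]", ...
  upper  <- as.numeric(sub("^\\[.*,(.*)\\]$", "\\1", labels))  # upper bound of each bin
  bins   <- cut(x, breaks=c(-Inf, upper), labels=labels)
  woe_tbl$WOE[match(bins, labels)]
}

train$N_OPEN_REV_ACTS_WOE <- woe_recode(train$N_OPEN_REV_ACTS, IV$Tables$N_OPEN_REV_ACTS)
```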