The data is from a historical marketing campaign in the insurance industry and is automatically downloaded when you install the Information package. The data is stored in two .RDA files, one for the training dataset and one for the validation dataset. Each file has 68 predictive variables and 10k records.

The datasets contain two key indicators:

- PURCHASE: a binary purchase indicator (1 if the client purchased, 0 otherwise). This is the dependent variable passed to create_infotables() below.
- TREATMENT: a binary indicator of whether the client was in the test group (1) or the control group (0). The control group is excluded below.
```r
require(Information)
require(knitr)
options(scipen=10)

### Loading the data
data(train, package="Information")
data(valid, package="Information")

### Exclude the control group
train <- subset(train, TREATMENT==1)
valid <- subset(valid, TREATMENT==1)

### Ranking variables using penalized IV
IV <- create_infotables(data=train,
                        valid=valid,
                        y="PURCHASE")
```

```
## [1] "Variable TREATMENT was removed because it has only 1 unique level"
```

```r
# The strongest six variables:
kable(head(IV$Summary), row.names=FALSE)
```
| Variable | IV | PENALTY | AdjIV |
|---|---|---|---|
| N_OPEN_REV_ACTS | 1.0107695 | 0.0838569 | 0.9269126 |
| TOT_HI_CRDT_CRDT_LMT | 0.9345902 | 0.1026974 | 0.8318929 |
| RATIO_BAL_TO_HI_CRDT | 0.8232539 | 0.0654436 | 0.7578104 |
| D_NA_M_SNC_MST_RCNT_ACT_OPN | 0.6355466 | 0.0766748 | 0.5588718 |
| M_SNC_OLDST_RETAIL_ACT_OPN | 0.5573438 | 0.0784011 | 0.4789427 |
| M_SNC_MST_RCNT_ACT_OPN | 0.5026402 | 0.0604470 | 0.4421932 |
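In the summary, AdjIV is the penalty-adjusted information value, i.e., IV minus PENALTY (for N_OPEN_REV_ACTS, 1.0107695 - 0.0838569 = 0.9269126). A natural screening step is to keep only variables whose adjusted IV clears some threshold; the snippet below is a sketch, and the 0.1 cutoff is an arbitrary illustration, not a recommendation.

```r
# Example screening step (illustrative): keep variables whose penalty-adjusted
# IV exceeds an arbitrary cutoff of 0.1.
strong_vars <- as.character(subset(IV$Summary, AdjIV > 0.1)$Variable)
strong_vars
```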
The IV$Tables object returned by Information is simply a named list of data frames that contain the WOE tables for all variables in the input dataset. Note that the IV and PENALTY columns are cumulative.
```r
kable(IV$Tables$N_OPEN_REV_ACTS)
```
| N_OPEN_REV_ACTS | N | Percent | WOE | IV | PENALTY |
|---|---|---|---|---|---|
| [0,0] | 1469 | 0.2954545 | -2.0465968 | 0.6401443 | 0.0570308 |
| [1,2] | 958 | 0.1926790 | -0.5900120 | 0.6958705 | 0.0622626 |
| [3,3] | 310 | 0.0623492 | 0.2033085 | 0.6986029 | 0.0651455 |
| [4,5] | 583 | 0.1172566 | 0.4419768 | 0.7244762 | 0.0676744 |
| [6,8] | 632 | 0.1271118 | 0.6148243 | 0.7810611 | 0.0715927 |
| [9,11] | 453 | 0.0911102 | 0.8815772 | 0.8692672 | 0.0768324 |
| [12,48] | 567 | 0.1140386 | 0.9883818 | 1.0107695 | 0.0838569 |
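Because the IV and PENALTY columns are cumulative, the last row of a variable's WOE table holds the totals reported in IV$Summary (1.0107695 and 0.0838569 for N_OPEN_REV_ACTS):

```r
# The final row of the cumulative columns matches the summary for this variable
tail(IV$Tables$N_OPEN_REV_ACTS[, c("IV", "PENALTY")], 1)
```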
The table shows that the odds of PURCHASE=1 increase as N_OPEN_REV_ACTS increases (the WOE rises from about -2.05 in the [0,0] bin to 0.99 in the [12,48] bin), although the relationship is not linear.
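For reference, the WOE of a bin is the log of the ratio between the share of events (PURCHASE=1) and the share of non-events (PURCHASE=0) that fall in that bin, and the IV is the WOE-weighted sum of the differences between those shares. The sketch below recomputes these quantities by hand for N_OPEN_REV_ACTS, using the cut points from the table above; the manual_woe() helper and the hard-coded breaks are illustrative assumptions, not part of the Information API.

```r
# Hand-computed WOE and cumulative IV for a binned variable (illustrative only;
# create_infotables() handles binning, NAs, and penalties internally).
manual_woe <- function(x_bin, y) {
  tab <- table(x_bin, y)
  p <- tab[, "1"] / sum(tab[, "1"])   # share of events (y=1) in each bin
  q <- tab[, "0"] / sum(tab[, "0"])   # share of non-events (y=0) in each bin
  data.frame(bin=rownames(tab), WOE=log(p / q), IV=cumsum((p - q) * log(p / q)))
}

# Breaks taken from the WOE table above: [0,0], [1,2], [3,3], [4,5], [6,8], [9,11], [12,48]
bins <- cut(train$N_OPEN_REV_ACTS, breaks=c(-Inf, 0, 2, 3, 5, 8, 11, Inf))
manual_woe(bins, train$PURCHASE)
```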
Note that the Information package attempts to create evenly sized bins in terms of the number of subjects in each group. However, this is not always possible due to ties in the data, as with N_OPEN_REV_ACTS, which has ties at 0. If the variable under consideration is categorical, its distinct categories will show up as rows in the WOE table. Moreover, if the variable has missing values, the WOE table will contain a separate “NA” row which can be used to gauge the impact of missing values. Thus, the framework seamlessly handles missing values and categorical variables without any dummy-coding or imputation.
We can also plot this pattern for better visualization:
```r
plot_infotables(IV, "N_OPEN_REV_ACTS")
```
To compare WOE patterns across variables, we can combine several plots into a grid. Here we plot the first nine variables on one page. Note that we can plot as many variables as we want; plot_infotables will simply spread the plots over multiple pages (at most nine plots per page).

```r
plot_infotables(IV, IV$Summary$Variable[1:9])
```
To compute IVs without the external validation penalty, simply omit the validation dataset:

```r
IV <- create_infotables(data=train, y="PURCHASE")
```

```
## [1] "Variable TREATMENT was removed because it has only 1 unique level"
```
The default number of bins is 10, but we can choose a different number if we want more granularity. Note that the IV formula is fairly invariant to the number of bins, as the quick comparison after the table below illustrates. Also note that the bins are chosen to be evenly sized, to the extent that ties in the data allow.
```r
IV <- create_infotables(data=train,
                        valid=valid,
                        y="PURCHASE",
                        bins=20)
```

```
## [1] "Variable TREATMENT was removed because it has only 1 unique level"
```

```r
kable(IV$Tables$N_OPEN_REV_ACTS)
```
| N_OPEN_REV_ACTS | N | Percent | WOE | IV | PENALTY |
|---|---|---|---|---|---|
| [0,0] | 1469 | 0.2954545 | -2.0465968 | 0.6401443 | 0.0570308 |
| [1,1] | 619 | 0.1244972 | -0.6965280 | 0.6886015 | 0.0717553 |
| [2,2] | 339 | 0.0681818 | -0.4149854 | 0.6989262 | 0.0804254 |
| [3,3] | 310 | 0.0623492 | 0.2033085 | 0.7016586 | 0.0833083 |
| [4,4] | 321 | 0.0645615 | 0.3578370 | 0.7107970 | 0.0838951 |
| [5,5] | 262 | 0.0526951 | 0.5410952 | 0.7286543 | 0.0862807 |
| [6,6] | 241 | 0.0484714 | 0.6636975 | 0.7540782 | 0.0905195 |
| [7,8] | 391 | 0.0786404 | 0.5842726 | 0.7854698 | 0.0906233 |
| [9,9] | 176 | 0.0353982 | 1.0188057 | 0.8323823 | 0.1032686 |
| [10,11] | 277 | 0.0557120 | 0.7920957 | 0.8751621 | 0.1081056 |
| [12,15] | 308 | 0.0619469 | 0.7752886 | 0.9205752 | 0.1082536 |
| [16,48] | 259 | 0.0520917 | 1.2316121 | 1.0247571 | 0.1174581 |
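As a quick check of the claim that IV is fairly invariant to the number of bins, we can compare the total IV for N_OPEN_REV_ACTS under the two binnings. The sketch assumes the two results were stored under separate names (say, IV10 and IV20) instead of overwriting IV as above.

```r
# Hypothetical comparison, assuming the default-bin and 20-bin runs were saved
# as IV10 and IV20, respectively.
tail(IV10$Tables$N_OPEN_REV_ACTS$IV, 1)  # 1.0107695 with the default binning
tail(IV20$Tables$N_OPEN_REV_ACTS$IV, 1)  # 1.0247571 with bins=20
```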
The purpose of exploratory analysis and variable screening is to get to know the data and assess “univariate” predictive strength, before we deploy more sophisticated variable selection approaches.
The weight of evidence (WOE) and information value (IV) provide a great framework for performing exploratory analysis and variable screening prior to building a binary classifier (e.g., logistic regression). The framework seamlessly handles missing values and character variables, and the output is easy to interpret.
The information value originates from information theory and is closely related to the concept of mutual information.
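Concretely, if $p_i$ and $q_i$ denote the shares of events ($Y=1$) and non-events ($Y=0$) that fall in bin $i$ of a predictor (the same quantities used in the WOE sketch above), the information value is the symmetrized Kullback-Leibler divergence between the two distributions:

$$
\mathrm{IV} = \sum_i (p_i - q_i)\,\ln\frac{p_i}{q_i} = D_{\mathrm{KL}}(p \,\|\, q) + D_{\mathrm{KL}}(q \,\|\, p)
$$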
The Information package is specifically written to perform this type of analysis using parallel processing. It also supports exploratory analysis for uplift models, a growing area within marketing analytics. The package is not designed to transform data into WOE vectors for Naive Bayes models, although this feature could be added later.
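For completeness, here is a rough sketch of how one could hand-roll such a WOE transformation from the create_infotables() output; the woe_recode() helper and its parsing of the bin labels are assumptions for illustration, not part of the package.

```r
# Illustrative only: map a numeric variable to the WOE of its bin, using the
# bin labels from one element of IV$Tables. Assumes the variable has no NA bin
# and that all values fall within the range used to build the table.
woe_recode <- function(x, woe_tbl) {
  labels <- woe_tbl[[1]]                                       # e.g. "[0,0]", "[1,2]", ...
  upper  <- as.numeric(sub("^\\[.*,(.*)\\]$", "\\1", labels))  # upper bound of each bin
  bins   <- cut(x, breaks=c(-Inf, upper), labels=labels)
  woe_tbl$WOE[match(bins, labels)]
}

train$N_OPEN_REV_ACTS_WOE <- woe_recode(train$N_OPEN_REV_ACTS, IV$Tables$N_OPEN_REV_ACTS)
```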