Rule-based systems are a popular approach to solving complex business problems. Rules are flexible and usually let the business system change its behavior without requiring a new code release. In addition, rule-based systems are typically more interpretable than other modeling approaches such as support vector machines or neural networks. In some applications this interpretability outweighs the accuracy gained from black-box machine learning approaches.
Core system rules can be written by subject matter experts with a deep understanding of the business problem, or they can be constructed automatically by machine learning if labeled historical data exists. This paper shows how rules can be created automatically using historical data and R.
To illustrate this approach we will use the popular Iris flower data set. The Iris flower data set, or Fisher’s Iris data set, is a multivariate data set introduced by Sir Ronald Fisher (1936) as an example of discriminant analysis. It is sometimes called Anderson’s Iris data set because Edgar Anderson collected the data to quantify the morphological variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula “all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus”. We will use this data as a proxy for a categorization problem such as the identification of fraud.
Let’s take a look at the Iris data set. The data set contains 4 predictors (named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width) and 1 output factor (Species). Species is what our rules will attempt to categorize. In your business problem, Species might be a category of Fraud, Risk, etc.
data(iris)
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Let’s look at the box-and-whisker plot of the grouped predictor values. A note on box-and-whisker plots, or boxplots: in descriptive statistics, a boxplot is a convenient way of graphically depicting groups of numerical data through their quartiles. Boxplots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers are plotted as individual points. This plot is produced using base graphics in R.
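The plotting code itself is not shown in this write-up. A minimal sketch of how such a plot could be drawn with base graphics follows; it assumes one panel per predictor with the values grouped by Species, and the 2 x 2 layout and the `predictors` variable name are choices made here for illustration, not necessarily those of the original figure.

```r
# Sketch only: one boxplot panel per predictor, grouped by species.
data(iris)

predictors <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# Arrange the four panels in a 2 x 2 grid and remember the old settings.
old_par <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))

for (p in predictors) {
  # split() yields one numeric vector per species, which boxplot() accepts as a list.
  boxplot(split(iris[[p]], iris$Species),
          main = p, xlab = "Species", ylab = "cm")
}

# Restore the previous graphics settings.
par(old_par)
```

Each panel shows how well a single measurement separates the three species, which gives a first hint about which predictors a tree is likely to split on.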
The model training in this section is based on the R caret package. Caret (short for Classification And REgression Training) is a set of functions that attempt to streamline the process of creating predictive models (see http://caret.r-forge.r-project.org/). In addition, the training uses the rpart R library for Recursive Partitioning and Regression Trees.
The code to train our model with the rpart2 method and cross-validation is as follows:
library(rpart)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
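# x and y are not defined in the listing above; here the four measurement
# columns are taken as predictors and Species as the outcome
x <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
y <- iris$Species
rPartMdl <- train(x, y, method = "rpart2", tuneLength = 10, trControl = trainControl(method = "cv"))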
## note: only 2 possible values of the max tree depth from the initial fit.
## Truncating the grid to 2 .
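For intuition about what the cross-validated tuning is doing, here is a rough sketch of k-fold cross-validation over the maximum tree depth (the parameter that the rpart2 method tunes), written with rpart and base R only. It is not how caret is implemented internally; the fold assignment, the seed, and the candidate depths 1 to 3 are assumptions chosen for illustration.

```r
# Sketch only: manual 10-fold cross-validation over rpart's maxdepth.
library(rpart)
data(iris)

set.seed(42)  # arbitrary seed so the fold assignment is repeatable
k <- 10
# Assign every row to one of k folds at random.
folds <- sample(rep(seq_len(k), length.out = nrow(iris)))
depths <- 1:3  # hypothetical candidate depths

cv_accuracy <- sapply(depths, function(d) {
  fold_acc <- sapply(seq_len(k), function(f) {
    train_rows <- iris[folds != f, ]
    test_rows  <- iris[folds == f, ]
    fit <- rpart(Species ~ ., data = train_rows, method = "class",
                 control = rpart.control(maxdepth = d))
    pred <- predict(fit, newdata = test_rows, type = "class")
    mean(pred == test_rows$Species)  # accuracy on the held-out fold
  })
  mean(fold_acc)  # average accuracy across the k folds
})
names(cv_accuracy) <- paste0("maxdepth=", depths)
print(cv_accuracy)
```

The depth with the best average held-out accuracy is the one a tuning procedure of this kind would keep for the final model, which is then refit on all of the data.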
Once the model is trained, we can evaluate the rules and configure the rules system based on the output.
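One way to turn a fitted tree into explicit rule text is to walk the path from the root to every leaf. The sketch below uses `path.rpart()` from the rpart package on a tree fit directly with `rpart()` so that it runs on its own; with the caret model, `rPartMdl$finalModel` would presumably take the place of `fit`. The IF/THEN wording and the variable names are illustrative assumptions, not the format of any particular rules engine.

```r
# Sketch only: list one IF/THEN rule per leaf of an rpart classification tree.
library(rpart)
data(iris)

fit <- rpart(Species ~ ., data = iris, method = "class")

# Leaves are the rows of the frame whose split variable is "<leaf>";
# the row names of the frame are the node numbers.
leaf_rows <- fit$frame$var == "<leaf>"
leaf_ids  <- as.numeric(rownames(fit$frame)[leaf_rows])

# For each leaf, path.rpart() returns the split conditions from the root down.
paths <- path.rpart(fit, nodes = leaf_ids, print.it = FALSE)

# For classification trees, yval indexes into the class levels.
predicted <- attr(fit, "ylevels")[fit$frame$yval[leaf_rows]]

rules <- vapply(seq_along(paths), function(i) {
  conditions <- paths[[i]][-1]  # drop the leading "root" entry
  paste0("IF ", paste(conditions, collapse = " AND "),
         " THEN Species = ", predicted[i])
}, character(1))

writeLines(rules)
```

Each printed line corresponds to one leaf of the tree, so the set of lines is a complete, mutually exclusive rule list that could be copied into a rules system.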
A standard plot of the tree:
par(mfrow = c(1, 1), mar = c(5,4,4,2), oma = c(0, 0, 0, 0))
library(partykit)
## Loading required package: grid
##
## Attaching package: 'partykit'
##
## The following object is masked from 'package:grid':
##
## depth
rulesTree <- as.party(rPartMdl$finalModel)
plot( rulesTree)
A pretty version from the rattle package:
par(mfrow = c(1, 1), mar = c(5,4,4,2), oma = c(0, 0, 0, 0))
library(rattle)
## Rattle: A free graphical interface for data mining with R.
## Version 3.1.0 Copyright (c) 2006-2014 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
fancyRpartPlot(rPartMdl$finalModel)
## Loading required package: rpart.plot
## Loading required package: RColorBrewer
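To check that the rules behave as expected, a fitted tree can also be applied to a new observation. The sketch below again fits the tree directly with `rpart()` so that it is self-contained; the measurements in `new_flower` are made-up values for illustration, and a depth limit of 2 is assumed here.

```r
# Sketch only: score one hypothetical new observation with the fitted tree.
library(rpart)
data(iris)

fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(maxdepth = 2))

# Hypothetical measurements (in cm) for a flower that is not in the data set.
new_flower <- data.frame(Sepal.Length = 5.0, Sepal.Width = 3.4,
                         Petal.Length = 1.5, Petal.Width = 0.2)

# Predicted category and the class proportions of the leaf it falls into.
predicted_class <- predict(fit, newdata = new_flower, type = "class")
class_probs     <- predict(fit, newdata = new_flower, type = "prob")

print(predicted_class)
print(class_probs)
```

In a business setting the same call would score an incoming record against the learned rules, with the class proportions giving a rough measure of how pure the matching rule is.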
Enjoy!
Note: the full source code for this example can be found on my GitHub account at https://github.com/bohoro/AutomaticRuleGeneration