Association Rule

Introduction

The digitalization of financial services, enabled by the widespread adoption of smart devices, cloud computing, and data accessibility, has intensified competition in the financial industry. Financial institutions around the globe maintain huge amounts of electronic data. The sheer size of these databases makes it difficult for institutions to analyze them and retrieve the information that decision makers need. One of the most important data mining techniques is association rule mining, whose main purpose is to find frequent patterns, associations, and relationships between database items using different algorithms.

This paper explores the use of this technique on a bank marketing data set. The data set describes a marketing campaign (phone calls) of a Portuguese banking institution. We apply association rule mining to find associations between the marketing campaign and product uptake (whether or not the client subscribes to a bank term deposit). The rules can help the bank shape strategies to improve future marketing campaigns.

About the Data

Attribute information for some of the columns in the data set

• Job: type of job (‘admin.’, ‘blue-collar’, ‘management’, ‘retired’, ‘services’, ‘technician’, ‘other’)

• Marital: marital status (‘divorced’, ‘married’, ‘single’)

• Education: education level (‘primary’, ‘secondary’, ‘tertiary’, ‘unknown’)

• Default: has credit in default? (‘no’, ‘yes’)

• Housing: has housing loan? (‘no’, ‘yes’)

• Loan: has personal loan? (‘no’, ‘yes’)

• Contact: contact communication type (‘cellular’, ‘telephone’, ‘unknown’)

• Duration: contact duration in seconds.

• Campaign: number of contacts performed during this campaign.

• Previous: number of contacts performed before this campaign.

• Poutcome: outcome of the previous marketing campaign (‘failure’, ‘other’, ‘success’, ‘unknown’)

Load the libraries to begin the analysis

#Load the libraries 
suppressPackageStartupMessages(library(arules))

## Warning: package 'arules' was built under R version 4.1.2

suppressPackageStartupMessages(library(arulesViz))

## Warning: package 'arulesViz' was built under R version 4.1.2

suppressPackageStartupMessages(library(plotly))

## Warning: package 'ggplot2' was built under R version 4.1.2

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))

Data Preprocessing

The numeric fields ‘age’, ‘balance’, ‘duration’, ‘campaign’ and ‘previous’ had to be discretized so that they could be used in the Apriori algorithm. The remaining numeric fields, ‘day’ and ‘pdays’, were dropped instead, as described further down.

#Read in the data to be preprocessed
bank_data <- read.csv("bank.csv")
bd <- bank_data

Age was binned into five ranges that match the breaks used in the code: youth (18-24), young_adult (25-35), middle_aged_adult (36-50), senior_age (51-65), and old (66-95). The histogram shows the frequency distribution of the continuous variable age, overlaid with a density plot of the same variable.

ggplot(bank_data, aes(age)) + geom_histogram(aes(y=..density..),color="black", fill="skyblue") + 
scale_x_continuous(breaks=c(18,24,35,50,65,95))+ theme_classic() + ggtitle('Age Distribution') + geom_density(alpha=.2, fill='black',color="#FF6666", linetype=0)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

bd$age <- cut(bd$age, breaks=c(18,24,35,50,65,Inf), labels=c('youth','young_adult','middle_aged_adult','senior_age','old'),include.lowest = TRUE)
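As a quick check of how these breaks treat boundary ages, the sketch below applies the same cut() call to a few hand-picked ages. The ages vector is made up for illustration and is not taken from the data. With the default right-closed intervals and include.lowest = TRUE, an age of 50 lands in middle_aged_adult and 51 in senior_age.

```r
# Hypothetical boundary ages, chosen only to show where each bin starts and ends.
ages <- c(18, 24, 25, 35, 36, 50, 51, 65, 66, 95)

age_bins <- cut(ages,
                breaks = c(18, 24, 35, 50, 65, Inf),
                labels = c('youth', 'young_adult', 'middle_aged_adult', 'senior_age', 'old'),
                include.lowest = TRUE)

# Pair each age with the bin it lands in.
data.frame(age = ages, bin = age_bins)
```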

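The fields that contained “yes” or “no” values (‘default’, ‘housing’, ‘loan’, ‘deposit’) were recoded to “variable name=yes/no”. Because each item is also prefixed with its column name when the rules are mined, these values appear in the rules as “variable name=variable name=yes/no” (for example, deposit=deposit=yes), so it is clear which attribute a yes or no belongs to.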

#change variable names with yes/no.
bd$default <- dplyr::recode(bd$default, no='default=no',yes='default=yes',unknown='default=unknown')
bd$housing <- dplyr::recode(bd$housing, no='housing=no',yes='housing=yes',unknown='housing=unknown')
bd$loan <- dplyr::recode(bd$loan, no='loan=no',yes='loan=yes',unknown='loan=unknown')
bd$deposit <- dplyr::recode(bd$deposit, no='deposit=no',yes='deposit=yes')
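The doubled prefix seen later in the rules (for example deposit=deposit=yes) follows from how items are named when a data frame of factors is turned into transactions: each item is labelled column=level. The base R sketch below only mimics that naming on a made-up two-column data frame; toy and item_labels are illustrative names, and the code does not call arules.

```r
# Made-up data frame with one recoded column and one untouched column.
toy <- data.frame(
  deposit = factor(c('deposit=yes', 'deposit=no')),
  contact = factor(c('cellular', 'unknown'))
)

# Build "column=level" labels for every level of every column.
item_labels <- unlist(lapply(names(toy), function(col) {
  paste(col, levels(toy[[col]]), sep = '=')
}))

item_labels
```

Leaving the yes/no values unrecoded would give shorter labels such as deposit=yes, which is a design choice rather than a correctness issue.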

The bank balance was put into three ranges. A histogram was used to check the distribution of the data, taking the minimum and maximum values into account. The three categories were: -$6847 to $0, $0 to $4000, and $4000 to $81204.

ggplot(bank_data, aes(balance)) + geom_histogram(color="black", fill="#2E8BC0") + 
  scale_x_continuous(breaks=c(-6847,0,4000,10000))+theme_classic() + ggtitle('Bank Balance Distribution')

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

min_balance <- min(bd$balance)
bd$balance = cut(bd$balance, breaks=c(min_balance, 0,4000,Inf), labels= c('-veBal','0-$4000', '$4000+'),include.lowest = TRUE)

The duration column gives the length of the call in seconds. It was converted into three categories of call time in minutes: ‘less than 10 minutes’, ‘between 10 and 20 minutes’, and ‘more than 20 minutes’ (labelled ‘<20+mins’ in the code and output). A boxplot was plotted to check the distribution and outliers of the variable duration.

ggplot(bank_data, aes(duration,color=duration)) + 
  geom_boxplot(color="black", fill="#2E8BC0",outlier.color='#145DA0',outlier.alpha=0.2) +
  theme_classic() + ggtitle('Duration of a phone call in seconds')

min_duration <- min(bd$duration)
bd$duration <- cut(bd$duration, breaks = c(min_duration, 600,1200,Inf),labels= c('<10mins','10-20mins', '<20+mins'),include.lowest = TRUE)

Campaign shows the number of contacts performed during this campaign. The continuous variable was transformed into three categories: “1-3times” if the client was contacted 3 or fewer times, “4-6times” if contact was made 4 to 6 times, and “+6times” if the client was contacted more than 6 times. A boxplot was plotted to check the distribution and outliers of the variable campaign.

ggplot(bd, aes(campaign,color=campaign)) + 
  geom_boxplot(color="black", fill="#2E8BC0",outlier.color='#145DA0',outlier.alpha=0.2) +
  theme_classic() + ggtitle('Number of times client contacted')

min_campaign <- min(bd$campaign)
bd$campaign <- cut(bd$campaign, breaks = c(min_campaign, 3,6, Inf), labels= c("1-3times", "4-6times","+6times"), include.lowest = TRUE)

The column ‘previous’ shows the number of contacts performed before this campaign, and ‘pdays’ shows the number of days that passed since the client was last contacted in a previous campaign. A value of 0 for ‘previous’ means the client was never contacted before and corresponds to -1 for ‘pdays’. For this reason, the pdays column was discarded and the previous column was transformed into two categories: PriorContact if the client was contacted in a previous campaign and NoPriorContact if no contact was made.

bd$previous <- cut(bd$previous, breaks= c(-Inf,0,Inf), labels= c('NoPriorContact','PriorContact'))
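The claim that previous = 0 always goes with pdays = -1 can be checked before pdays is dropped. A minimal sketch on a made-up data frame follows; toy is hypothetical, and on the real data the same comparison would be run on bank_data.

```r
# Made-up rows: two clients never contacted before, two contacted earlier.
toy <- data.frame(previous = c(0, 0, 2, 5),
                  pdays    = c(-1, -1, 91, 182))

# TRUE when "no previous contact" and "pdays == -1" flag exactly the same rows.
same_rows <- all((toy$previous == 0) == (toy$pdays == -1))
same_rows

# The two-level recoding: 0 becomes NoPriorContact, anything above 0 becomes PriorContact.
prior <- cut(toy$previous, breaks = c(-Inf, 0, Inf),
             labels = c('NoPriorContact', 'PriorContact'))
table(prior)
```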

The columns pdays, day and month were discarded from the data set before mining association rules. This is because these columns strongly affect the output target, so they were removed in order to have a realistic predictive model. Character variables in the data set were converted into factors so that association rule mining could be applied.

# Drop the pdays, month and day columns, then convert character columns to factors
bd <- select(bd, -pdays)
bd <- select(bd, -month)
bd <- select(bd,-day)
bd_test <- as.data.frame(unclass(bd), stringsAsFactors = TRUE)

Summary statistics

The data set used in this study consists of 11162 transactions from the “Bank Data set”. There are no missing values. In the job column, 1794 records fall under (Other), which is how summary() groups the levels beyond the six most frequent, and 497 records have an unknown education level. In poutcome, 8326 and 537 records are classified as unknown and other respectively, while 2346 records in the contact column are classified as unknown. The detailed summary statistics are shown below.

summary(bd_test)

##                 age                job           marital         education   
##  youth            : 282   management :2566   divorced:1293   primary  :1500  
##  young_adult      :4089   blue-collar:1944   married :6351   secondary:5476  
##  middle_aged_adult:4320   technician :1823   single  :3518   tertiary :3689  
##  senior_age       :2073   admin.     :1334                   unknown  : 497  
##  old              : 398   services   : 923                                   
##                           retired    : 778                                   
##                           (Other)    :1794                                   
##         default         balance            housing           loan     
##  default=no :10994   -veBal :1462   housing=no :5881   loan=no :9702  
##  default=yes:  168   0-$4000:8622   housing=yes:5281   loan=yes:1460  
##                      $4000+ :1078                                     
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##       contact          duration        campaign              previous   
##  cellular :8042   <10mins  :9010   1-3times:9147   NoPriorContact:8324  
##  telephone: 774   10-20mins:1772   4-6times:1414   PriorContact  :2838  
##  unknown  :2346   <20+mins : 380   +6times : 601                        
##                                                                         
##                                                                         
##                                                                         
##                                                                         
##     poutcome           deposit    
##  failure:1228   deposit=no :5873  
##  other  : 537   deposit=yes:5289  
##  success:1071                     
##  unknown:8326                     
##                                   
##                                   
##

Item Frequency

Item frequency shows the items that appear most often in the data set. Most clients have not defaulted on their credit obligations (default=no), followed by loan=no, the clients who do not have a personal loan.

#item frequency
bd_matrix1 <- as(bd_test, "transactions")
itemFrequencyPlot(bd_matrix1, topN=25, type="relative", main="Item frequency", col="#2E8BC0")

Apriori algorithm

The Apriori algorithm is used. Because the minimum support level is defined at the beginning, Apriori reduces the number of itemsets and rules considered in the analysis. The Apriori heuristic relies on the fact that if an itemset is frequent (meets the minimum support condition), every item it contains is frequent too, and conversely that if a given item is not frequent, no itemset containing it can be frequent.
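The pruning idea can be sketched in base R as a level-wise search over a handful of made-up transactions. This is a simplified illustration, not how arules implements Apriori; the transactions, the min_count threshold and all variable names are assumptions chosen for the example.

```r
# Made-up transactions: each element is the set of items bought together.
transactions <- list(
  c("a", "b", "c"),
  c("a", "b"),
  c("a", "c"),
  c("b", "c"),
  c("a", "b", "c", "d")
)
min_count <- 3  # minimum support expressed as a count (3 of 5 transactions)

# Number of transactions that contain every item of the itemset.
support_count <- function(itemset) {
  sum(vapply(transactions, function(t) all(itemset %in% t), logical(1)))
}

# Level 1: frequent single items.
items <- sort(unique(unlist(transactions)))
frequent <- Filter(function(s) support_count(s) >= min_count, as.list(items))
all_frequent <- frequent

k <- 2
while (length(frequent) > 0) {
  # Only items that survive in frequent (k-1)-itemsets can form k-itemsets.
  pool <- sort(unique(unlist(frequent)))
  if (length(pool) < k) break
  candidates <- combn(pool, k, simplify = FALSE)

  # Prune: keep a candidate only if all of its (k-1)-subsets are frequent.
  frequent_keys <- vapply(frequent, paste, character(1), collapse = ",")
  candidates <- Filter(function(cand) {
    subsets <- combn(cand, k - 1, simplify = FALSE)
    all(vapply(subsets, paste, character(1), collapse = ",") %in% frequent_keys)
  }, candidates)

  # Count support only for the candidates that survived pruning.
  frequent <- Filter(function(s) support_count(s) >= min_count, candidates)
  all_frequent <- c(all_frequent, frequent)
  k <- k + 1
}

frequent_labels <- vapply(all_frequent, paste, character(1), collapse = ",")
frequent_labels
```

Item d appears in only one transaction, so it is dropped at the first level and no itemset containing it is ever counted.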

Association rules define a relationship between the occurrence of two or more items. They are characterized by a few measures.

Support measures how often the joint itemset of a rule appears in the database, as a fraction of all transactions.

The confidence of a rule measures how likely the consequent is to occur given the antecedent. It is the ratio of the support of the antecedent and consequent together to the support of the antecedent item (or items) alone.

Lift is the most important measurement, creating levels of “interesting-ness” by determining whether items influence one another or whether their co-occurrence is simply what would be expected. It is the confidence of the rule divided by the support of the consequent. The higher it is, the stronger the co-occurrence of X and Y relative to what independence would give. Lift values higher than one indicate a positive relationship between items or itemsets, and values lower than one indicate a negative relationship.
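The three measures can be written as a small function of transaction counts. The sketch below uses hypothetical names (rule_measures, n_xy, n_x, n_y, n) and made-up counts; arules reports the same quantities in the support, confidence and lift columns of its output.

```r
# n_xy: transactions containing both antecedent X and consequent Y
# n_x : transactions containing X
# n_y : transactions containing Y
# n   : all transactions
rule_measures <- function(n_xy, n_x, n_y, n) {
  support    <- n_xy / n            # share of all transactions matching the whole rule
  confidence <- n_xy / n_x          # share of X-transactions that also contain Y
  lift       <- confidence / (n_y / n)  # confidence relative to the base rate of Y
  c(support = support, confidence = confidence, lift = lift)
}

# Made-up counts: 100 transactions, X in 20, Y in 40, both in 16.
m <- rule_measures(n_xy = 16, n_x = 20, n_y = 40, n = 100)
m
```

With these counts, Y occurs in 80% of the transactions that contain X but in only 40% of all transactions, so the lift is about 2.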

Creating the rules (general)

When creating the rules for this analysis, it was important to gather rules that tell the business something useful about its clients and what makes them subscribe or not subscribe to a term deposit. Initially, rules were mined over the entire data set to see whether any deposit item appeared in the top 20 rules. These rules were created with the following parameters: minimum support = 0.01, minimum confidence = 95%, and a maximum of 5 items per rule. A graph was then produced to see which items were part of the biggest rule sets and to get an overview.

rules<- apriori(bd_test, parameter = list(supp=0.01, conf=0.95, maxlen = 5))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.95    0.1    1 none FALSE            TRUE       5    0.01      1
##  maxlen target  ext
##       5  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 111 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[50 item(s), 11162 transaction(s)] done [0.01s].
## sorting and recoding items ... [49 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5

## Warning in apriori(bd_test, parameter = list(supp = 0.01, conf = 0.95, maxlen
## = 5)): Mining stopped (maxlen reached). Only patterns up to a length of 5
## returned!

##  done [0.16s].
## writing ... [34275 rule(s)] done [0.01s].
## creating S4 object  ... done [0.02s].

rules<- sort(rules, decreasing = FALSE, by ="lift")
plot(rules, measure = c("support", "lift"), shading = "confidence", main = "Support, Lift, and Confidence Top Rules")

## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.

generalRules20<- head(rules, n = 20, by = "lift")
inspect(generalRules20[1:10], linebreak = FALSE)

##      lhs                                        rhs                    
## [1]  {poutcome=other}                        => {previous=PriorContact}
## [2]  {poutcome=success}                      => {previous=PriorContact}
## [3]  {poutcome=failure}                      => {previous=PriorContact}
## [4]  {job=management, poutcome=other}        => {previous=PriorContact}
## [5]  {marital=single, poutcome=other}        => {previous=PriorContact}
## [6]  {education=tertiary, poutcome=other}    => {previous=PriorContact}
## [7]  {age=young_adult, poutcome=other}       => {previous=PriorContact}
## [8]  {age=middle_aged_adult, poutcome=other} => {previous=PriorContact}
## [9]  {housing=housing=yes, poutcome=other}   => {previous=PriorContact}
## [10] {poutcome=other, deposit=deposit=yes}   => {previous=PriorContact}
##      support    confidence coverage   lift     count
## [1]  0.04810966 1          0.04810966 3.933051  537 
## [2]  0.09595055 1          0.09595055 3.933051 1071 
## [3]  0.11001613 1          0.11001613 3.933051 1228 
## [4]  0.01263214 1          0.01263214 3.933051  141 
## [5]  0.01845547 1          0.01845547 3.933051  206 
## [6]  0.01720122 1          0.01720122 3.933051  192 
## [7]  0.01997850 1          0.01997850 3.933051  223 
## [8]  0.01675327 1          0.01675327 3.933051  187 
## [9]  0.02383085 1          0.02383085 3.933051  266 
## [10] 0.02750403 1          0.02750403 3.933051  307

plot(generalRules20, method = "graph")

Association Rule

Rules for deposit = yes

After gathering general rules about the data set, the right-hand side (RHS) was set to either “deposit=deposit=yes” or “deposit=deposit=no” to examine what kind of clients subscribed to a term deposit after the campaign. The first set of rules was based on clients that subscribed, with support set to 0.01 and confidence to 85%. Furthermore, the rules were limited to a maximum of six items per rule to reduce redundancy from overly long rules.

#rules by confidence
yesrules<- apriori(bd_test, parameter = list(supp=0.01, conf = 0.85, maxlen = 6), 
                   appearance =list(default = "lhs", rhs="deposit=deposit=yes"),
                  control=list(verbose=F))
summary(yesrules)

## set of 2810 rules
## 
## rule length distribution (lhs + rhs):sizes
##    2    3    4    5    6 
##    2   45  284  863 1616 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    5.00    6.00    5.44    6.00    6.00 
## 
## summary of quality measures:
##     support          confidence        coverage            lift      
##  Min.   :0.01003   Min.   :0.8500   Min.   :0.01030   Min.   :1.794  
##  1st Qu.:0.01245   1st Qu.:0.8737   1st Qu.:0.01380   1st Qu.:1.844  
##  Median :0.01684   Median :0.9031   Median :0.01863   Median :1.906  
##  Mean   :0.02166   Mean   :0.8984   Mean   :0.02407   Mean   :1.896  
##  3rd Qu.:0.02517   3rd Qu.:0.9187   3rd Qu.:0.02804   3rd Qu.:1.939  
##  Max.   :0.10115   Max.   :0.9783   Max.   :0.11826   Max.   :2.065  
##      count       
##  Min.   : 112.0  
##  1st Qu.: 139.0  
##  Median : 188.0  
##  Mean   : 241.7  
##  3rd Qu.: 281.0  
##  Max.   :1129.0  
## 
## mining info:
##     data ntransactions support confidence
##  bd_test         11162    0.01       0.85
##                                                                                                                                                                           call
##  apriori(data = bd_test, parameter = list(supp = 0.01, conf = 0.85, maxlen = 6), appearance = list(default = "lhs", rhs = "deposit=deposit=yes"), control = list(verbose = F))

With this chunk of code, we created 2810 rules. The summary also gives the lift statistics for all the rules combined. Lift, as a reminder, is the dependency measure that compares how often X and Y occur together with what would be expected if they were independent. The minimum lift is 1.794, which is above 1, so all the rules show a positive dependency.
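Because every rule here has the same right-hand side, lift is simply confidence divided by the share of clients with deposit=yes, which the summary table gives as 5289 of 11162. The arithmetic below uses only figures reported in this document. It reproduces the lift of the top rule and shows that a minimum lift of about 1.794 follows directly from the 0.85 confidence threshold.

```r
n_total   <- 11162   # transactions in the data set
n_deposit <- 5289    # clients with deposit=yes (from the summary table)
base_rate <- n_deposit / n_total

# Top rule by confidence: count 135, coverage 0.01236338 * 11162 = 138,
# so 135 of the 138 matching clients subscribed.
top_confidence <- 135 / 138
top_lift <- top_confidence / base_rate

# Smallest lift any rule can have once confidence >= 0.85 is required.
min_possible_lift <- 0.85 / base_rate

round(c(base_rate = base_rate,
        top_lift = top_lift,
        min_possible_lift = min_possible_lift), 3)
```

Since just under half of the clients in this data set subscribed, any rule that passes the confidence filter necessarily has a lift well above 1.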

yesrules by Confidence

#yesrules by confidence
conf_yesrules<- sort(yesrules , by ="confidence", decreasing = T )
inspect(conf_yesrules[1:10], linebreak = FALSE)

##      lhs                                                                                       
## [1]  {age=senior_age, marital=married, housing=housing=no, campaign=1-3times, poutcome=success}
## [2]  {age=senior_age, marital=married, housing=housing=no, contact=cellular, poutcome=success} 
## [3]  {age=senior_age, marital=married, balance=0-$4000, housing=housing=no, poutcome=success}  
## [4]  {duration=10-20mins, poutcome=success}                                                    
## [5]  {duration=10-20mins, previous=PriorContact, poutcome=success}                             
## [6]  {default=default=no, duration=10-20mins, poutcome=success}                                
## [7]  {default=default=no, duration=10-20mins, previous=PriorContact, poutcome=success}         
## [8]  {loan=loan=no, duration=10-20mins, poutcome=success}                                      
## [9]  {loan=loan=no, duration=10-20mins, previous=PriorContact, poutcome=success}               
## [10] {default=default=no, loan=loan=no, duration=10-20mins, poutcome=success}                  
##         rhs                   support    confidence coverage   lift     count
## [1]  => {deposit=deposit=yes} 0.01209461 0.9782609  0.01236338 2.064539 135  
## [2]  => {deposit=deposit=yes} 0.01146748 0.9770992  0.01173625 2.062088 128  
## [3]  => {deposit=deposit=yes} 0.01110912 0.9763780  0.01137789 2.060565 124  
## [4]  => {deposit=deposit=yes} 0.01075076 0.9756098  0.01101953 2.058944 120  
## [5]  => {deposit=deposit=yes} 0.01075076 0.9756098  0.01101953 2.058944 120  
## [6]  => {deposit=deposit=yes} 0.01075076 0.9756098  0.01101953 2.058944 120  
## [7]  => {deposit=deposit=yes} 0.01075076 0.9756098  0.01101953 2.058944 120  
## [8]  => {deposit=deposit=yes} 0.01021322 0.9743590  0.01048199 2.056305 114  
## [9]  => {deposit=deposit=yes} 0.01021322 0.9743590  0.01048199 2.056305 114  
## [10] => {deposit=deposit=yes} 0.01021322 0.9743590  0.01048199 2.056305 114

subrules <- head(conf_yesrules,8)
plot(subrules, method="graph", interactive=FALSE)

## Warning in plot.rules(subrules, method = "graph", interactive = FALSE): The
## parameter interactive is deprecated. Use engine='interactive' instead.

The highest confidence levels achieved by rules in the data set exceed 97%. The first two rules imply that clients who are in the senior age category, are married, do not have a housing loan and had a successful previous campaign outcome, and who were contacted 1-3 times (rule 1) or by cellular (rule 2), subscribed to a term deposit with 97.83% and 97.71% confidence respectively. Unfortunately, the support of these two rules is only around 1%, so they do not occur often: 135 and 128 times.

yesrules by Support

#yesrules by support
supp_yesrules <- sort(yesrules, by='support', decreasing=TRUE)
inspect(supp_yesrules[1:10], linebreak = FALSE)

##      lhs                                                                     
## [1]  {contact=cellular, duration=10-20mins}                                  
## [2]  {default=default=no, contact=cellular, duration=10-20mins}              
## [3]  {poutcome=success}                                                      
## [4]  {previous=PriorContact, poutcome=success}                               
## [5]  {default=default=no, poutcome=success}                                  
## [6]  {default=default=no, previous=PriorContact, poutcome=success}           
## [7]  {loan=loan=no, contact=cellular, duration=10-20mins}                    
## [8]  {default=default=no, loan=loan=no, contact=cellular, duration=10-20mins}
## [9]  {loan=loan=no, poutcome=success}                                        
## [10] {loan=loan=no, previous=PriorContact, poutcome=success}                 
##         rhs                   support    confidence coverage   lift     count
## [1]  => {deposit=deposit=yes} 0.10114675 0.8553030  0.11825838 1.805047 1129 
## [2]  => {deposit=deposit=yes} 0.09935495 0.8557099  0.11610822 1.805905 1109 
## [3]  => {deposit=deposit=yes} 0.08761871 0.9131653  0.09595055 1.927160  978 
## [4]  => {deposit=deposit=yes} 0.08761871 0.9131653  0.09595055 1.927160  978 
## [5]  => {deposit=deposit=yes} 0.08761871 0.9131653  0.09595055 1.927160  978 
## [6]  => {deposit=deposit=yes} 0.08761871 0.9131653  0.09595055 1.927160  978 
## [7]  => {deposit=deposit=yes} 0.08663322 0.8512324  0.10177388 1.796456  967 
## [8]  => {deposit=deposit=yes} 0.08564773 0.8512912  0.10060921 1.796580  956 
## [9]  => {deposit=deposit=yes} 0.08367676 0.9156863  0.09138147 1.932481  934 
## [10] => {deposit=deposit=yes} 0.08367676 0.9156863  0.09138147 1.932481  934

subrules <- head(supp_yesrules,10)
plot(subrules, method="graph", interactive=FALSE)

## Warning in plot.rules(subrules, method = "graph", interactive = FALSE): The
## parameter interactive is deprecated. Use engine='interactive' instead.

The highest support level achieved by the rules in the analyzed data set is 10.11%. It refers to clients who were contacted by cellphone for a duration of 10-20 minutes and subscribed to a term deposit. This occurs for 10.11% of all 11162 clients, 1129 times.

yesrules by Lift

lift_yesrules <- sort(yesrules, by='lift', decreasing=TRUE)
inspect(lift_yesrules[1:10], linebreak = FALSE)

##      lhs                                                                                       
## [1]  {age=senior_age, marital=married, housing=housing=no, campaign=1-3times, poutcome=success}
## [2]  {age=senior_age, marital=married, housing=housing=no, contact=cellular, poutcome=success} 
## [3]  {age=senior_age, marital=married, balance=0-$4000, housing=housing=no, poutcome=success}  
## [4]  {duration=10-20mins, poutcome=success}                                                    
## [5]  {duration=10-20mins, previous=PriorContact, poutcome=success}                             
## [6]  {default=default=no, duration=10-20mins, poutcome=success}                                
## [7]  {default=default=no, duration=10-20mins, previous=PriorContact, poutcome=success}         
## [8]  {loan=loan=no, duration=10-20mins, poutcome=success}                                      
## [9]  {loan=loan=no, duration=10-20mins, previous=PriorContact, poutcome=success}               
## [10] {default=default=no, loan=loan=no, duration=10-20mins, poutcome=success}                  
##         rhs                   support    confidence coverage   lift     count
## [1]  => {deposit=deposit=yes} 0.01209461 0.9782609  0.01236338 2.064539 135  
## [2]  => {deposit=deposit=yes} 0.01146748 0.9770992  0.01173625 2.062088 128  
## [3]  => {deposit=deposit=yes} 0.01110912 0.9763780  0.01137789 2.060565 124  
## [4]  => {deposit=deposit=yes} 0.01075076 0.9756098  0.01101953 2.058944 120  
## [5]  => {deposit=deposit=yes} 0.01075076 0.9756098  0.01101953 2.058944 120  
## [6]  => {deposit=deposit=yes} 0.01075076 0.9756098  0.01101953 2.058944 120  
## [7]  => {deposit=deposit=yes} 0.01075076 0.9756098  0.01101953 2.058944 120  
## [8]  => {deposit=deposit=yes} 0.01021322 0.9743590  0.01048199 2.056305 114  
## [9]  => {deposit=deposit=yes} 0.01021322 0.9743590  0.01048199 2.056305 114  
## [10] => {deposit=deposit=yes} 0.01021322 0.9743590  0.01048199 2.056305 114

subrules <- head(lift_yesrules,10)
plot(subrules, method="graph", interactive=FALSE)

## Warning in plot.rules(subrules, method = "graph", interactive = FALSE): The
## parameter interactive is deprecated. Use engine='interactive' instead.

Rules can also be compared by lift. The highest lift values among the rules are 2.064 and 2.062, and they belong to the same two rules that lead the confidence ranking: clients who are in the senior age category, are married, do not have a housing loan and had a successful previous outcome, contacted 1-3 times (rule 1) or by cellular (rule 2), subscribe to a term deposit. The confidence levels of these two rules are 97.83% and 97.71%.

Rules for deposit=deposit=no

norules<- apriori(bd_test, parameter = list(supp=0.01, conf = 0.90, maxlen = 6), 
                  appearance =list(default = "lhs", rhs="deposit=deposit=no"),
                  control=list(verbose=F))

#norules by Confidence
conf_norules<- sort(norules, by ="confidence", decreasing = TRUE)
inspect(conf_norules[1:5], linebreak = FALSE)

##     lhs                                                                                                 
## [1] {education=secondary, contact=unknown, duration=<10mins, campaign=4-6times}                         
## [2] {education=secondary, contact=unknown, duration=<10mins, campaign=4-6times, previous=NoPriorContact}
## [3] {education=secondary, contact=unknown, duration=<10mins, campaign=4-6times, poutcome=unknown}       
## [4] {age=middle_aged_adult, default=default=no, balance=-veBal, contact=unknown, duration=<10mins}      
## [5] {age=middle_aged_adult, balance=-veBal, contact=unknown, duration=<10mins, campaign=1-3times}       
##        rhs                  support    confidence coverage   lift     count
## [1] => {deposit=deposit=no} 0.01003404 0.9655172  0.01039240 1.835025 112  
## [2] => {deposit=deposit=no} 0.01003404 0.9655172  0.01039240 1.835025 112  
## [3] => {deposit=deposit=no} 0.01003404 0.9655172  0.01039240 1.835025 112  
## [4] => {deposit=deposit=no} 0.01227379 0.9647887  0.01272173 1.833641 137  
## [5] => {deposit=deposit=no} 0.01182584 0.9635036  0.01227379 1.831198 132

subrules <- head(conf_norules,8)
plot(subrules, method="graph", interactive=FALSE)

## Warning in plot.rules(subrules, method = "graph", interactive = FALSE): The
## parameter interactive is deprecated. Use engine='interactive' instead.

The highest confidence levels achieved by norules in the data set exceed 96%. The first rule implies that clients with a secondary education level who were contacted 4 to 6 times during the campaign, with a call duration of less than 10 minutes and an unknown contact type, did not subscribe to a term deposit with 96.55% confidence. The support of this rule is around 1%, so it occurs 112 times. Rule [4] implies that clients who are in the middle-age category, have no credit in default, have a negative bank balance and were contacted for less than 10 minutes with an unknown contact type did not subscribe to a term deposit with 96.48% confidence. The support of this rule is around 1.23%, so it occurs 137 times.

#norules by support
supp_norules <- sort(norules, by='support', decreasing=TRUE)
inspect(supp_norules[1:5], linebreak = FALSE)

##     lhs                                                                           
## [1] {contact=unknown, duration=<10mins}                                           
## [2] {contact=unknown, duration=<10mins, previous=NoPriorContact}                  
## [3] {contact=unknown, duration=<10mins, poutcome=unknown}                         
## [4] {contact=unknown, duration=<10mins, previous=NoPriorContact, poutcome=unknown}
## [5] {default=default=no, contact=unknown, duration=<10mins}                       
##        rhs                  support   confidence coverage  lift     count
## [1] => {deposit=deposit=no} 0.1545422 0.9146341  0.1689661 1.738319 1725 
## [2] => {deposit=deposit=no} 0.1539151 0.9201928  0.1672639 1.748883 1718 
## [3] => {deposit=deposit=no} 0.1539151 0.9201928  0.1672639 1.748883 1718 
## [4] => {deposit=deposit=no} 0.1539151 0.9201928  0.1672639 1.748883 1718 
## [5] => {deposit=deposit=no} 0.1506003 0.9160763  0.1643971 1.741060 1681

subrules <- head(supp_norules,6)
plot(subrules, method="graph", interactive=FALSE)

## Warning in plot.rules(subrules, method = "graph", interactive = FALSE): The
## parameter interactive is deprecated. Use engine='interactive' instead.

Rule [2] has the second highest support among the rules in the analyzed data set, 15.39%. It states that clients who were contacted for less than 10 minutes, with an unknown contact type and no prior contact in a previous campaign, are likely not to subscribe to a term deposit. This pattern occurs for 15.39% of all 11162 clients, 1718 times.

#norules by lift
lift_norules <- sort(norules, by='lift', decreasing=TRUE)
inspect(lift_norules[1:5], linebreak = FALSE)

##     lhs                                                                                                 
## [1] {education=secondary, contact=unknown, duration=<10mins, campaign=4-6times}                         
## [2] {education=secondary, contact=unknown, duration=<10mins, campaign=4-6times, previous=NoPriorContact}
## [3] {education=secondary, contact=unknown, duration=<10mins, campaign=4-6times, poutcome=unknown}       
## [4] {age=middle_aged_adult, default=default=no, balance=-veBal, contact=unknown, duration=<10mins}      
## [5] {age=middle_aged_adult, balance=-veBal, contact=unknown, duration=<10mins, campaign=1-3times}       
##        rhs                  support    confidence coverage   lift     count
## [1] => {deposit=deposit=no} 0.01003404 0.9655172  0.01039240 1.835025 112  
## [2] => {deposit=deposit=no} 0.01003404 0.9655172  0.01039240 1.835025 112  
## [3] => {deposit=deposit=no} 0.01003404 0.9655172  0.01039240 1.835025 112  
## [4] => {deposit=deposit=no} 0.01227379 0.9647887  0.01272173 1.833641 137  
## [5] => {deposit=deposit=no} 0.01182584 0.9635036  0.01227379 1.831198 132

subrules <- head(lift_norules,7)
plot(subrules, method="graph", interactive=FALSE)

## Warning in plot.rules(subrules, method = "graph", interactive = FALSE): The
## parameter interactive is deprecated. Use engine='interactive' instead.

Conclusion

Association rule mining can be extremely useful for establishing behavior patterns. The Apriori algorithm can be used to extract rules that help banks and financial institutions with targeted marketing. It is an effective technique for extracting useful information from huge data sets.

Interesting rules for yes rules (Deposit=yes)

• Customers whose previous campaign outcome was a success, who are in the senior age category (51-65 years), are married and do not have a housing loan are most likely to subscribe to a term deposit. Clients in this category can be targeted in a marketing campaign.

• Customers whose previous campaign outcome was a success and whose call lasted between 10 and 20 minutes are also very likely to subscribe to a term deposit. Clients in this category can also be targeted in a marketing campaign.

Interesting rules for no rules (Deposit=no)

• Customers with a secondary education level who were contacted 4 to 6 times during the campaign, with a call duration of less than 10 minutes and an unknown contact type, will likely not subscribe to a term deposit.

• Customers in the middle-age category (36-50 years) with a negative bank balance who were contacted for less than 10 minutes with an unknown contact type will likely not subscribe to a term deposit.

Bank Marketing Association Rule

Xolani Keith Mpala

2/26/2022
