The digitalization of financial services, enabled by the widespread adoption of smart devices, cloud computing, and data accessibility, has intensified competition in the financial industry. Financial institutions around the globe maintain huge amounts of electronic data. The sheer size of these databases makes it impractical for institutions to analyze them without automated techniques and retrieve the information that decision makers need. One of the most important data mining techniques is association rule mining, whose main purpose is to find frequent patterns, associations, and relationships between database items using different algorithms.
This paper explores the use of this technique on a bank marketing data set. The data set describes a marketing campaign (phone calls) run by a Portuguese banking institution. We analyze the data set by applying association rule mining to find associations between the marketing campaign and product uptake (whether or not the client subscribes to a bank term deposit). The rules help the bank identify strategies for improving future marketing campaigns.
• Job: type of job (‘admin.’, ‘blue-collar’, ‘management’, ‘retired’, ‘services’, ‘technician’, ‘other’)
• Marital: marital status (‘divorced’, ‘married’, ‘single’)
• Education: education level (‘primary’, ‘secondary’, ‘tertiary’, ‘unknown’)
• Default: has credit in default? (‘no’, ‘yes’)
• Housing: has housing loan? (‘no’, ‘yes’)
• Loan: has personal loan? (‘no’, ‘yes’)
• Contact: contact communication type (‘cellular’, ‘telephone’, ‘unknown’)
• Duration: contact duration in seconds.
• Campaign: number of contacts performed during this campaign.
• Previous: number of contacts performed before this campaign.
• Poutcome: outcome of the previous marketing campaign (‘failure’, ‘other’, ‘success’, ‘unknown’)
#Load the libraries
suppressPackageStartupMessages(library(arules))
suppressPackageStartupMessages(library(arulesViz))
suppressPackageStartupMessages(library(plotly))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggplot2))
The numeric fields ‘age’, ‘balance’, ‘duration’, ‘campaign’ and ‘previous’ had to be discretized so that they could be used in the Apriori algorithm; the remaining numeric fields, ‘day’ and ‘pdays’, were dropped instead.
#Read in the data to be preprocessed
bank_data <- read.csv("bank.csv")
bd <- bank_data
Age was binned into five ranges: youth (18-24), young_adult (25-35), middle_aged_adult (36-50), senior_age (51-65), and old (66-95). The histogram shows the frequency distribution of the continuous variable age, overlaid with a density plot that visualises how age is distributed in the bank data set.
ggplot(bank_data, aes(age)) + geom_histogram(aes(y=..density..),color="black", fill="skyblue") +
scale_x_continuous(breaks=c(18,24,35,50,65,95))+ theme_classic() + ggtitle('Age Distribution') + geom_density(alpha=.2, fill='black',color="#FF6666", linetype=0)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
bd$age <- cut(bd$age, breaks=c(18,24,35,50,65,Inf), labels=c('youth','young_adult','middle_aged_adult','senior_age','old'),include.lowest = TRUE)
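Because `cut()` uses right-closed intervals by default, a client aged exactly 50 falls into middle_aged_adult and a client aged 51 into senior_age. The following is a minimal sketch of that boundary behaviour; the `ages` vector is hypothetical and only serves to probe each break used above.

```r
# Hypothetical boundary ages, chosen to sit on either side of each break
ages <- c(18, 24, 25, 35, 36, 50, 51, 65, 66, 95)

age_group <- cut(
  ages,
  breaks = c(18, 24, 35, 50, 65, Inf),
  labels = c("youth", "young_adult", "middle_aged_adult", "senior_age", "old"),
  include.lowest = TRUE  # keeps age 18 in the first bin instead of NA
)

# Pair each age with the bin it was assigned to
boundaries <- data.frame(age = ages, group = as.character(age_group))
print(boundaries)
```

Each interval includes its upper break, so the bins are 18-24, 25-35, 36-50, 51-65 and 66 or older.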
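The fields that contained yes or no values, like ‘default’, ‘housing’, ‘loan’ and ‘deposit’, were recoded to “variable name=yes/no” (for example, ‘housing=yes’) so that the rules make clear which attribute a yes or no belongs to. Because the conversion to transactions also prefixes each value with its column name, these items appear in the rule output as, for example, “housing=housing=yes”.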
#change variable names with yes/no.
bd$default <- dplyr::recode(bd$default, no='default=no',yes='default=yes',unknown='default=unknown')
bd$housing <- dplyr::recode(bd$housing, no='housing=no',yes='housing=yes',unknown='housing=unknown')
bd$loan <- dplyr::recode(bd$loan, no='loan=no',yes='loan=yes',unknown='loan=unknown')
bd$deposit <- dplyr::recode(bd$deposit, no='deposit=no',yes='deposit=yes')
Bank balance was put into three ranges. A histogram was used to check the distribution of the data, together with the minimum and maximum values. The three categories were: -$6847 to $0, $0 to $4000, and $4000 to $81204.
ggplot(bank_data, aes(balance)) + geom_histogram(color="black", fill="#2E8BC0") +
scale_x_continuous(breaks=c(-6847,0,4000,10000))+theme_classic() + ggtitle('Bank Balance Distribution')
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
min_balance <- min(bd$balance)
bd$balance = cut(bd$balance, breaks=c(min_balance, 0,4000,Inf), labels= c('-veBal','0-$4000', '$4000+'),include.lowest = TRUE)
The duration column gives the length of the call in seconds. The column was converted into three categories of call time expressed in minutes: ‘less than 10 minutes’, ‘between 10 and 20 minutes’, and ‘more than 20 minutes’. A boxplot was plotted to check the distribution and outliers of the variable duration.
ggplot(bank_data, aes(duration,color=duration)) +
geom_boxplot(color="black", fill="#2E8BC0",outlier.color='#145DA0',outlier.alpha=0.2) +
theme_classic() + ggtitle('Duration of a phone call in seconds')
min_duration <- min(bd$duration)
bd$duration <- cut(bd$duration, breaks = c(min_duration, 600,1200,Inf),labels= c('<10mins','10-20mins', '<20+mins'),include.lowest = TRUE)
Campaign gives the number of contacts performed with the client during this campaign. The numeric variable was transformed into three categories: “1-3times” if the client was contacted 3 or fewer times, “4-6times” if contact was made between 4 and 6 times, and “+6times” if the client was contacted more than 6 times. A boxplot was plotted to check the distribution and outliers of the variable campaign.
ggplot(bd, aes(campaign,color=campaign)) +
geom_boxplot(color="black", fill="#2E8BC0",outlier.color='#145DA0',outlier.alpha=0.2) +
theme_classic() + ggtitle('Number of times client contacted')
min_campaign <- min(bd$campaign)
bd$campaign <- cut(bd$campaign, breaks = c(min_campaign, 3,6, Inf), labels= c("1-3times", "4-6times","+6times"), include.lowest = TRUE)
The column ‘previous’ gives the number of contacts performed before this campaign, and ‘pdays’ gives the number of days that passed since the client was last contacted in a previous campaign. A value of 0 in ‘previous’ means the client was never contacted before, and it corresponds to -1 in ‘pdays’, which likewise marks a client who was not previously contacted. For this reason, the column pdays was discarded and the column previous was transformed into two categories: PriorContact if the client was contacted in a previous campaign, and NoPriorContact if no contact was made.
bd$previous <- cut(bd$previous, breaks= c(-Inf,0,Inf), labels= c('NoPriorContact','PriorContact'))
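The claim that pdays carries no extra "was contacted before" information beyond previous can be checked with a cross-tabulation. The sketch below runs on a small hypothetical data frame `toy` rather than the real file; on the actual data the same two lines would be applied to `bank_data`.

```r
# Hypothetical toy rows mimicking the two columns of the bank data
toy <- data.frame(
  previous = c(0, 0, 2, 5, 0, 1),
  pdays    = c(-1, -1, 91, 340, -1, 7)
)

# Cross-tabulate "no earlier contact" against "pdays flagged as -1"
agreement <- table(no_previous = toy$previous == 0,
                   pdays_minus1 = toy$pdays == -1)
print(agreement)

# TRUE only if the two flags agree on every row
flags_agree <- all((toy$previous == 0) == (toy$pdays == -1))
print(flags_agree)
```

If every row agrees, all counts land on the diagonal of the table and the contact flag derived from pdays is fully captured by previous.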
The columns pdays, day and month were discarded from the data set before mining association rules, because these columns strongly affect the output target; removing them gives a more realistic predictive model. Character variables in the data set were converted into factors so that association rule mining could be applied.
# Drop the columns pdays, month and day
bd <- select(bd, -pdays)
bd <- select(bd, -month)
bd <- select(bd,-day)
bd_test <- as.data.frame(unclass(bd), stringsAsFactors = TRUE)
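The conversion of character columns to factors relies on rebuilding the data frame from its underlying list. Below is a small sketch of that idiom on a hypothetical two-column data frame `toy`; the column values are made up for illustration.

```r
# Hypothetical toy data frame with one character and one numeric column
toy <- data.frame(
  job = c("admin.", "services", "admin."),
  campaign = c(1, 4, 2),
  stringsAsFactors = FALSE
)

# unclass() exposes the data frame as a plain list; rebuilding it with
# stringsAsFactors = TRUE turns every character column into a factor
toy_factors <- as.data.frame(unclass(toy), stringsAsFactors = TRUE)

# Show the class of each column after the conversion
print(sapply(toy_factors, class))
```

Numeric columns are left untouched, which is why the numeric fields had to be discretized with `cut()` beforehand.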
The data set used in this study consists of 11162 transactions from the “Bank Data set”. There are no missing values. In the summary below, 1794 entries in the job column appear under (Other), which is how summary() groups the less frequent job categories, and 497 entries in education are classified as unknown. In poutcome, 8326 and 537 previous campaign outcomes are classified as unknown and other respectively, while 2346 entries in the contact column are classified as unknown. The detailed summary statistics are shown below.
summary(bd_test)
## age job marital education
## youth : 282 management :2566 divorced:1293 primary :1500
## young_adult :4089 blue-collar:1944 married :6351 secondary:5476
## middle_aged_adult:4320 technician :1823 single :3518 tertiary :3689
## senior_age :2073 admin. :1334 unknown : 497
## old : 398 services : 923
## retired : 778
## (Other) :1794
## default balance housing loan
## default=no :10994 -veBal :1462 housing=no :5881 loan=no :9702
## default=yes: 168 0-$4000:8622 housing=yes:5281 loan=yes:1460
## $4000+ :1078
##
##
##
##
## contact duration campaign previous
## cellular :8042 <10mins :9010 1-3times:9147 NoPriorContact:8324
## telephone: 774 10-20mins:1772 4-6times:1414 PriorContact :2838
## unknown :2346 <20+mins : 380 +6times : 601
##
##
##
##
## poutcome deposit
## failure:1228 deposit=no :5873
## other : 537 deposit=yes:5289
## success:1071
## unknown:8326
##
##
##
Item frequency shows which items appear most often in the data set. The most frequent item is default=no (clients who have not defaulted on their credit), followed by loan=no (clients who do not have a personal loan).
#item frequency
bd_matrix1 <- as(bd_test, "transactions")
itemFrequencyPlot(bd_matrix1, topN=25, type="relative", main="Item frequency", col="#2E8BC0")
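The relative frequencies behind the plot can be recomputed by hand from the summary table. This is a minimal sketch using five counts copied from that table; the names of the `item_counts` vector are simplified labels chosen here for illustration, not the exact item labels used by arules.

```r
n_clients <- 11162  # number of transactions in the data set

# Counts copied from the summary table (labels simplified for illustration)
item_counts <- c(
  "default=no"        = 10994,
  "loan=no"           = 9702,
  "campaign=1-3times" = 9147,
  "duration=<10mins"  = 9010,
  "poutcome=unknown"  = 8326
)

# Relative item frequency = share of transactions containing the item
rel_freq <- sort(item_counts / n_clients, decreasing = TRUE)
print(round(rel_freq, 3))
```

With these counts, default=no appears in roughly 98% of transactions and loan=no in roughly 87%.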
The Apriori algorithm is used here. Because the minimum support level is defined at the outset, Apriori reduces the number of candidate itemsets and rules that have to be examined. The Apriori heuristic relies on two facts: if a set of two items is frequent (meets the minimum support condition), both of its items are frequent too; and if a given item is not frequent, no set of items that includes it can be frequent either.
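The property the heuristic relies on can be seen on a handful of transactions. The sketch below uses hypothetical toy transactions and a hypothetical helper `support()`; it illustrates the pruning idea only and is not how arules implements the search.

```r
# Hypothetical toy transactions; each element is one client's set of items
transactions <- list(
  c("loan=no", "housing=no", "deposit=yes"),
  c("loan=no", "housing=yes"),
  c("loan=no", "housing=no"),
  c("loan=yes", "housing=no", "deposit=yes"),
  c("loan=no", "deposit=yes")
)

# Support: share of transactions that contain every item of the itemset
support <- function(itemset, trans) {
  mean(vapply(trans, function(t) all(itemset %in% t), logical(1)))
}

s_loan    <- support("loan=no", transactions)                   # 4 of 5
s_housing <- support("housing=no", transactions)                # 3 of 5
s_pair    <- support(c("loan=no", "housing=no"), transactions)  # 2 of 5

# A pair can never be more frequent than its least frequent member,
# so an infrequent item rules out every itemset that contains it
print(c(loan = s_loan, housing = s_housing, pair = s_pair))
```

This is why Apriori can discard any candidate itemset as soon as one of its subsets falls below the minimum support.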
Association rules describe the relationship between the occurrence of two or more items. They are characterized by a few parameters.
Support measures how often the joint itemset of the rule appears in the database, as a share of all transactions.
Confidence measures how often the rule holds when its antecedent is present. It is the ratio of the support of the antecedent and consequent together to the support of the antecedent item (or items) alone.
Lift is often the most informative measure: it gauges “interestingness” by comparing how often X and Y occur together with how often they would be expected to occur together if they had no influence on one another. The higher the lift, the stronger the co-occurrence of X and Y. Lift values higher than one indicate a positive relationship between items or itemsets, and values lower than one indicate a negative relationship.
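Written as formulas, the three measures for a rule X ⇒ Y look as follows. The notation is an assumption introduced here: D denotes the set of all transactions and t a single transaction.

```latex
\begin{align}
  \operatorname{supp}(X \Rightarrow Y) &= \frac{\lvert \{\, t \in D : X \cup Y \subseteq t \,\} \rvert}{\lvert D \rvert} \\
  \operatorname{conf}(X \Rightarrow Y) &= \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)} \\
  \operatorname{lift}(X \Rightarrow Y) &= \frac{\operatorname{conf}(X \Rightarrow Y)}{\operatorname{supp}(Y)}
    = \frac{\operatorname{supp}(X \cup Y)}{\operatorname{supp}(X)\,\operatorname{supp}(Y)}
\end{align}
```

A lift of exactly one is what independence of X and Y would give, since then supp(X ∪ Y) = supp(X) supp(Y).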
When creating the rules for this analysis, the aim was to gather rules that tell the business something useful about its clients and about what makes them subscribe or not subscribe to a term deposit. Initially, rules were generated over the entire data set to see whether any deposit item was present in the top 20 rules. These rules were created with the following parameters: minimum support = 0.01, minimum confidence = 95%, and a maximum of 5 items per rule. A graph was then produced to see which items were part of the biggest rule sets and to get an overview.
rules<- apriori(bd_test, parameter = list(supp=0.01, conf=0.95, maxlen = 5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.95 0.1 1 none FALSE TRUE 5 0.01 1
## maxlen target ext
## 5 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 111
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[50 item(s), 11162 transaction(s)] done [0.01s].
## sorting and recoding items ... [49 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5
## Warning in apriori(bd_test, parameter = list(supp = 0.01, conf = 0.95, maxlen
## = 5)): Mining stopped (maxlen reached). Only patterns up to a length of 5
## returned!
## done [0.16s].
## writing ... [34275 rule(s)] done [0.01s].
## creating S4 object ... done [0.02s].
rules<- sort(rules, decreasing = FALSE, by ="lift")
plot(rules, measure = c("support", "lift"), shading = "confidence", main = "Support, Lift, and Confidence Top Rules")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
generalRules20<- head(rules, n = 20, by = "lift")
inspect(generalRules20[1:10], linebreak = FALSE)
## lhs rhs
## [1] {poutcome=other} => {previous=PriorContact}
## [2] {poutcome=success} => {previous=PriorContact}
## [3] {poutcome=failure} => {previous=PriorContact}
## [4] {job=management, poutcome=other} => {previous=PriorContact}
## [5] {marital=single, poutcome=other} => {previous=PriorContact}
## [6] {education=tertiary, poutcome=other} => {previous=PriorContact}
## [7] {age=young_adult, poutcome=other} => {previous=PriorContact}
## [8] {age=middle_aged_adult, poutcome=other} => {previous=PriorContact}
## [9] {housing=housing=yes, poutcome=other} => {previous=PriorContact}
## [10] {poutcome=other, deposit=deposit=yes} => {previous=PriorContact}
## support confidence coverage lift count
## [1] 0.04810966 1 0.04810966 3.933051 537
## [2] 0.09595055 1 0.09595055 3.933051 1071
## [3] 0.11001613 1 0.11001613 3.933051 1228
## [4] 0.01263214 1 0.01263214 3.933051 141
## [5] 0.01845547 1 0.01845547 3.933051 206
## [6] 0.01720122 1 0.01720122 3.933051 192
## [7] 0.01997850 1 0.01997850 3.933051 223
## [8] 0.01675327 1 0.01675327 3.933051 187
## [9] 0.02383085 1 0.02383085 3.933051 266
## [10] 0.02750403 1 0.02750403 3.933051 307
plot(generalRules20, method = "graph")
After gathering general rules about the data set, the right-hand side (RHS) was set to either “deposit=deposit=yes” or “deposit=deposit=no” to examine what kind of clients subscribed to a deposit after the campaign. The first set of rules was based on clients who subscribed, with support set to 0.01 and confidence to 85%. Furthermore, the rules were limited to a maximum of six items each to reduce the redundancy that comes with too many items per rule.
#rules by confidence
yesrules<- apriori(bd_test, parameter = list(supp=0.01, conf = 0.85, maxlen = 6),
appearance =list(default = "lhs", rhs="deposit=deposit=yes"),
control=list(verbose=F))
summary(yesrules)
## set of 2810 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3 4 5 6
## 2 45 284 863 1616
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 5.00 6.00 5.44 6.00 6.00
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01003 Min. :0.8500 Min. :0.01030 Min. :1.794
## 1st Qu.:0.01245 1st Qu.:0.8737 1st Qu.:0.01380 1st Qu.:1.844
## Median :0.01684 Median :0.9031 Median :0.01863 Median :1.906
## Mean :0.02166 Mean :0.8984 Mean :0.02407 Mean :1.896
## 3rd Qu.:0.02517 3rd Qu.:0.9187 3rd Qu.:0.02804 3rd Qu.:1.939
## Max. :0.10115 Max. :0.9783 Max. :0.11826 Max. :2.065
## count
## Min. : 112.0
## 1st Qu.: 139.0
## Median : 188.0
## Mean : 241.7
## 3rd Qu.: 281.0
## Max. :1129.0
##
## mining info:
## data ntransactions support confidence
## bd_test 11162 0.01 0.85
## call
## apriori(data = bd_test, parameter = list(supp = 0.01, conf = 0.85, maxlen = 6), appearance = list(default = "lhs", rhs = "deposit=deposit=yes"), control = list(verbose = F))
With this chunk of code, we created 2810 rules. The summary also gives the lift statistics for all the rules combined. Lift, as a reminder, is a dependency measure: it compares how often X and Y occur together with what would be expected if they were independent. The minimum lift is 1.794, which is above 1, so all of the rules show a positive dependency.
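The minimum lift follows directly from the chosen confidence threshold, because every rule here has the same consequent. The sketch below works this out from numbers already reported above (11162 transactions, 5289 clients with deposit=yes, minimum confidence 0.85), assuming lift is computed as confidence divided by the support of the consequent.

```r
n_clients <- 11162  # transactions in the data set
n_yes     <- 5289   # clients with deposit=yes, from the summary table
min_conf  <- 0.85   # confidence threshold passed to apriori()

# Support of the fixed right-hand side deposit=yes
supp_rhs <- n_yes / n_clients

# lift = confidence / supp(rhs), so the threshold fixes the smallest lift,
# and a confidence of 1 would give the largest possible lift
lift_floor   <- min_conf / supp_rhs
lift_ceiling <- 1 / supp_rhs

print(round(c(supp_rhs = supp_rhs, floor = lift_floor, ceiling = lift_ceiling), 3))
```

Any rule that passes the 85% confidence filter therefore has a lift of at least about 1.794, so a lift above 1 is guaranteed by the threshold rather than being a separate finding.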
#yesrules by confidence
conf_yesrules<- sort(yesrules , by ="confidence", decreasing = T )
inspect(conf_yesrules[1:10], linebreak = FALSE)
## lhs
## [1] {age=senior_age, marital=married, housing=housing=no, campaign=1-3times, poutcome=success}
## [2] {age=senior_age, marital=married, housing=housing=no, contact=cellular, poutcome=success}
## [3] {age=senior_age, marital=married, balance=0-$4000, housing=housing=no, poutcome=success}
## [4] {duration=10-20mins, poutcome=success}
## [5] {duration=10-20mins, previous=PriorContact, poutcome=success}
## [6] {default=default=no, duration=10-20mins, poutcome=success}
## [7] {default=default=no, duration=10-20mins, previous=PriorContact, poutcome=success}
## [8] {loan=loan=no, duration=10-20mins, poutcome=success}
## [9] {loan=loan=no, duration=10-20mins, previous=PriorContact, poutcome=success}
## [10] {default=default=no, loan=loan=no, duration=10-20mins, poutcome=success}
## rhs support confidence coverage lift count
## [1] => {deposit=deposit=yes} 0.01209461 0.9782609 0.01236338 2.064539 135
## [2] => {deposit=deposit=yes} 0.01146748 0.9770992 0.01173625 2.062088 128
## [3] => {deposit=deposit=yes} 0.01110912 0.9763780 0.01137789 2.060565 124
## [4] => {deposit=deposit=yes} 0.01075076 0.9756098 0.01101953 2.058944 120
## [5] => {deposit=deposit=yes} 0.01075076 0.9756098 0.01101953 2.058944 120
## [6] => {deposit=deposit=yes} 0.01075076 0.9756098 0.01101953 2.058944 120
## [7] => {deposit=deposit=yes} 0.01075076 0.9756098 0.01101953 2.058944 120
## [8] => {deposit=deposit=yes} 0.01021322 0.9743590 0.01048199 2.056305 114
## [9] => {deposit=deposit=yes} 0.01021322 0.9743590 0.01048199 2.056305 114
## [10] => {deposit=deposit=yes} 0.01021322 0.9743590 0.01048199 2.056305 114
subrules <- head(conf_yesrules,8)
plot(subrules, method="graph")
The highest confidence levels achieved by rules in the data set exceed 97%. The first two rules imply that clients who are in the senior age category, are married, do not have a housing loan, and had a successful previous campaign outcome (and who were contacted one to three times in this campaign, or by cellular phone, respectively) subscribe to a term deposit with confidence 97.83% and 97.71% respectively. Unfortunately, the support of these two rules is only around 1%, so they do not occur often: 135 and 128 times.
#yesrules by support
supp_yesrules <- sort(yesrules, by='support', decreasing=TRUE)
inspect(supp_yesrules[1:10], linebreak = FALSE)
## lhs
## [1] {contact=cellular, duration=10-20mins}
## [2] {default=default=no, contact=cellular, duration=10-20mins}
## [3] {poutcome=success}
## [4] {previous=PriorContact, poutcome=success}
## [5] {default=default=no, poutcome=success}
## [6] {default=default=no, previous=PriorContact, poutcome=success}
## [7] {loan=loan=no, contact=cellular, duration=10-20mins}
## [8] {default=default=no, loan=loan=no, contact=cellular, duration=10-20mins}
## [9] {loan=loan=no, poutcome=success}
## [10] {loan=loan=no, previous=PriorContact, poutcome=success}
## rhs support confidence coverage lift count
## [1] => {deposit=deposit=yes} 0.10114675 0.8553030 0.11825838 1.805047 1129
## [2] => {deposit=deposit=yes} 0.09935495 0.8557099 0.11610822 1.805905 1109
## [3] => {deposit=deposit=yes} 0.08761871 0.9131653 0.09595055 1.927160 978
## [4] => {deposit=deposit=yes} 0.08761871 0.9131653 0.09595055 1.927160 978
## [5] => {deposit=deposit=yes} 0.08761871 0.9131653 0.09595055 1.927160 978
## [6] => {deposit=deposit=yes} 0.08761871 0.9131653 0.09595055 1.927160 978
## [7] => {deposit=deposit=yes} 0.08663322 0.8512324 0.10177388 1.796456 967
## [8] => {deposit=deposit=yes} 0.08564773 0.8512912 0.10060921 1.796580 956
## [9] => {deposit=deposit=yes} 0.08367676 0.9156863 0.09138147 1.932481 934
## [10] => {deposit=deposit=yes} 0.08367676 0.9156863 0.09138147 1.932481 934
subrules <- head(supp_yesrules,10)
plot(subrules, method="graph")
The highest support level achieved by rules in the analyzed data set is 10.11%. This rule refers to clients who were contacted by cellular phone for a duration of 10-20 minutes and subscribed to a term deposit. It occurs for 1129 of the 11162 clients (10.11%).
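The quality measures in the listing can be reproduced from raw counts. As a worked sketch, take rule [3] of the support-sorted listing, {poutcome=success} => {deposit=deposit=yes}: the counts below are copied from the summary table and the rule's count column, and the variable names are chosen here for illustration.

```r
n_clients <- 11162  # all transactions
n_lhs     <- 1071   # clients with poutcome=success (summary table)
n_rhs     <- 5289   # clients with deposit=yes (summary table)
n_both    <- 978    # clients matching both sides (count column of the rule)

support    <- n_both / n_clients               # share of all clients matching the rule
coverage   <- n_lhs / n_clients                # share of clients matching the left-hand side
confidence <- n_both / n_lhs                   # share of left-hand-side clients who subscribed
lift       <- confidence / (n_rhs / n_clients) # confidence relative to the base rate

rule_metrics <- c(support = support, confidence = confidence,
                  coverage = coverage, lift = lift)
print(round(rule_metrics, 4))
```

The results agree with the values arules reports for this rule (support 0.0876, confidence 0.9132, coverage 0.0960, lift 1.9272).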
lift_yesrules <- sort(yesrules, by='lift', decreasing=TRUE)
inspect(lift_yesrules[1:10], linebreak = FALSE)
## lhs
## [1] {age=senior_age, marital=married, housing=housing=no, campaign=1-3times, poutcome=success}
## [2] {age=senior_age, marital=married, housing=housing=no, contact=cellular, poutcome=success}
## [3] {age=senior_age, marital=married, balance=0-$4000, housing=housing=no, poutcome=success}
## [4] {duration=10-20mins, poutcome=success}
## [5] {duration=10-20mins, previous=PriorContact, poutcome=success}
## [6] {default=default=no, duration=10-20mins, poutcome=success}
## [7] {default=default=no, duration=10-20mins, previous=PriorContact, poutcome=success}
## [8] {loan=loan=no, duration=10-20mins, poutcome=success}
## [9] {loan=loan=no, duration=10-20mins, previous=PriorContact, poutcome=success}
## [10] {default=default=no, loan=loan=no, duration=10-20mins, poutcome=success}
## rhs support confidence coverage lift count
## [1] => {deposit=deposit=yes} 0.01209461 0.9782609 0.01236338 2.064539 135
## [2] => {deposit=deposit=yes} 0.01146748 0.9770992 0.01173625 2.062088 128
## [3] => {deposit=deposit=yes} 0.01110912 0.9763780 0.01137789 2.060565 124
## [4] => {deposit=deposit=yes} 0.01075076 0.9756098 0.01101953 2.058944 120
## [5] => {deposit=deposit=yes} 0.01075076 0.9756098 0.01101953 2.058944 120
## [6] => {deposit=deposit=yes} 0.01075076 0.9756098 0.01101953 2.058944 120
## [7] => {deposit=deposit=yes} 0.01075076 0.9756098 0.01101953 2.058944 120
## [8] => {deposit=deposit=yes} 0.01021322 0.9743590 0.01048199 2.056305 114
## [9] => {deposit=deposit=yes} 0.01021322 0.9743590 0.01048199 2.056305 114
## [10] => {deposit=deposit=yes} 0.01021322 0.9743590 0.01048199 2.056305 114
subrules <- head(lift_yesrules,10)
plot(subrules, method="graph")
Rules can also be compared by lift. The highest lift values among rules in the data set are 2.065 and 2.062. Both rules describe clients who are in the senior age category, are married, do not have a housing loan, and had a successful previous campaign outcome; the first additionally requires one to three contacts in this campaign, the second contact by cellular phone. Both imply that such clients subscribe to a term deposit. The confidence levels of these two rules are 97.83% and 97.71%.
norules<- apriori(bd_test, parameter = list(supp=0.01, conf = 0.90, maxlen = 6),
appearance =list(default = "lhs", rhs="deposit=deposit=no"),
control=list(verbose=F))
#norules by Confidence
conf_norules<- sort(norules, by ="confidence", decreasing = TRUE)
inspect(conf_norules[1:5], linebreak = FALSE)
## lhs
## [1] {education=secondary, contact=unknown, duration=<10mins, campaign=4-6times}
## [2] {education=secondary, contact=unknown, duration=<10mins, campaign=4-6times, previous=NoPriorContact}
## [3] {education=secondary, contact=unknown, duration=<10mins, campaign=4-6times, poutcome=unknown}
## [4] {age=middle_aged_adult, default=default=no, balance=-veBal, contact=unknown, duration=<10mins}
## [5] {age=middle_aged_adult, balance=-veBal, contact=unknown, duration=<10mins, campaign=1-3times}
## rhs support confidence coverage lift count
## [1] => {deposit=deposit=no} 0.01003404 0.9655172 0.01039240 1.835025 112
## [2] => {deposit=deposit=no} 0.01003404 0.9655172 0.01039240 1.835025 112
## [3] => {deposit=deposit=no} 0.01003404 0.9655172 0.01039240 1.835025 112
## [4] => {deposit=deposit=no} 0.01227379 0.9647887 0.01272173 1.833641 137
## [5] => {deposit=deposit=no} 0.01182584 0.9635036 0.01227379 1.831198 132
subrules <- head(conf_norules,8)
plot(subrules, method="graph")
The highest confidence levels achieved by the norules exceed 96%. The first rule implies that clients with a secondary education level who were contacted 4 to 6 times during the campaign, for a call duration of less than 10 minutes and with an unknown contact type, do not subscribe to a term deposit with 96.55% confidence. The support of this rule is around 1%; it occurs 112 times. Rule [4] implies that clients who are in the middle-age category, have no credit in default, have a negative bank balance, and were contacted for less than 10 minutes with an unknown contact type do not subscribe to a term deposit with 96.48% confidence. The support of this rule is around 1.23%; it occurs 137 times.
#norules by support
supp_norules <- sort(norules, by='support', decreasing=TRUE)
inspect(supp_norules[1:5], linebreak = FALSE)
## lhs
## [1] {contact=unknown, duration=<10mins}
## [2] {contact=unknown, duration=<10mins, previous=NoPriorContact}
## [3] {contact=unknown, duration=<10mins, poutcome=unknown}
## [4] {contact=unknown, duration=<10mins, previous=NoPriorContact, poutcome=unknown}
## [5] {default=default=no, contact=unknown, duration=<10mins}
## rhs support confidence coverage lift count
## [1] => {deposit=deposit=no} 0.1545422 0.9146341 0.1689661 1.738319 1725
## [2] => {deposit=deposit=no} 0.1539151 0.9201928 0.1672639 1.748883 1718
## [3] => {deposit=deposit=no} 0.1539151 0.9201928 0.1672639 1.748883 1718
## [4] => {deposit=deposit=no} 0.1539151 0.9201928 0.1672639 1.748883 1718
## [5] => {deposit=deposit=no} 0.1506003 0.9160763 0.1643971 1.741060 1681
subrules <- head(supp_norules,6)
plot(subrules, method="graph")
Rule [2] has the second highest support among the rules found in the analyzed data set, at 15.39%. It refers to clients who were contacted for less than 10 minutes, with an unknown contact type, and who had no contact in a previous campaign; such clients are likely not to subscribe to a term deposit. This occurs for 1718 of the 11162 clients (15.39%).
#norules by lift
lift_norules <- sort(norules, by='lift', decreasing=TRUE)
inspect(lift_norules[1:5], linebreak = FALSE)
## lhs
## [1] {education=secondary, contact=unknown, duration=<10mins, campaign=4-6times}
## [2] {education=secondary, contact=unknown, duration=<10mins, campaign=4-6times, previous=NoPriorContact}
## [3] {education=secondary, contact=unknown, duration=<10mins, campaign=4-6times, poutcome=unknown}
## [4] {age=middle_aged_adult, default=default=no, balance=-veBal, contact=unknown, duration=<10mins}
## [5] {age=middle_aged_adult, balance=-veBal, contact=unknown, duration=<10mins, campaign=1-3times}
## rhs support confidence coverage lift count
## [1] => {deposit=deposit=no} 0.01003404 0.9655172 0.01039240 1.835025 112
## [2] => {deposit=deposit=no} 0.01003404 0.9655172 0.01039240 1.835025 112
## [3] => {deposit=deposit=no} 0.01003404 0.9655172 0.01039240 1.835025 112
## [4] => {deposit=deposit=no} 0.01227379 0.9647887 0.01272173 1.833641 137
## [5] => {deposit=deposit=no} 0.01182584 0.9635036 0.01227379 1.831198 132
subrules <- head(lift_norules,7)
plot(subrules, method="graph")
Association rule mining can be extremely useful for establishing behavior patterns. The Apriori algorithm can be used to extract rules that help banks and financial institutions with target marketing. It is an effective technique for extracting useful information and insights from large data sets.
• Customers whose previous campaign outcome was a success, who are in the senior age category (51-65 years), are married, and do not have a housing loan are most likely to subscribe to a term deposit. Clients in this category can be targeted in marketing campaigns.
• Customers whose previous campaign outcome was a success and who were contacted for between 10 and 20 minutes are most likely to subscribe to a term deposit. Clients in this category can also be targeted in marketing campaigns.
• Customers with a secondary education level who were contacted 4 to 6 times during the campaign, for a call duration of less than 10 minutes and with an unknown contact type, will likely not subscribe to a term deposit.
• Customers in the middle-age category (36-50 years) who have a negative bank balance and were contacted for less than 10 minutes with an unknown contact type will likely not subscribe to a term deposit.