Association rules for market segmentation
Introduction
Understanding the needs of customers is a crucial priority for the majority of modern companies, particularly those big enough to have accumulated sufficient capital to afford specialized marketing actions. In some of them, there is even a dedicated team of people responsible for that aspect of the firm's activity. One of the strategies those employees may use is to divide customers into groups based on their socio-economic background, e.g. their sex, income, and so forth. The purpose of such actions is to maximize profit by addressing various types of customers differently. While market segmentation may be carried out using clustering, another possible approach, perhaps a little less obvious, is to apply association rules. They make it possible to identify patterns in the data and, thus, to formulate criteria (characteristics) useful in assigning new customers to specified profiles. Thanks to that, more efficient marketing actions may be prepared and introduced in order to increase the spending of those customers.
In this work, the Apriori algorithm is used to find rules that may turn out to be important when distinguishing various groups of customers. The chosen dataset consists of 2,240 observations and is available on the Kaggle platform under the following link. The dataset stores information about these customers: their socio-economic characteristics as well as a general overview of the transactions they participated in. More details are presented in the next section.
library("arules")
library("arulesViz")Dataset details
As mentioned above, the dataset consists of 2,240 observations. However, some rows contain NA values; in this work, they were removed.
data_raw <- read.csv("marketing_campaign.csv", sep = "\t")
data_raw <- na.omit(data_raw)
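# Optional sanity check (hedged, not part of the original workflow): the raw file
# has 2,240 rows, so the difference below shows how many incomplete rows were dropped.
# nrow(read.csv("marketing_campaign.csv", sep = "\t")) - nrow(data_raw)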
colnames(data_raw)
## [1] "ID" "Year_Birth" "Education"
## [4] "Marital_Status" "Income" "Kidhome"
## [7] "Teenhome" "Dt_Customer" "Recency"
## [10] "MntWines" "MntFruits" "MntMeatProducts"
## [13] "MntFishProducts" "MntSweetProducts" "MntGoldProds"
## [16] "NumDealsPurchases" "NumWebPurchases" "NumCatalogPurchases"
## [19] "NumStorePurchases" "NumWebVisitsMonth" "AcceptedCmp3"
## [22] "AcceptedCmp4" "AcceptedCmp5" "AcceptedCmp1"
## [25] "AcceptedCmp2" "Complain" "Z_CostContact"
## [28] "Z_Revenue" "Response"
The information hidden in the data might be divided into four groups. The first group contains personal characteristics of the customers included in the dataset, such as:
- birth year (Year_Birth)
- level of education (Education)
- marital status (Marital_Status)
- income (Income)
- number of children in customer’s household (Kidhome)
- number of teenagers in customer’s household (Teenhome)
- date of customer’s enrollment with the company (Dt_Customer)
- number of days since customer’s last purchase (Recency)
- whether a customer made a complaint in the last two years (binary variable Complain).
head(data_raw[, c(2:9, 26)])
## Year_Birth Education Marital_Status Income Kidhome Teenhome Dt_Customer
## 1 1957 Graduation Single 58138 0 0 04-09-2012
## 2 1954 Graduation Single 46344 1 1 08-03-2014
## 3 1965 Graduation Together 71613 0 0 21-08-2013
## 4 1984 Graduation Together 26646 1 0 10-02-2014
## 5 1981 PhD Married 58293 1 0 19-01-2014
## 6 1967 Master Together 62513 0 1 09-09-2013
## Recency Complain
## 1 58 0
## 2 38 0
## 3 26 0
## 4 26 0
## 5 94 0
## 6 16 0
str(data_raw[, c(2:9, 26)])
## 'data.frame': 2216 obs. of 9 variables:
## $ Year_Birth : int 1957 1954 1965 1984 1981 1967 1971 1985 1974 1950 ...
## $ Education : chr "Graduation" "Graduation" "Graduation" "Graduation" ...
## $ Marital_Status: chr "Single" "Single" "Together" "Together" ...
## $ Income : int 58138 46344 71613 26646 58293 62513 55635 33454 30351 5648 ...
## $ Kidhome : int 0 1 0 1 1 0 0 1 1 1 ...
## $ Teenhome : int 0 1 0 0 0 1 1 0 0 1 ...
## $ Dt_Customer : chr "04-09-2012" "08-03-2014" "21-08-2013" "10-02-2014" ...
## $ Recency : int 58 38 26 26 94 16 34 32 19 68 ...
## $ Complain : int 0 0 0 0 0 0 0 0 0 0 ...
summary(data_raw$Income)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1730 35303 51382 52247 68522 666666
summary(data_raw$Recency)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 24.00 49.00 49.01 74.00 99.00
The second group is focused on the amount of money spent in the last two years on the following product categories:
- spent on wine (MntWines)
- spent on fruit (MntFruits)
- spent on meat (MntMeatProducts)
- spent on fish (MntFishProducts)
- spent on sweets (MntSweetProducts)
- spent on gold (MntGoldProds).
head(data_raw[, c(10:15)])
## MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## 1 635 88 546 172 88
## 2 11 1 6 2 1
## 3 426 49 127 111 21
## 4 11 4 20 10 3
## 5 173 43 118 46 27
## 6 520 42 98 0 42
## MntGoldProds
## 1 88
## 2 6
## 3 42
## 4 5
## 5 15
## 6 14
str(data_raw[, c(10:15)])
## 'data.frame': 2216 obs. of 6 variables:
## $ MntWines : int 635 11 426 11 173 520 235 76 14 28 ...
## $ MntFruits : int 88 1 49 4 43 42 65 10 0 0 ...
## $ MntMeatProducts : int 546 6 127 20 118 98 164 56 24 6 ...
## $ MntFishProducts : int 172 2 111 10 46 0 50 3 3 1 ...
## $ MntSweetProducts: int 88 1 21 3 27 42 49 1 3 1 ...
## $ MntGoldProds : int 88 6 42 5 15 14 27 23 2 13 ...
summary(data_raw[, c(10:15)])
## MntWines MntFruits MntMeatProducts MntFishProducts
## Min. : 0.0 Min. : 0.00 Min. : 0.0 Min. : 0.00
## 1st Qu.: 24.0 1st Qu.: 2.00 1st Qu.: 16.0 1st Qu.: 3.00
## Median : 174.5 Median : 8.00 Median : 68.0 Median : 12.00
## Mean : 305.1 Mean : 26.36 Mean : 167.0 Mean : 37.64
## 3rd Qu.: 505.0 3rd Qu.: 33.00 3rd Qu.: 232.2 3rd Qu.: 50.00
## Max. :1493.0 Max. :199.00 Max. :1725.0 Max. :259.00
## MntSweetProducts MntGoldProds
## Min. : 0.00 Min. : 0.00
## 1st Qu.: 1.00 1st Qu.: 9.00
## Median : 8.00 Median : 24.50
## Mean : 27.03 Mean : 43.97
## 3rd Qu.: 33.00 3rd Qu.: 56.00
## Max. :262.00 Max. :321.00
Another set of columns is devoted to describing the participation of a given customer in marketing campaigns organised by the company in order to sell its products at a discount.
- number of purchases made with a discount (NumDealsPurchases)
- whether customer participated in 1st campaign (binary variable AcceptedCmp1)
- whether customer participated in 2nd campaign (binary variable AcceptedCmp2)
- whether customer participated in 3rd campaign (binary variable AcceptedCmp3)
- whether customer participated in 4th campaign (binary variable AcceptedCmp4)
- whether customer participated in 5th campaign (binary variable AcceptedCmp5)
- whether customer participated in the last (6th) campaign (binary variable Response).
head(data_raw[, c(16, 24, 25, 21:23, 29)])
## NumDealsPurchases AcceptedCmp1 AcceptedCmp2 AcceptedCmp3 AcceptedCmp4
## 1 3 0 0 0 0
## 2 2 0 0 0 0
## 3 1 0 0 0 0
## 4 2 0 0 0 0
## 5 5 0 0 0 0
## 6 2 0 0 0 0
## AcceptedCmp5 Response
## 1 0 1
## 2 0 0
## 3 0 0
## 4 0 0
## 5 0 0
## 6 0 0
str(data_raw[, c(16, 24, 25, 21:23, 29)])
## 'data.frame': 2216 obs. of 7 variables:
## $ NumDealsPurchases: int 3 2 1 2 5 2 4 2 1 1 ...
## $ AcceptedCmp1 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp2 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp3 : int 0 0 0 0 0 0 0 0 0 1 ...
## $ AcceptedCmp4 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ AcceptedCmp5 : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Response : int 1 0 0 0 0 0 0 0 1 0 ...
summary(data_raw$NumDealsPurchases)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.000 2.324 3.000 15.000
The last group consists of four columns that describe the channels a given customer uses to buy the firm’s products, as well as the customer’s online activity.
- number of purchases made through a website (NumWebPurchases)
- number of purchases made using a catalogue (NumCatalogPurchases)
- number of purchases made directly in stores (NumStorePurchases)
- number of visits to a website in the last month (NumWebVisitsMonth).
head(data_raw[, c(17:20)])
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## 1 8 10 4 7
## 2 1 1 2 5
## 3 8 2 10 4
## 4 2 0 4 6
## 5 5 3 6 5
## 6 6 4 10 6
str(data_raw[, c(17:20)])
## 'data.frame': 2216 obs. of 4 variables:
## $ NumWebPurchases : int 8 1 8 2 5 6 7 4 3 1 ...
## $ NumCatalogPurchases: int 10 1 2 0 3 4 3 0 0 0 ...
## $ NumStorePurchases : int 4 2 10 4 6 10 7 4 2 0 ...
## $ NumWebVisitsMonth : int 7 5 4 6 5 6 6 8 9 20 ...
summary(data_raw[, c(17:20)])
## NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## Min. : 0.000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 3.000 1st Qu.: 3.000
## Median : 4.000 Median : 2.000 Median : 5.000 Median : 6.000
## Mean : 4.085 Mean : 2.671 Mean : 5.801 Mean : 5.319
## 3rd Qu.: 6.000 3rd Qu.: 4.000 3rd Qu.: 8.000 3rd Qu.: 7.000
## Max. :27.000 Max. :28.000 Max. :13.000 Max. :20.000
Data preparation
In order to apply an algorithm devoted to finding association rules, the data first has to be transformed a little. Firstly, all continuous variables, e.g. a customer's income or the amount of money spent on products, should be converted into categorical ones. For instance, customers might be assigned to quantile-based groups instead of keeping the initial income value. Similarly, birth years can be replaced with age intervals indicating to which cohort a given customer belongs. The same procedure may be applied to the other numerical variables in the dataset. All of these changes have one purpose: to prepare the data so that the association rules algorithm can be applied as successfully as possible. The last change worth mentioning is replacing binary variables with words (1 -> "Yes" and 0 -> "No"), simply to make those records easier for people to read.
# Removing two unidentified columns that contain unknown data:
data_raw$Z_CostContact <- NULL
data_raw$Z_Revenue <- NULL
# We may also get rid of ID numbers:
data_raw$ID <- NULL
# Assigning a customer into an age cohort based on his/her birth year:
data_raw$Year_Birth <- 2022 - data_raw$Year_Birth
data_raw$Age <- ifelse(data_raw$Year_Birth < 18, "0-18",
ifelse(data_raw$Year_Birth < 30, "18-29",
ifelse(data_raw$Year_Birth < 40, "30-39",
ifelse(data_raw$Year_Birth < 50, "40-49",
ifelse(data_raw$Year_Birth < 60, "50-59",
ifelse(data_raw$Year_Birth < 70, "60-69",
ifelse(data_raw$Year_Birth < 80, "70-79", "+80")))))))
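# A hedged, more compact alternative for the same cohorts (note that, after the
# subtraction above, Year_Birth already stores the customer's age, not the birth year)
# would be cut() with fixed breaks; shown here only as a commented-out sketch:
# data_raw$Age <- cut(data_raw$Year_Birth,
#                     breaks = c(-Inf, 18, 30, 40, 50, 60, 70, 80, Inf),
#                     labels = c("0-18", "18-29", "30-39", "40-49",
#                                "50-59", "60-69", "70-79", "+80"),
#                     right = FALSE)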
data_raw$Year_Birth <- NULL
# Changing binary variables into characters:
data_raw$AcceptedCmp1 <- ifelse(data_raw$AcceptedCmp1 == 1, "Yes", "No")
data_raw$AcceptedCmp2 <- ifelse(data_raw$AcceptedCmp2 == 1, "Yes", "No")
data_raw$AcceptedCmp3 <- ifelse(data_raw$AcceptedCmp3 == 1, "Yes", "No")
data_raw$AcceptedCmp4 <- ifelse(data_raw$AcceptedCmp4 == 1, "Yes", "No")
data_raw$AcceptedCmp5 <- ifelse(data_raw$AcceptedCmp5 == 1, "Yes", "No")
data_raw$Response <- ifelse(data_raw$Response == 1, "Yes", "No")
data_raw$Complain <- ifelse(data_raw$Complain == 1, "Yes", "No")
# Grouping clients by dates (month and year) of their enrollment with a company:
data_raw$Dt_Customer <- as.character(data_raw$Dt_Customer)
data_raw$Dt_Customer <- substr(data_raw$Dt_Customer, 4, 10)
In order to apply association rules successfully, the continuous sales data also needs to be transformed into categorical labels, for example using quartiles. The following scheme may be applied: values below the first quartile are labelled "Low", values at or above the third quartile are labelled "High", while everything in between gets the label "Medium".
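As a hedged side remark, before the explicit, variable-by-variable conversion used below, the same scheme could be wrapped in a small helper built on quantile() and cut(). The function here is only a sketch of an equivalent approach, not the code used in this work; its cut points reproduce the hard-coded quartile thresholds below up to rounding and a few hand-tuned cases (e.g. NumCatalogPurchases).
# Hypothetical helper: below Q1 -> "Low", Q1 up to Q3 -> "Medium", Q3 and above -> "High"
to_lmh <- function(x) {
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE)
  cut(x, breaks = c(-Inf, q[1], q[2], Inf),
      labels = c("Low", "Medium", "High"), right = FALSE)
}
# Example usage (would replace the corresponding ifelse() call below):
# data_raw$MntWines <- as.character(to_lmh(data_raw$MntWines))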
data_raw$Income <- ifelse(data_raw$Income < 35303, "Low",
ifelse(data_raw$Income < 68522, "Medium", "High"))
data_raw$NumDealsPurchases <- ifelse(data_raw$NumDealsPurchases < 1, "Low",
ifelse(data_raw$NumDealsPurchases < 3, "Medium", "High"))
data_raw$Recency <- ifelse(data_raw$Recency < 24, "Low",
ifelse(data_raw$Recency < 74, "Medium", "High"))
data_raw$MntWines <- ifelse(data_raw$MntWines < 24, "Low",
ifelse(data_raw$MntWines < 505, "Medium", "High"))
data_raw$MntFruits <- ifelse(data_raw$MntFruits < 2, "Low",
ifelse(data_raw$MntFruits < 33, "Medium", "High"))
data_raw$MntFishProducts <- ifelse(data_raw$MntFishProducts < 3, "Low",
ifelse(data_raw$MntFishProducts < 50, "Medium", "High"))
data_raw$MntMeatProducts <- ifelse(data_raw$MntMeatProducts < 16, "Low",
ifelse(data_raw$MntMeatProducts < 232, "Medium", "High"))
data_raw$MntSweetProducts <- ifelse(data_raw$MntSweetProducts < 1, "Low",
ifelse(data_raw$MntSweetProducts < 33, "Medium", "High"))
data_raw$MntGoldProds <- ifelse(data_raw$MntGoldProds < 9, "Low",
ifelse(data_raw$MntGoldProds < 56, "Medium", "High"))
data_raw$NumWebPurchases <- ifelse(data_raw$NumWebPurchases < 2, "Low",
ifelse(data_raw$NumWebPurchases < 6, "Medium", "High"))
data_raw$NumCatalogPurchases <- ifelse(data_raw$NumCatalogPurchases == 0, "Low",
ifelse(data_raw$NumCatalogPurchases < 4, "Medium", "High"))
data_raw$NumStorePurchases <- ifelse(data_raw$NumStorePurchases < 3, "Low",
ifelse(data_raw$NumStorePurchases < 8, "Medium", "High"))
data_raw$NumWebVisitsMonth <- ifelse(data_raw$NumWebVisitsMonth < 3, "Low",
ifelse(data_raw$NumWebVisitsMonth < 7, "Medium", "High"))
Thanks to the above changes, the dataset is ready for pattern mining. For instance, we can check whether being in the highest income group somehow influences spending on wine, and so forth. This matter is examined in detail in the next section. First, however, we need to apply final changes to the structure of the dataset so that it can be used as input for the Apriori algorithm.
data_temp <- lapply(data_raw[, 1:ncol(data_raw)], as.factor)
data <- as.data.frame(matrix(unlist(t(data_temp)), nrow(data_raw), ncol(data_raw)), stringsAsFactors = TRUE)
colnames(data) <- colnames(data_raw)
head(data)
## Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency
## 1 Graduation Single Medium 0 0 09-2012 Medium
## 2 Graduation Single Medium 1 1 03-2014 Medium
## 3 Graduation Together High 0 0 08-2013 Medium
## 4 Graduation Together Low 1 0 02-2014 Medium
## 5 PhD Married Medium 1 0 01-2014 High
## 6 Master Together Medium 0 1 09-2013 Low
## MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## 1 High High High High High
## 2 Low Low Low Low Medium
## 3 Medium High Medium High Medium
## 4 Low Medium Medium Medium Medium
## 5 Medium High Medium Medium Medium
## 6 High High Medium Low High
## MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases
## 1 High High High High
## 2 Low Medium Low Medium
## 3 Medium Medium High Medium
## 4 Low Medium Medium Low
## 5 Medium High Medium Medium
## 6 Medium Medium High High
## NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5
## 1 Medium High No No No
## 2 Low Medium No No No
## 3 High Medium No No No
## 4 Medium Medium No No No
## 5 Medium Medium No No No
## 6 High Medium No No No
## AcceptedCmp1 AcceptedCmp2 Complain Response Age
## 1 No No No Yes 60-69
## 2 No No No No 60-69
## 3 No No No No 50-59
## 4 No No No No 30-39
## 5 No No No No 40-49
## 6 No No No No 50-59
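As an optional, hedged aside, arules can also represent this kind of data explicitly as transactions. The apriori() calls below work directly on the factor data frame (the coercion happens internally), but the explicit form makes it easy to inspect item frequencies before mining.
# Optional: explicit coercion to the transactions class used by arules
trans <- as(data, "transactions")
summary(trans)
itemFrequencyPlot(trans, topN = 20)  # most frequent feature=value items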
Analysis
Now that data preparation is complete, the analysis can finally begin. The Apriori algorithm is used to find association rules that can be interpreted as connections between various characteristics or choices of customers and their consumption structure. Let's start with a basic application of the Apriori algorithm with no right-hand side (consequent) feature defined. In other words, we are looking for general patterns that might be observed in the examined dataset. These connections do not have to be important from a marketing point of view, however.
When it comes to the algorithm itself, one has to define a few parameters that control its flow. First and foremost, we need to define support and confidence. Support describes the percentage of customers' profiles in which all the features forming a rule occur together. Confidence, on the other hand, is the rule's support divided by the support of its left-hand side (antecedent), i.e. the conditional probability of observing the consequent given the antecedent. Additionally, the minimum number of features used in a rule (minlen) is set to 2.
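To make these definitions concrete, the three standard measures can be computed by hand for a single rule. The snippet below is purely illustrative; the rule {Income=High} => {MntWines=High} is a hypothetical example chosen here, not one reported by the algorithm.
# Manual support, confidence and lift for the illustrative rule
# {Income=High} => {MntWines=High}, computed on the prepared data frame:
lhs <- data$Income == "High"
rhs <- data$MntWines == "High"
n <- nrow(data)
supp <- sum(lhs & rhs) / n      # share of customers having both items
conf <- supp / (sum(lhs) / n)   # P(rhs | lhs)
lift <- conf / (sum(rhs) / n)   # confidence relative to the base rate of rhs
c(support = supp, confidence = conf, lift = lift)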
rules1 <- apriori(data, parameter = list(support = 0.85, confidence = 0.9, minlen = 2))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.9 0.1 1 none FALSE TRUE 5 0.85 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1883
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[103 item(s), 2216 transaction(s)] done [0.00s].
## sorting and recoding items ... [6 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [102 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules1
## set of 102 rules
inspect(rules1[1:15])
## lhs rhs support confidence coverage
## [1] {AcceptedCmp4=No} => {AcceptedCmp3=No} 0.8524368 0.9205653 0.9259928
## [2] {AcceptedCmp3=No} => {AcceptedCmp4=No} 0.8524368 0.9201169 0.9264440
## [3] {AcceptedCmp4=No} => {AcceptedCmp5=No} 0.8795126 0.9498051 0.9259928
## [4] {AcceptedCmp5=No} => {AcceptedCmp4=No} 0.8795126 0.9488802 0.9268953
## [5] {AcceptedCmp4=No} => {AcceptedCmp1=No} 0.8822202 0.9527290 0.9259928
## [6] {AcceptedCmp1=No} => {AcceptedCmp4=No} 0.8822202 0.9426230 0.9359206
## [7] {AcceptedCmp4=No} => {AcceptedCmp2=No} 0.9223827 0.9961014 0.9259928
## [8] {AcceptedCmp2=No} => {AcceptedCmp4=No} 0.9223827 0.9350412 0.9864621
## [9] {AcceptedCmp4=No} => {Complain=No} 0.9165162 0.9897661 0.9259928
## [10] {Complain=No} => {AcceptedCmp4=No} 0.9165162 0.9252847 0.9905235
## [11] {AcceptedCmp3=No} => {AcceptedCmp5=No} 0.8641697 0.9327813 0.9264440
## [12] {AcceptedCmp5=No} => {AcceptedCmp3=No} 0.8641697 0.9323272 0.9268953
## [13] {AcceptedCmp3=No} => {AcceptedCmp1=No} 0.8731949 0.9425231 0.9264440
## [14] {AcceptedCmp1=No} => {AcceptedCmp3=No} 0.8731949 0.9329797 0.9359206
## [15] {AcceptedCmp3=No} => {AcceptedCmp2=No} 0.9160650 0.9887969 0.9264440
## lift count
## [1] 0.9936545 1889
## [2] 0.9936545 1889
## [3] 1.0247167 1949
## [4] 1.0247167 1949
## [5] 1.0179593 1955
## [6] 1.0179593 1955
## [7] 1.0097716 2044
## [8] 1.0097716 2044
## [9] 0.9992354 2031
## [10] 0.9992354 2031
## [11] 1.0063502 1915
## [12] 1.0063502 1915
## [13] 1.0070546 1935
## [14] 1.0070546 1935
## [15] 1.0023668 2030
Let's discuss the above results for a second. An impressive number of rules was found, but the majority of them seem quite worthless from a business point of view. For instance, is it important that people who are single do not accept an invitation to participate in the second marketing campaign? Perhaps in some very exotic scenario it may be meaningful, but it seems that finding more profound rules would be better, where "better" means rules with more practical meaning. Two fairly simple solutions can be applied here. First, we may remove the columns describing customers' participation in those campaigns. Second, we may focus on finding rules that are more interesting from a business point of view. For example, we may consider factors that influence high meat consumption in our data.
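As a side note, the first solution can be approximated without re-mining at all: the trivial rules can be pruned from rules1 directly with the arules subset operators. The snippet below is only a sketch of that idea; at such a high minimum support it is likely to remove almost every rule, which in itself would confirm how dominated rules1 is by the campaign and complaint flags.
# Drop rules whose consequent is a campaign/complaint item (%pin% = partial match):
rules1_pruned <- subset(rules1, !(rhs %pin% "AcceptedCmp") & !(rhs %pin% "Complain"))
length(rules1_pruned)
if (length(rules1_pruned) > 0) inspect(head(sort(rules1_pruned, by = "lift")))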
rules2 <- apriori(data, parameter = list(support = 0.2, confidence = 0.2, minlen = 2), appearance = list(default = "lhs", rhs = "MntMeatProducts=High"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.2 0.1 1 none FALSE TRUE 5 0.2 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 443
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[103 item(s), 2216 transaction(s)] done [0.01s].
## sorting and recoding items ... [53 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
## Warning in apriori(data, parameter = list(support = 0.2, confidence = 0.2, :
## Mining stopped (maxlen reached). Only patterns up to a length of 10 returned!
## done [0.41s].
## writing ... [35 rule(s)] done [0.00s].
## creating S4 object ... done [0.02s].
rules2
## set of 35 rules
inspect(sort(rules2, by = "lift")[1:10])
## lhs rhs support confidence coverage lift count
## [1] {NumCatalogPurchases=High,
## AcceptedCmp2=No,
## Complain=No} => {MntMeatProducts=High} 0.2021661 0.6726727 0.3005415 2.676199 448
## [2] {NumCatalogPurchases=High,
## AcceptedCmp2=No} => {MntMeatProducts=High} 0.2035199 0.6721311 0.3027978 2.674044 451
## [3] {NumCatalogPurchases=High,
## Complain=No} => {MntMeatProducts=High} 0.2075812 0.6695779 0.3100181 2.663886 460
## [4] {NumCatalogPurchases=High} => {MntMeatProducts=High} 0.2089350 0.6690751 0.3122744 2.661886 463
## [5] {Kidhome=0,
## Complain=No} => {MntMeatProducts=High} 0.2346570 0.4078431 0.5753610 1.622586 520
## [6] {Kidhome=0,
## AcceptedCmp4=No,
## Complain=No} => {MntMeatProducts=High} 0.2084838 0.4077670 0.5112816 1.622283 462
## [7] {Kidhome=0,
## AcceptedCmp4=No,
## AcceptedCmp2=No,
## Complain=No} => {MntMeatProducts=High} 0.2071300 0.4072760 0.5085740 1.620329 459
## [8] {Kidhome=0} => {MntMeatProducts=High} 0.2355596 0.4068589 0.5789711 1.618670 522
## [9] {Kidhome=0,
## AcceptedCmp4=No} => {MntMeatProducts=High} 0.2093863 0.4066608 0.5148917 1.617882 464
## [10] {Kidhome=0,
## AcceptedCmp2=No,
## Complain=No} => {MntMeatProducts=High} 0.2287906 0.4065758 0.5627256 1.617544 507
plot(rules2)
The above findings are more interesting. We are able to identify a few dozen rules that point out connections between high spending on meat and other features included in the dataset. For instance, we observe that people who spend a lot of money on meat usually do not live with children. What is more, meat consumers frequently use the company's catalogue when making purchases. One has to admit, however, that the support level chosen to find these rules was not very high.
The figure above shows the 35 rules found by the algorithm on a support-confidence plane, where color indicates the value of lift. The latter quantity describes how much more often a combination of items is observed than would be expected under statistical independence. Let us now look at the same rules presented in a few different ways.
plot(rules2, method = "graph")
plot(rules2, method = "paracoord")
Finally, we may change our approach a little and discuss a customer profile based on education. As an example, let's take people who marked their educational status as "Graduation".
rules3 <- apriori(data, parameter = list(support = 0.2, confidence = 0.45), appearance =
list(default = "rhs", lhs = "Education=Graduation"))## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.45 0.1 1 none FALSE TRUE 5 0.2 1
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 443
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[103 item(s), 2216 transaction(s)] done [0.01s].
## sorting and recoding items ... [53 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [44 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
rules3
## set of 44 rules
inspect(sort(rules3[23:43], by = "lift"))
## lhs rhs support confidence
## [1] {Education=Graduation} => {MntWines=Medium} 0.2639892 0.5241935
## [2] {Education=Graduation} => {MntSweetProducts=Medium} 0.2892599 0.5743728
## [3] {Education=Graduation} => {NumWebPurchases=Medium} 0.2784296 0.5528674
## [4] {Education=Graduation} => {Teenhome=0} 0.2657942 0.5277778
## [5] {Education=Graduation} => {MntFishProducts=Medium} 0.2581227 0.5125448
## [6] {Education=Graduation} => {Response=No} 0.4350181 0.8637993
## [7] {Education=Graduation} => {Recency=Medium} 0.2563177 0.5089606
## [8] {Education=Graduation} => {NumStorePurchases=Medium} 0.3046029 0.6048387
## [9] {Education=Graduation} => {MntFruits=Medium} 0.2522563 0.5008961
## [10] {Education=Graduation} => {AcceptedCmp3=No} 0.4684116 0.9301075
## [11] {Education=Graduation} => {AcceptedCmp4=No} 0.4679603 0.9292115
## [12] {Education=Graduation} => {Kidhome=0} 0.2919675 0.5797491
## [13] {Education=Graduation} => {NumDealsPurchases=Medium} 0.3303249 0.6559140
## [14] {Education=Graduation} => {AcceptedCmp2=No} 0.4963899 0.9856631
## [15] {Education=Graduation} => {AcceptedCmp5=No} 0.4648014 0.9229391
## [16] {Education=Graduation} => {AcceptedCmp1=No} 0.4675090 0.9283154
## [17] {Education=Graduation} => {Teenhome=1} 0.2278881 0.4525090
## [18] {Education=Graduation} => {NumWebVisitsMonth=Medium} 0.2301444 0.4569892
## [19] {Education=Graduation} => {MntMeatProducts=Medium} 0.2459386 0.4883513
## [20] {Education=Graduation} => {MntGoldProds=Medium} 0.2459386 0.4883513
## [21] {Education=Graduation} => {Income=Medium} 0.2364621 0.4695341
## coverage lift count
## [1] 0.5036101 1.0502829 585
## [2] 0.5036101 1.0256326 641
## [3] 0.5036101 1.0226662 617
## [4] 0.5036101 1.0196648 589
## [5] 0.5036101 1.0186541 572
## [6] 0.5036101 1.0165583 964
## [7] 0.5036101 1.0115306 568
## [8] 0.5036101 1.0062482 675
## [9] 0.5036101 1.0054218 559
## [10] 0.5036101 1.0039543 1038
## [11] 0.5036101 1.0034759 1037
## [12] 0.5036101 1.0013437 647
## [13] 0.5036101 1.0003478 732
## [14] 0.5036101 0.9991900 1100
## [15] 0.5036101 0.9957317 1030
## [16] 0.5036101 0.9918741 1036
## [17] 0.5036101 0.9850293 505
## [18] 0.5036101 0.9793889 510
## [19] 0.5036101 0.9749427 545
## [20] 0.5036101 0.9714420 545
## [21] 0.5036101 0.9390681 524
Our management team would have to examine these results with great care in order to determine whether the discovered rules carry any valuable information about customers. Lift values suggest, however, that there is no strong connection between being a graduate and any particular consumption profile.
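One quick way to back up that impression would be to keep only the rules whose lift is clearly above 1. This is just a hedged check; given that the largest lift shown above is roughly 1.05, the filter is expected to return nothing here.
# Keep only rules at least 10% above what independence would predict:
strong_rules3 <- subset(rules3, lift > 1.1)
length(strong_rules3)
if (length(strong_rules3) > 0) inspect(sort(strong_rules3, by = "lift"))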
Summary
This work presented a fairly interesting application of association rules. The Apriori algorithm was applied to customer profiling data in order to establish a set of association rules that a marketing specialist might find helpful when analysing customers' choices and their consumption baskets. For instance, we tried to point out some characteristics that may influence the amount spent on meat or wine. Additionally, that inference was reversed and we looked for consequences (in terms of customer choices) of a given education level.