Association rules for market segmentation

Introduction

Understanding the needs of customers is rather a crucial priority for the majority of modern companies. In particular, for those that are big enough to accumulate a sufficient amount of capital to afford to conduct specialized marketing actions. In some of them, there is even a special team of people responsible for that aspect of a firm’s activity. One of a few available strategies that those employees may use is to divide customers into groups based on their social-economic background, e.g. their sex, income, and so forth. The purpose of such actions is to maximize profit by addressing various types of customers differently. While market segmentation may be carried out using clustering, another possible approach - a little less obvious perhaps - is to apply association rules. As a result, identifying patterns in data and, thus, formulating criteria (characteristics) useful in associating new customers with some specified profiles might be conducted. Thanks to that more efficient marketing actions may be prepared and introduced in order to increase spendings of those customers.

In this work, the Apriori algorithm is used in order to find rules that may turn out to be important when distinguishing various groups of customers. The chosen dataset consists of 2,240 observations. It is available on Kaggle platform under the following link. The dataset stores information about the given number of customers, their socio-economic characteristics as well as a general overview of transactions they participated in. More details are presented in the next section.

library("arules")
library("arulesViz")

Dataset details

As said above, the dataset consists of 2,240 observations. However, some rows contain NA. In this work, they were decided to be removed.

data_raw <- read.csv("marketing_campaign.csv", sep = "\t")
data_raw <- na.omit(data_raw)
colnames(data_raw)
##  [1] "ID"                  "Year_Birth"          "Education"          
##  [4] "Marital_Status"      "Income"              "Kidhome"            
##  [7] "Teenhome"            "Dt_Customer"         "Recency"            
## [10] "MntWines"            "MntFruits"           "MntMeatProducts"    
## [13] "MntFishProducts"     "MntSweetProducts"    "MntGoldProds"       
## [16] "NumDealsPurchases"   "NumWebPurchases"     "NumCatalogPurchases"
## [19] "NumStorePurchases"   "NumWebVisitsMonth"   "AcceptedCmp3"       
## [22] "AcceptedCmp4"        "AcceptedCmp5"        "AcceptedCmp1"       
## [25] "AcceptedCmp2"        "Complain"            "Z_CostContact"      
## [28] "Z_Revenue"           "Response"

Information hidden in data might be divided into four groups. Firstly, personal characteristics of customers included in the dataset, such as:

  • birth year (Year_Birth)
  • level of education (Education)
  • marital status (Marital_Status)
  • income (Income)
  • number of children in customer’s household (Kidhome)
  • number of teenegers in customer’s household (Teenhome)
  • date of customer’s enrollment with the company (Dt_Customer)
  • number of days since customer’s last purchase (Recency)
  • whether a customer made a complain in the last two years (binary variable) are presented (Complain).

head(data_raw[, c(2:9, 26)])
##   Year_Birth  Education Marital_Status Income Kidhome Teenhome Dt_Customer
## 1       1957 Graduation         Single  58138       0        0  04-09-2012
## 2       1954 Graduation         Single  46344       1        1  08-03-2014
## 3       1965 Graduation       Together  71613       0        0  21-08-2013
## 4       1984 Graduation       Together  26646       1        0  10-02-2014
## 5       1981        PhD        Married  58293       1        0  19-01-2014
## 6       1967     Master       Together  62513       0        1  09-09-2013
##   Recency Complain
## 1      58        0
## 2      38        0
## 3      26        0
## 4      26        0
## 5      94        0
## 6      16        0
str(data_raw[, c(2:9, 26)])
## 'data.frame':    2216 obs. of  9 variables:
##  $ Year_Birth    : int  1957 1954 1965 1984 1981 1967 1971 1985 1974 1950 ...
##  $ Education     : chr  "Graduation" "Graduation" "Graduation" "Graduation" ...
##  $ Marital_Status: chr  "Single" "Single" "Together" "Together" ...
##  $ Income        : int  58138 46344 71613 26646 58293 62513 55635 33454 30351 5648 ...
##  $ Kidhome       : int  0 1 0 1 1 0 0 1 1 1 ...
##  $ Teenhome      : int  0 1 0 0 0 1 1 0 0 1 ...
##  $ Dt_Customer   : chr  "04-09-2012" "08-03-2014" "21-08-2013" "10-02-2014" ...
##  $ Recency       : int  58 38 26 26 94 16 34 32 19 68 ...
##  $ Complain      : int  0 0 0 0 0 0 0 0 0 0 ...
summary(data_raw$Income)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1730   35303   51382   52247   68522  666666
summary(data_raw$Recency)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00   24.00   49.00   49.01   74.00   99.00

The second part is focused on the amount of money spent in the last two years on the following group of products:

  • spent on wine (MntWines)
  • spent on fruit (MntFruits)
  • spent on meat (MntMeatProducts)
  • spent on fish (_MntFishProducts)
  • spent on sweets (MntSweetProducts)
  • spent on gold (MntGoldProds).

head(data_raw[, c(10:15)])
##   MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## 1      635        88             546             172               88
## 2       11         1               6               2                1
## 3      426        49             127             111               21
## 4       11         4              20              10                3
## 5      173        43             118              46               27
## 6      520        42              98               0               42
##   MntGoldProds
## 1           88
## 2            6
## 3           42
## 4            5
## 5           15
## 6           14
str(data_raw[, c(10:15)])
## 'data.frame':    2216 obs. of  6 variables:
##  $ MntWines        : int  635 11 426 11 173 520 235 76 14 28 ...
##  $ MntFruits       : int  88 1 49 4 43 42 65 10 0 0 ...
##  $ MntMeatProducts : int  546 6 127 20 118 98 164 56 24 6 ...
##  $ MntFishProducts : int  172 2 111 10 46 0 50 3 3 1 ...
##  $ MntSweetProducts: int  88 1 21 3 27 42 49 1 3 1 ...
##  $ MntGoldProds    : int  88 6 42 5 15 14 27 23 2 13 ...
summary(data_raw[, c(10:15)])
##     MntWines        MntFruits      MntMeatProducts  MntFishProducts 
##  Min.   :   0.0   Min.   :  0.00   Min.   :   0.0   Min.   :  0.00  
##  1st Qu.:  24.0   1st Qu.:  2.00   1st Qu.:  16.0   1st Qu.:  3.00  
##  Median : 174.5   Median :  8.00   Median :  68.0   Median : 12.00  
##  Mean   : 305.1   Mean   : 26.36   Mean   : 167.0   Mean   : 37.64  
##  3rd Qu.: 505.0   3rd Qu.: 33.00   3rd Qu.: 232.2   3rd Qu.: 50.00  
##  Max.   :1493.0   Max.   :199.00   Max.   :1725.0   Max.   :259.00  
##  MntSweetProducts  MntGoldProds   
##  Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:  1.00   1st Qu.:  9.00  
##  Median :  8.00   Median : 24.50  
##  Mean   : 27.03   Mean   : 43.97  
##  3rd Qu.: 33.00   3rd Qu.: 56.00  
##  Max.   :262.00   Max.   :321.00

Another set of columns is devoted to describing the participation of a given customer in marketing campaigns organised by a company in order to sell its product at discount.

  • Number of purchased with dicount (NumDealsPurchases)
  • whether customer participated in 1st campaign (binary variable AcceptedCmp1)
  • whether customer participated in 2nd campaign (binary variable AcceptedCmp2)
  • whether customer participated in 3rd campaign (binary variable AcceptedCmp3)
  • whether customer participated in 4th campaign (binary variable AcceptedCmp4)
  • whether customer participated in 5th campaign (binary variable AcceptedCmp5)
  • whether customer participated in the last (6th) campaign (binary variable Response).

head(data_raw[, c(16, 24, 25, 21:23, 29)])
##   NumDealsPurchases AcceptedCmp1 AcceptedCmp2 AcceptedCmp3 AcceptedCmp4
## 1                 3            0            0            0            0
## 2                 2            0            0            0            0
## 3                 1            0            0            0            0
## 4                 2            0            0            0            0
## 5                 5            0            0            0            0
## 6                 2            0            0            0            0
##   AcceptedCmp5 Response
## 1            0        1
## 2            0        0
## 3            0        0
## 4            0        0
## 5            0        0
## 6            0        0
str(data_raw[, c(16, 24, 25, 21:23, 29)])
## 'data.frame':    2216 obs. of  7 variables:
##  $ NumDealsPurchases: int  3 2 1 2 5 2 4 2 1 1 ...
##  $ AcceptedCmp1     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AcceptedCmp2     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AcceptedCmp3     : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ AcceptedCmp4     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AcceptedCmp5     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Response         : int  1 0 0 0 0 0 0 0 1 0 ...
summary(data_raw$NumDealsPurchases)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   1.000   2.000   2.324   3.000  15.000

The last group consists of four columns that inform about channels that are used by a given customer to buy a firm’s product and customer online activity.

  • Number of purchases made through a website (NumWebPurchases)
  • number of purchases made using a catalogue (NumCatalogPurchases)
  • number of purchase made directly in stores (NumStorePurchases)
  • number of visits to a website in the last mouth (NumWebVisitsMonth.

head(data_raw[, c(17:20)])
##   NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
## 1               8                  10                 4                 7
## 2               1                   1                 2                 5
## 3               8                   2                10                 4
## 4               2                   0                 4                 6
## 5               5                   3                 6                 5
## 6               6                   4                10                 6
str(data_raw[, c(17:20)])
## 'data.frame':    2216 obs. of  4 variables:
##  $ NumWebPurchases    : int  8 1 8 2 5 6 7 4 3 1 ...
##  $ NumCatalogPurchases: int  10 1 2 0 3 4 3 0 0 0 ...
##  $ NumStorePurchases  : int  4 2 10 4 6 10 7 4 2 0 ...
##  $ NumWebVisitsMonth  : int  7 5 4 6 5 6 6 8 9 20 ...
summary(data_raw[, c(17:20)])
##  NumWebPurchases  NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
##  Min.   : 0.000   Min.   : 0.000      Min.   : 0.000    Min.   : 0.000   
##  1st Qu.: 2.000   1st Qu.: 0.000      1st Qu.: 3.000    1st Qu.: 3.000   
##  Median : 4.000   Median : 2.000      Median : 5.000    Median : 6.000   
##  Mean   : 4.085   Mean   : 2.671      Mean   : 5.801    Mean   : 5.319   
##  3rd Qu.: 6.000   3rd Qu.: 4.000      3rd Qu.: 8.000    3rd Qu.: 7.000   
##  Max.   :27.000   Max.   :28.000      Max.   :13.000    Max.   :20.000

Data preparation

In order to apply the algorithm devoted to finding association rules in data, the latter one has to be transformed a little. Firstly, all continuous variables, e.g. customer’s income or amount of money spent on products, should be changed into categorical data. For instance, customers might be assigned to quantiles based on their income instead of leaving the initial value. Similarly, birth years could be replaced with age intervals indicating to which cohort a given customer should be assigned. The same procedure may be applied to other numerical variables existing in the dataset. All of these changes have one purpose: prepare data for the association rules algorithm to be applied as successfully as possible. The last change worth mentioning is replacing binary variables with words (1 -> “Yes” and 0 -> “No”) just to make it easier for people to understand those records in the set.

# Removing two unidentified columns that contain unknown data:
data_raw$Z_CostContact <- NULL
data_raw$Z_Revenue <- NULL

# We may also get rid of ID numbers:
data_raw$ID <- NULL
# Assigning a customer into an age cohort based on his/her birth year:
data_raw$Year_Birth <- 2022 - data_raw$Year_Birth
data_raw$Age <- ifelse(data_raw$Year_Birth < 18, "0-18",
                       ifelse(data_raw$Year_Birth < 30, "18-29",
                       ifelse(data_raw$Year_Birth < 40, "30-39",
                       ifelse(data_raw$Year_Birth < 50, "40-49",
                       ifelse(data_raw$Year_Birth < 60, "50-59",
                       ifelse(data_raw$Year_Birth < 70, "60-69",
                       ifelse(data_raw$Year_Birth < 80, "70-79", "+80")))))))
data_raw$Year_Birth <- NULL
# Changing binary variables into characters:
data_raw$AcceptedCmp1 <- ifelse(data_raw$AcceptedCmp1 == 1, "Yes", "No") 
data_raw$AcceptedCmp2 <- ifelse(data_raw$AcceptedCmp2 == 1, "Yes", "No")
data_raw$AcceptedCmp3 <- ifelse(data_raw$AcceptedCmp3 == 1, "Yes", "No")
data_raw$AcceptedCmp4 <- ifelse(data_raw$AcceptedCmp4 == 1, "Yes", "No")
data_raw$AcceptedCmp5 <- ifelse(data_raw$AcceptedCmp5 == 1, "Yes", "No")
data_raw$Response <- ifelse(data_raw$Response == 1, "Yes", "No")
data_raw$Complain <- ifelse(data_raw$Complain == 1, "Yes", "No")
# Grouping clients by dates (month and year) of their enrollment with a company:
data_raw$Dt_Customer <- as.character(data_raw$Dt_Customer)
data_raw$Dt_Customer <- substr(data_raw$Dt_Customer, 4, 10)

In order to apply association rules successfully, there is a need to transform continuous sales data into categorical labels. For example, using quantiles. The following scheme may be applied. Values smaller than first quantiles are labelled as “Low”, values higher than the third quantile are called “High”, while everything that is between those quantiles gets the label of “Medium”.

data_raw$Income <- ifelse(data_raw$Income < 35303, "Low",
                   ifelse(data_raw$Income < 68522, "Medium", "High"))
data_raw$NumDealsPurchases <- ifelse(data_raw$NumDealsPurchases < 1, "Low",
                              ifelse(data_raw$NumDealsPurchases < 3, "Medium", "High"))
data_raw$Recency <- ifelse(data_raw$Recency < 24, "Low",
                    ifelse(data_raw$Recency < 74, "Medium", "High"))

data_raw$MntWines <- ifelse(data_raw$MntWines < 24, "Low",
                     ifelse(data_raw$MntWines < 505, "Medium", "High"))
data_raw$MntFruits <- ifelse(data_raw$MntFruits < 2, "Low",
                      ifelse(data_raw$MntFruits < 33, "Medium", "High"))
data_raw$MntFishProducts <- ifelse(data_raw$MntFishProducts < 3, "Low",
                            ifelse(data_raw$MntFishProducts < 50, "Medium", "High"))
data_raw$MntMeatProducts <- ifelse(data_raw$MntMeatProducts < 16, "Low",
                            ifelse(data_raw$MntMeatProducts < 232, "Medium", "High"))
data_raw$MntSweetProducts <- ifelse(data_raw$MntSweetProducts < 1, "Low",
                             ifelse(data_raw$MntSweetProducts < 33, "Medium", "High"))
data_raw$MntGoldProds <- ifelse(data_raw$MntGoldProds < 9, "Low",
                         ifelse(data_raw$MntGoldProds < 56, "Medium", "High"))

data_raw$NumWebPurchases <- ifelse(data_raw$NumWebPurchases < 2, "Low",
                            ifelse(data_raw$NumWebPurchases < 6, "Medium", "High"))
data_raw$NumCatalogPurchases <- ifelse(data_raw$NumCatalogPurchases == 0, "Low",
                                ifelse(data_raw$NumCatalogPurchases < 4, "Medium", "High"))
data_raw$NumStorePurchases <- ifelse(data_raw$NumStorePurchases < 3, "Low",
                              ifelse(data_raw$NumStorePurchases < 8, "Medium", "High"))
data_raw$NumWebVisitsMonth <- ifelse(data_raw$NumWebVisitsMonth < 3, "Low",
                              ifelse(data_raw$NumWebVisitsMonth < 7, "Medium", "High"))

Thanks to the above changes, the dataset is ready for some patterns to be found. For instance, being in the highest income group can somehow influence spending on wine. And so forth. This matter is examined in detail in the next section. Firstly, however, we need to apply final changes in the structure of the dataset so it could be used as input for Apriori algorithm.

data_temp <- lapply(data_raw[, 1:ncol(data_raw)], as.factor)
data <- as.data.frame(matrix(unlist(t(data_temp)), nrow(data_raw), ncol(data_raw)), stringsAsFactors = TRUE)
colnames(data) <- colnames(data_raw)
head(data)
##    Education Marital_Status Income Kidhome Teenhome Dt_Customer Recency
## 1 Graduation         Single Medium       0        0     09-2012  Medium
## 2 Graduation         Single Medium       1        1     03-2014  Medium
## 3 Graduation       Together   High       0        0     08-2013  Medium
## 4 Graduation       Together    Low       1        0     02-2014  Medium
## 5        PhD        Married Medium       1        0     01-2014    High
## 6     Master       Together Medium       0        1     09-2013     Low
##   MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## 1     High      High            High            High             High
## 2      Low       Low             Low             Low           Medium
## 3   Medium      High          Medium            High           Medium
## 4      Low    Medium          Medium          Medium           Medium
## 5   Medium      High          Medium          Medium           Medium
## 6     High      High          Medium             Low             High
##   MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases
## 1         High              High            High                High
## 2          Low            Medium             Low              Medium
## 3       Medium            Medium            High              Medium
## 4          Low            Medium          Medium                 Low
## 5       Medium              High          Medium              Medium
## 6       Medium            Medium            High                High
##   NumStorePurchases NumWebVisitsMonth AcceptedCmp3 AcceptedCmp4 AcceptedCmp5
## 1            Medium              High           No           No           No
## 2               Low            Medium           No           No           No
## 3              High            Medium           No           No           No
## 4            Medium            Medium           No           No           No
## 5            Medium            Medium           No           No           No
## 6              High            Medium           No           No           No
##   AcceptedCmp1 AcceptedCmp2 Complain Response   Age
## 1           No           No       No      Yes 60-69
## 2           No           No       No       No 60-69
## 3           No           No       No       No 50-59
## 4           No           No       No       No 30-39
## 5           No           No       No       No 40-49
## 6           No           No       No       No 50-59

Analysis

Since data preparation is complete, the analysis might be finally started. Apriori algorithm is used to find association rules that could be interpreted as some kind of connections between various characteristics or choices of customers and their consumption structure. Let’s start with a basic application of the Apriori algorithm with no defined right-hand side (consequent) feature. In other words, we are looking for general patterns that might be observed in the examined dataset. The connections do not have to be important from a marketing point of view, however.

When it comes to the algorithm itself, one has to define a few parameters that control its flow. First and foremost, we need to define support and confidence. The first quantity describes in what percentage of customers’ profiles features co-creating a rule exists. Confidence, on the other hand, is support divided by the probability of observing one or more features that are included in a given rule. Additionally, the smallest number of features to be used in a rule is set to 2.

rules1 <- apriori(data, parameter = list(support = 0.85, confidence = 0.9, minlen = 2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.9    0.1    1 none FALSE            TRUE       5    0.85      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 1883 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[103 item(s), 2216 transaction(s)] done [0.00s].
## sorting and recoding items ... [6 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [102 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
rules1
## set of 102 rules
inspect(rules1[1:15])
##      lhs                  rhs               support   confidence coverage 
## [1]  {AcceptedCmp4=No} => {AcceptedCmp3=No} 0.8524368 0.9205653  0.9259928
## [2]  {AcceptedCmp3=No} => {AcceptedCmp4=No} 0.8524368 0.9201169  0.9264440
## [3]  {AcceptedCmp4=No} => {AcceptedCmp5=No} 0.8795126 0.9498051  0.9259928
## [4]  {AcceptedCmp5=No} => {AcceptedCmp4=No} 0.8795126 0.9488802  0.9268953
## [5]  {AcceptedCmp4=No} => {AcceptedCmp1=No} 0.8822202 0.9527290  0.9259928
## [6]  {AcceptedCmp1=No} => {AcceptedCmp4=No} 0.8822202 0.9426230  0.9359206
## [7]  {AcceptedCmp4=No} => {AcceptedCmp2=No} 0.9223827 0.9961014  0.9259928
## [8]  {AcceptedCmp2=No} => {AcceptedCmp4=No} 0.9223827 0.9350412  0.9864621
## [9]  {AcceptedCmp4=No} => {Complain=No}     0.9165162 0.9897661  0.9259928
## [10] {Complain=No}     => {AcceptedCmp4=No} 0.9165162 0.9252847  0.9905235
## [11] {AcceptedCmp3=No} => {AcceptedCmp5=No} 0.8641697 0.9327813  0.9264440
## [12] {AcceptedCmp5=No} => {AcceptedCmp3=No} 0.8641697 0.9323272  0.9268953
## [13] {AcceptedCmp3=No} => {AcceptedCmp1=No} 0.8731949 0.9425231  0.9264440
## [14] {AcceptedCmp1=No} => {AcceptedCmp3=No} 0.8731949 0.9329797  0.9359206
## [15] {AcceptedCmp3=No} => {AcceptedCmp2=No} 0.9160650 0.9887969  0.9264440
##      lift      count
## [1]  0.9936545 1889 
## [2]  0.9936545 1889 
## [3]  1.0247167 1949 
## [4]  1.0247167 1949 
## [5]  1.0179593 1955 
## [6]  1.0179593 1955 
## [7]  1.0097716 2044 
## [8]  1.0097716 2044 
## [9]  0.9992354 2031 
## [10] 0.9992354 2031 
## [11] 1.0063502 1915 
## [12] 1.0063502 1915 
## [13] 1.0070546 1935 
## [14] 1.0070546 1935 
## [15] 1.0023668 2030

Let’s discuss the above results for a second. An impressive number of possible rules was found but the majority of them seems to be quite worthless, from a business point of view. For instance, is it important that people who are single do not accept an invitation to participate in the second marketing campaign? Perhaps in some very exotic scenario, it may be meaningful but it seems that finding more profound rules would be better. And the word better might be understood as rules with more practical meaning. Two fairly simple solution can be applied here. First, we may remove columns that contain data describing customers’ participation in those campaigns. Second, one may focus on finding more interesting (from a business point of view) rules. For example, we may consider factors that influence high meat consumption in our data.

rules2 <- apriori(data, parameter = list(support = 0.2, confidence = 0.2, minlen = 2), appearance = list(default = "lhs", rhs = "MntMeatProducts=High"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.2    0.1    1 none FALSE            TRUE       5     0.2      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 443 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[103 item(s), 2216 transaction(s)] done [0.01s].
## sorting and recoding items ... [53 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 7 8 9 10
## Warning in apriori(data, parameter = list(support = 0.2, confidence = 0.2, :
## Mining stopped (maxlen reached). Only patterns up to a length of 10 returned!
##  done [0.41s].
## writing ... [35 rule(s)] done [0.00s].
## creating S4 object  ... done [0.02s].
rules2
## set of 35 rules
inspect(sort(rules2, by = "lift")[1:10])
##      lhs                            rhs                      support confidence  coverage     lift count
## [1]  {NumCatalogPurchases=High,                                                                         
##       AcceptedCmp2=No,                                                                                  
##       Complain=No}               => {MntMeatProducts=High} 0.2021661  0.6726727 0.3005415 2.676199   448
## [2]  {NumCatalogPurchases=High,                                                                         
##       AcceptedCmp2=No}           => {MntMeatProducts=High} 0.2035199  0.6721311 0.3027978 2.674044   451
## [3]  {NumCatalogPurchases=High,                                                                         
##       Complain=No}               => {MntMeatProducts=High} 0.2075812  0.6695779 0.3100181 2.663886   460
## [4]  {NumCatalogPurchases=High}  => {MntMeatProducts=High} 0.2089350  0.6690751 0.3122744 2.661886   463
## [5]  {Kidhome=0,                                                                                        
##       Complain=No}               => {MntMeatProducts=High} 0.2346570  0.4078431 0.5753610 1.622586   520
## [6]  {Kidhome=0,                                                                                        
##       AcceptedCmp4=No,                                                                                  
##       Complain=No}               => {MntMeatProducts=High} 0.2084838  0.4077670 0.5112816 1.622283   462
## [7]  {Kidhome=0,                                                                                        
##       AcceptedCmp4=No,                                                                                  
##       AcceptedCmp2=No,                                                                                  
##       Complain=No}               => {MntMeatProducts=High} 0.2071300  0.4072760 0.5085740 1.620329   459
## [8]  {Kidhome=0}                 => {MntMeatProducts=High} 0.2355596  0.4068589 0.5789711 1.618670   522
## [9]  {Kidhome=0,                                                                                        
##       AcceptedCmp4=No}           => {MntMeatProducts=High} 0.2093863  0.4066608 0.5148917 1.617882   464
## [10] {Kidhome=0,                                                                                        
##       AcceptedCmp2=No,                                                                                  
##       Complain=No}               => {MntMeatProducts=High} 0.2287906  0.4065758 0.5627256 1.617544   507
plot(rules2)

The above findings are more interesting. We are able to identify a few dozens of rules that point out connections between high spendings on meat and other features included in the dataset. For instance, we observe that people who spend a high amount of money on meat usually do not live with children. What is more, meat consumers use a company catalogue frequently during their purchase process. One has to admit, however, that the support level chose to find these rule was not very high.

The presented figure shows 35 rules that were found by the algorithm on a support-confidence plane where color indicates a value of lift. The latter quantity describes the rate at which some combinations are observed compared with a scenario of statistical independence. Let us now watch the same graph presented in a few different ways.

plot(rules2, method = "graph")

plot(rules2, method = "paracoord")

Finally, we may change our approach a little. Now let’s discuss a customer profile based on her/his education. As an example people who marked their educational status as “Graduation”.

rules3 <- apriori(data, parameter = list(support = 0.2, confidence = 0.45), appearance =
          list(default = "rhs", lhs = "Education=Graduation"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.45    0.1    1 none FALSE            TRUE       5     0.2      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 443 
## 
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[103 item(s), 2216 transaction(s)] done [0.01s].
## sorting and recoding items ... [53 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [44 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
rules3
## set of 44 rules
inspect(sort(rules3[23:43], by = "lift"))
##      lhs                       rhs                        support   confidence
## [1]  {Education=Graduation} => {MntWines=Medium}          0.2639892 0.5241935 
## [2]  {Education=Graduation} => {MntSweetProducts=Medium}  0.2892599 0.5743728 
## [3]  {Education=Graduation} => {NumWebPurchases=Medium}   0.2784296 0.5528674 
## [4]  {Education=Graduation} => {Teenhome=0}               0.2657942 0.5277778 
## [5]  {Education=Graduation} => {MntFishProducts=Medium}   0.2581227 0.5125448 
## [6]  {Education=Graduation} => {Response=No}              0.4350181 0.8637993 
## [7]  {Education=Graduation} => {Recency=Medium}           0.2563177 0.5089606 
## [8]  {Education=Graduation} => {NumStorePurchases=Medium} 0.3046029 0.6048387 
## [9]  {Education=Graduation} => {MntFruits=Medium}         0.2522563 0.5008961 
## [10] {Education=Graduation} => {AcceptedCmp3=No}          0.4684116 0.9301075 
## [11] {Education=Graduation} => {AcceptedCmp4=No}          0.4679603 0.9292115 
## [12] {Education=Graduation} => {Kidhome=0}                0.2919675 0.5797491 
## [13] {Education=Graduation} => {NumDealsPurchases=Medium} 0.3303249 0.6559140 
## [14] {Education=Graduation} => {AcceptedCmp2=No}          0.4963899 0.9856631 
## [15] {Education=Graduation} => {AcceptedCmp5=No}          0.4648014 0.9229391 
## [16] {Education=Graduation} => {AcceptedCmp1=No}          0.4675090 0.9283154 
## [17] {Education=Graduation} => {Teenhome=1}               0.2278881 0.4525090 
## [18] {Education=Graduation} => {NumWebVisitsMonth=Medium} 0.2301444 0.4569892 
## [19] {Education=Graduation} => {MntMeatProducts=Medium}   0.2459386 0.4883513 
## [20] {Education=Graduation} => {MntGoldProds=Medium}      0.2459386 0.4883513 
## [21] {Education=Graduation} => {Income=Medium}            0.2364621 0.4695341 
##      coverage  lift      count
## [1]  0.5036101 1.0502829  585 
## [2]  0.5036101 1.0256326  641 
## [3]  0.5036101 1.0226662  617 
## [4]  0.5036101 1.0196648  589 
## [5]  0.5036101 1.0186541  572 
## [6]  0.5036101 1.0165583  964 
## [7]  0.5036101 1.0115306  568 
## [8]  0.5036101 1.0062482  675 
## [9]  0.5036101 1.0054218  559 
## [10] 0.5036101 1.0039543 1038 
## [11] 0.5036101 1.0034759 1037 
## [12] 0.5036101 1.0013437  647 
## [13] 0.5036101 1.0003478  732 
## [14] 0.5036101 0.9991900 1100 
## [15] 0.5036101 0.9957317 1030 
## [16] 0.5036101 0.9918741 1036 
## [17] 0.5036101 0.9850293  505 
## [18] 0.5036101 0.9793889  510 
## [19] 0.5036101 0.9749427  545 
## [20] 0.5036101 0.9714420  545 
## [21] 0.5036101 0.9390681  524

Our management team would have to examine these results with great care in order to find whether some valuable information about customers based on the rules that were discovered. Lift values suggest, however, that there is no strong connection between being a graduate and consumption profile.

Summary

A fairly interesting application of association rules is here presented. The Apriori algorithm was applied to customer profiling data in order to establish a bunch of association rules that might be found helpful by a potential marketing specialist to analyse customers’ choices and their consumption basket. For instance, we tried to point out some characteristics that may influence the amount of spending on meat or wine. Additionally, that inference was reversed and we looked for consequences (in terms of customer choices) of a given education level.