Applying Association rules on 2000 supermarket baskets
1) Project description:
The goal of this paper is to learn how to apply the Apriori association rules algorithm with R.
In this paper we will analyze a random dataset generated with the help of the website https://www.dunnhumby.com/. It contains 2000 different supermarket baskets/transactions from two different periods of time, together with additional information. For the purposes of this paper, the columns PROD_CODE (character, with the format PRD0 followed by six digits) and BASKET_ID (a 15-digit number) will be selected.
Association rules can be applied in many ways; one of the most popular applications is to shopping basket datasets, as the rules can bring valuable information to shops and answer questions such as: How should the catalog of products be designed? Where should each product be located in the shop? Which products should receive a promotion? Additionally, we can predict customer behavior.
2) Manipulation of the data:
- First we will load all the necessary packages and the data:
library(arules)
library(plyr)
library(RColorBrewer)
library(arulesViz)
library(Matrix)
library(plot3D)
library(rgl)
library(OceanView)
library(plot3Drgl)
library(reshape2)
library(treemap)
In association rules the manipulation of the data is really important: we need discrete data.
In this case we have continuous data, hence the first step will be to transform the data from continuous to discrete.
First of all we should import the two csv files, each one representing the transactions of one week:
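A sketch of this import step could look as follows (the file names are assumptions, since they are not given in the text):
week1 <- read.csv("basket_week1.csv", stringsAsFactors = FALSE)  # hypothetical file name
week2 <- read.csv("basket_week2.csv", stringsAsFactors = FALSE)  # hypothetical file name
transactions <- rbind(week1, week2)  # combine the two weeks into one data frame
head(transactions, 3)                # first rows of the combined data
dim(transactions)                    # number of rows and columns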
SHOP_WEEK SHOP_DATE SHOP_WEEKDAY SHOP_HOUR QUANTITY SPEND PROD_CODE
1 200626 20060823 4 15 1 0.30 PRD0900032
2 200626 20060821 2 13 1 0.31 PRD0900302
3 200626 20060821 2 13 3 3.45 PRD0900331
PROD_CODE_10 PROD_CODE_20 PROD_CODE_30 PROD_CODE_40 CUST_CODE
1 CL00163 DEP00055 G00016 D00003 CUST0000472158
2 CL00037 DEP00010 G00004 D00002 CUST0000890061
3 CL00163 DEP00055 G00016 D00003 CUST0000890061
CUST_PRICE_SENSITIVITY CUST_LIFESTAGE BASKET_ID BASKET_SIZE
1 MM YF 994102000458380 L
2 MM PE 994102000719678 L
3 MM PE 994102000719678 L
BASKET_PRICE_SENSITIVITY BASKET_TYPE BASKET_DOMINANT_MISSION STORE_CODE
1 MM Top Up Mixed STORE00001
2 MM Top Up Fresh STORE00001
3 MM Top Up Fresh STORE00001
STORE_FORMAT STORE_REGION
1 LS E02
2 LS E02
3 LS E02
[ reached 'max' / getOption("max.print") -- omitted 3 rows ]
[1] 12905 22
As we can see above, we have a dataset with 22 columns and 12905 rows.
SHOP_WEEK SHOP_DATE SHOP_WEEKDAY SHOP_HOUR
Min. :200626 Min. :20060821 Min. :1.0000 Min. : 8.000
1st Qu.:200626 1st Qu.:20060824 1st Qu.:2.0000 1st Qu.:13.000
Median :200627 Median :20060828 Median :4.0000 Median :15.000
QUANTITY SPEND PROD_CODE PROD_CODE_10
Min. : 1.000 Min. : 0.010 Length:12905 Length:12905
1st Qu.: 1.000 1st Qu.: 0.730 Class :character Class :character
Median : 1.000 Median : 1.210 Mode :character Mode :character
PROD_CODE_20 PROD_CODE_30 PROD_CODE_40 CUST_CODE
Length:12905 Length:12905 Length:12905 Length:12905
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
CUST_PRICE_SENSITIVITY CUST_LIFESTAGE BASKET_ID
Length:12905 Length:12905 Min. :9.941e+14
Class :character Class :character 1st Qu.:9.941e+14
Mode :character Mode :character Median :9.941e+14
BASKET_SIZE BASKET_PRICE_SENSITIVITY BASKET_TYPE
Length:12905 Length:12905 Length:12905
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
BASKET_DOMINANT_MISSION STORE_CODE STORE_FORMAT
Length:12905 Length:12905 Length:12905
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
STORE_REGION
Length:12905
Class :character
Mode :character
[ reached getOption("max.print") -- omitted 3 rows ]
As was mentioned before, for the purposes of this paper we will use just two columns. The first one, BASKET_ID, is a 15-digit number that represents a unique value for each transaction; among the many possible variables in the dataset, we are interested in the baskets. The second feature, PROD_CODE, is a character value with the format PRD0 followed by six digits and represents a unique id for each product. The dataset transactions1 was created with these two variables:
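A minimal sketch of this selection step, assuming the combined data frame is called transactions:
transactions1 <- transactions[, c("BASKET_ID", "PROD_CODE")]  # keep only the two columns of interest
head(transactions1)
dim(transactions1)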
BASKET_ID PROD_CODE
1 994102000458380 PRD0900032
2 994102000719678 PRD0900302
3 994102000719678 PRD0900331
4 994102000458380 PRD0900684
5 994102000458380 PRD0900830
6 994102000458380 PRD0900927
[1] 12905 2
The dataset transactions1 was created; it has 2 columns (features) and 12905 rows. Each row associates one basket with one product, so the same basket can appear several times (once per product bought), but the same basket/product combination is not repeated.
The above result obtained with the function dim() is not enough to answer questions like: How many baskets/transactions are in the dataset? How many products are in the dataset? But we can answer these questions with the functions length() and unique():
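For example (a sketch):
cat("In the data set transactions1 there are:",
    length(unique(transactions1$BASKET_ID)), "unique baskets/transactions\n")
cat("In the data set transactions1 there are:",
    length(unique(transactions1$PROD_CODE)), "unique products\n")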
In the data set transactions1 there are: 2000 unique baskets/transactions
In the data set transactions1 there are: 3101 unique products
At this point, it would be interesting to create a treemap in order to see the most bought products along all the baskets:
[1] 3101 2
As we can see above there are too many products (3101), and it would be difficult to analyze anything in the treemap, hence it is better to add the rule that only products with a frequency higher than 20 will be represented, as we can see below:
[1] 64 2
64 products are selected.
The sum of the frequencies should be the same as the number of rows of the transactions1 dataset:
[1] 12905
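A sketch of the frequency table and treemap (the exact code is not shown in the text; the object and column names are assumptions):
freq <- as.data.frame(table(transactions1$PROD_CODE))  # frequency of each product
colnames(freq) <- c("PROD_CODE", "freq")
dim(freq)        # 3101 products
sum(freq$freq)   # should equal the number of rows of transactions1 (12905)
freq20 <- subset(freq, freq > 20)
dim(freq20)      # 64 products remain
treemap(freq20, index = "PROD_CODE", vSize = "freq",
        title = "Most frequently bought products")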
transactions1 contained continuous data; now we will begin to transform it to discrete data with the help of the function ddply():
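A sketch of this aggregation step, collapsing all products of a basket into one comma-separated string:
transactions2 <- ddply(transactions1, c("BASKET_ID"),
                       function(df) paste(df$PROD_CODE, collapse = ","))
colnames(transactions2) <- c("Basket number", "Product codes")
head(transactions2)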
Basket number
1 994102000001422
2 994102000002105
3 994102000003300
4 994102000003606
5 994102000003857
6 994102000004284
Product codes
1 PRD0900154,PRD0901819,PRD0904263
2 PRD0900833,PRD0901648,PRD0904773
3 PRD0901131,PRD0901265,PRD0901359,PRD0901705,PRD0902051,PRD0902648
4 PRD0902008,PRD0903033,PRD0904315,PRD0904746,PRD0904896
5 PRD0900907,PRD0901329
6 PRD0902495,PRD0902640,PRD0903052,PRD0903081,PRD0904358,PRD0904478,PRD0904481
We now have a new dataset with discrete data, transactions2, and we can proceed to save it as a csv file. To do this we will use write.table() instead of write.csv(), as the latter does not allow setting col.names to FALSE, which we need in order not to have the column names in the csv file.
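A sketch of this export (the file name is an assumption):
write.table(transactions2, "transactions2.csv", sep = ",",
            row.names = FALSE, col.names = FALSE, quote = FALSE)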
Once the file is created, we can proceed to read it as transactions, creating a new discrete dataset, basket, which will be used throughout the application of the Apriori association rules algorithm.
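A sketch of reading the csv file back as a transactions object (file name as above):
basket <- read.transactions("transactions2.csv", format = "basket", sep = ",")
summary(basket)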
transactions as itemMatrix in sparse format with
2000 rows (elements/itemsets/transactions) and
4875 columns (items) and a density of 0.0013235897435897
most frequent items:
PRD0903052 PRD0903678 PRD0904358 PRD0901265 PRD0901887 (Other)
268 201 165 85 65 12121
element (itemset/transaction) length distribution:
sizes
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
354 317 253 161 125 99 94 72 65 53 57 45 35 38 27 32 26 22 25 11
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 41 46 51
8 9 9 12 10 8 8 3 6 4 4 1 2 1 1 1 1 1
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0000 2.0000 4.0000 6.4525 9.0000 51.0000
includes extended item information - examples:
labels
1 994102000001422 PRD0900154
2 994102000002105 PRD0900833
3 994102000003300 PRD0901131
There are 2000 transactions and 4875 items. The density is the total number of products purchased divided by the number of possible entries in the matrix (i.e. the proportion of non-empty cells); in our case it has the value 0.0013235897435897.
While checking the summary of our transactions, we can find valuable information:
- Firstly, we are able to calculate how many products were purchased:
The total number of products purchased is: 12904
This should be one fewer than the number of rows of the original dataset: 12905 22
Consequently, we are able to check whether we made any mistake in the process of transforming our data from continuous to discrete.
Note that the difference should be exactly one, because the original data frame (transactions) still carries the column names, whereas the csv file has no column names since col.names = FALSE was set in the write.table() function.
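A sketch of this check, using the size() function from arules (the name of the original data frame follows the sketch above):
sum(size(basket))   # total number of products purchased across all baskets
dim(transactions)   # number of rows and columns of the original data frame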
- Also we can find the most commonly bought products:
PRD0903052 - 268 units
PRD0903678 - 201 units
PRD0904358 - 165 units
- Finally, we can see that the average transaction contains 6.4525 products and the maximum number of products in one transaction is 51.
Now we can create a bar chart to represent the frequencies of the different products. In this case we display the top 20 most frequently bought products as a relative measure (it could also be absolute).
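A sketch of this plot with itemFrequencyPlot() from arules:
itemFrequencyPlot(basket, topN = 20, type = "relative",
                  col = brewer.pal(8, "Pastel2"),
                  main = "Relative item frequency (top 20)")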
The graph shows the same information as summary(): the product PRD0903052 is the most frequently bought product in the data, and we can see that it appears with a frequency of more than 12%.
The previous graph was created without fixing a support threshold, but it is possible to do so. The support is the frequency of the pattern in the data; here it has been set to 0.03, meaning that a product must occur in at least 3 out of every 100 transactions. As we can see below, only 6 items meet this condition:
If we increase the support threshold we will have fewer items; we can check this with a support of 0.05:
In this case there are only 3 products that occur in at least 5 out of every 100 transactions. It is logical that the support threshold is negatively related to the number of items: a higher support threshold means fewer products.
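A sketch of these two plots:
itemFrequencyPlot(basket, support = 0.03, col = brewer.pal(8, "Pastel2"),
                  main = "Items with support >= 0.03")   # 6 items
itemFrequencyPlot(basket, support = 0.05, col = brewer.pal(8, "Pastel2"),
                  main = "Items with support >= 0.05")   # 3 items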
We can also obtain the same result as before, but as a table, with the help of the ECLAT algorithm:
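A sketch with the eclat() function from arules (the object name is hypothetical):
freqitems <- eclat(basket, parameter = list(supp = 0.03, maxlen = 15))
inspect(freqitems)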
Eclat
parameter specification:
tidLists support minlen maxlen target ext
FALSE 0.03 1 15 frequent itemsets TRUE
algorithmic control:
sparse sort verbose
7 -2 TRUE
Absolute minimum support count: 60
create itemset ...
set transactions ...[4875 item(s), 2000 transaction(s)] done [0.01s].
sorting and recoding items ... [6 item(s)] done [0.00s].
creating sparse bit matrix ... [6 row(s), 2000 column(s)] done [0.00s].
writing ... [6 set(s)] done [0.00s].
Creating S4 object ... done [0.00s].
items support transIdenticalToItemsets count
[1] {PRD0903052} 0.1340 268 268
[2] {PRD0904358} 0.0825 165 165
[3] {PRD0903678} 0.1005 201 201
[4] {PRD0901265} 0.0425 85 85
[5] {PRD0904976} 0.0310 62 62
[6] {PRD0901887} 0.0325 65 65
Additionally, we can create different kinds of cross tables:
As we can see in the above cross table, the measure count was used, which means it tells us how many times the products occur together.
- For example:
- The product PRD0903052 occurred 268 times (on the diagonal, where a product is crossed with itself, this is simply the count of that product).
- The products PRD0903052 and PRD0903678 occurred together in 45 baskets.
In this second table the measure support was used. It is similar to the previous result, but in this case the counts are relative to the total number of baskets (2000). As we saw before, the products PRD0903052 and PRD0903678 occurred together in 45 baskets, hence in this case we have 45 divided by 2000 (second row, first column).
The support of PRD0903052 and PRD0903678 products is equal to: 0.0225
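A sketch of these cross tables with crossTable() from arules (the object names are hypothetical):
ct_count   <- crossTable(basket, measure = "count",   sort = TRUE)
ct_support <- crossTable(basket, measure = "support", sort = TRUE)
ct_count[1:5, 1:5]      # how often pairs of products occur together
ct_support[1:5, 1:5]    # the same, relative to the 2000 baskets
ct_support["PRD0903678", "PRD0903052"]   # 45 / 2000 = 0.0225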
We can apply a statistical association test, the chi-squared test. With this test we are able to check whether or not there is an association between the products.
Significance level: 5%
H0: no association between the two variables (null hypothesis)
H1: association between the two variables (alternative hypothesis)
If p-value > 0.05: do not reject H0, there is no association between the two variables.
If p-value < 0.05: reject H0, there is association between the two variables.
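A sketch of the chi-squared cross table (assuming crossTable() with measure = "chiSquared" is what produced the table of p-values discussed below):
chibasket <- crossTable(basket, measure = "chiSquared")
dim(chibasket)        # 4875 x 4875
chibasket[1:3, 1:3]   # values for the first few product pairs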
[1] 4875 4875
As per the above results, it seems that, at least for the first rows shown, the p-value is always lower than 5%, hence there is association between those products. However, this applies only to the head of the table, so it is convenient to check the remaining values as well.
As there are more than 10 million values, it is better to create a plot instead of storing the data in a list:
In the crosstable 'chibasket' there are: 11875501 unique values
As we can see in the graph above, there is no p-value greater than 5%, hence we can consider that there is association between all the products.
3) Generating Rules:
Apriori algorithm:
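A sketch of the apriori() call that produced the output below (parameters taken from that output):
arbasket <- apriori(basket,
                    parameter = list(supp = 0.0015, conf = 0.4,
                                     minlen = 1, maxlen = 10))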
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen
0.4 0.1 1 none FALSE TRUE 5 0.0015 1
maxlen target ext
10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 3
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[4875 item(s), 2000 transaction(s)] done [0.01s].
sorting and recoding items ... [1221 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.01s].
writing ... [410 rule(s)] done [0.01s].
creating S4 object ... done [0.00s].
The algorithm takes basket as the transaction object on which the mining is applied. The parameters were set to a minimum support of 0.0015 and a minimum confidence of 0.4.
As we can see above there are 410 rules, but some of them are redundant, so we will proceed to eliminate the redundant rules:
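A sketch of detecting and removing the redundant rules with is.redundant() from arules:
redundant <- is.redundant(arbasket)
sum(redundant)                      # number of redundant rules
arbasket <- arbasket[!redundant]    # keep only the non-redundant rules
arbasket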
[1] 132
We obtained 132 redundant rules, hence we proceed to delete them from our rule set.
set of 278 rules
Finally, our rule set arbasket contains 278 rules:
set of 278 rules
rule length distribution (lhs + rhs):sizes
2 3
263 15
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 2.00 2.00 2.05 2.00 3.00
summary of quality measures:
support confidence coverage lift
Min. :0.00150 Min. :0.400 Min. :0.00150 Min. : 2.99
1st Qu.:0.00150 1st Qu.:0.444 1st Qu.:0.00250 1st Qu.: 3.73
Median :0.00150 Median :0.500 Median :0.00350 Median : 4.71
Mean :0.00198 Mean :0.564 Mean :0.00377 Mean : 13.47
3rd Qu.:0.00200 3rd Qu.:0.625 3rd Qu.:0.00450 3rd Qu.: 7.93
Max. :0.00800 Max. :1.000 Max. :0.01800 Max. :181.82
count
Min. : 3.00
1st Qu.: 3.00
Median : 3.00
Mean : 3.96
3rd Qu.: 4.00
Max. :16.00
mining info:
data ntransactions support confidence
basket 2000 0.0015 0.4
We can check the summary in order to obtain more information:
- Total number of rules: 278
- Distribution of rule lengths: rules with 2 products are the most common (263), while rules with 3 products are the least common (15)
- We can also see the summary of the quality measures
- Finally, there is information about the data and parameters used for creating the rules
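The rules with the highest lift can be inspected along these lines (a sketch):
inspect(head(sort(arbasket, by = "lift"), 5))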
lhs rhs support confidence coverage lift count
[1] {PRD0904263} => {PRD0901819} 0.0015 1.00000 0.0015 181.82 3
[2] {PRD0903875} => {PRD0902924} 0.0015 0.75000 0.0020 136.36 3
[3] {PRD0903837} => {PRD0903607} 0.0015 0.75000 0.0020 125.00 3
[4] {PRD0900266} => {PRD0904489} 0.0015 0.42857 0.0035 107.14 3
[5] {PRD0900882} => {PRD0903607} 0.0015 0.60000 0.0025 100.00 3
The lift is the result of dividing the confidence of a rule by its expected confidence (the support of the right-hand side).
As we can see in the results above, we obtained large lift values. Since the values at the top are greater than 1, the occurrence of the first product (for example, PRD0904263) has a positive effect on the occurrence of the second product (PRD0901819).
We can also check the lowest lift values, in order to see whether there is any rule with a lift lower than or equal to 1, as that would have a different interpretation:
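A sketch, sorting by lift in increasing order:
inspect(head(sort(arbasket, by = "lift", decreasing = FALSE), 5))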
lhs rhs support confidence coverage lift count
[1] {PRD0901067} => {PRD0903052} 0.002 0.4 0.005 2.9851 4
[2] {PRD0901387} => {PRD0903052} 0.002 0.4 0.005 2.9851 4
[3] {PRD0902819} => {PRD0903052} 0.002 0.4 0.005 2.9851 4
[4] {PRD0904613} => {PRD0903052} 0.002 0.4 0.005 2.9851 4
[5] {PRD0904738} => {PRD0903052} 0.002 0.4 0.005 2.9851 4
There is no rule with a lift lower than or equal to 1; the minimum value obtained is 2.9851, hence the interpretation given for the top rules applies to all of them.
A customer who buys PRD0901067/PRD0901387/PRD0902819/PRD0904613/PRD0904738 is about 2.9851 times more likely to also buy PRD0903052 than a customer in general.
These results are as expected, because when we ran the chi-squared test we rejected the null hypothesis of independence between the products.
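The rules with the highest support can be listed in a similar way (presumably how the next table was obtained):
inspect(head(sort(arbasket, by = "support"), 5))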
lhs rhs support confidence coverage lift count
[1] {PRD0900302} => {PRD0903052} 0.008 0.44444 0.0180 3.3167 16
[2] {PRD0902929} => {PRD0903052} 0.007 0.45161 0.0155 3.3702 14
[3] {PRD0902728} => {PRD0903052} 0.006 0.40000 0.0150 2.9851 12
[4] {PRD0900121} => {PRD0903052} 0.005 0.43478 0.0115 3.2446 10
[5] {PRD0900173} => {PRD0903052} 0.005 0.45455 0.0110 3.3921 10
As in the previous interpretations, the product PRD0903052 is present in all of these rules. We were expecting this, as it is the most frequently bought product: we saw before that it appears in more than 12% of the 2000 baskets.
Many rules have a small support, but the confidence is always greater than or equal to 0.40.
- There are different ways to plot the results:
We can create a scatter plot with confidence and support (two dimensions) for the 126 rules with a confidence greater than 0.5:
We can do the same with confidence, lift and support (three dimensions, with lift represented by shading), again for the 126 rules with confidence greater than 0.5:
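A sketch of both scatter plots with arulesViz (the object name highconf is hypothetical):
highconf <- subset(arbasket, confidence > 0.5)   # 126 rules
plot(highconf, measure = c("support", "confidence"))
plot(highconf, measure = c("support", "confidence"), shading = "lift")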
It appears that rules with a higher lift also tend to have a higher confidence.
For the next graphs we will use a set of the 50 rules with the highest confidence (tenarbasket):
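A sketch of this selection:
tenarbasket <- head(sort(arbasket, by = "confidence"), 50)
tenarbasket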
set of 50 rules
We can create a grouped matrix plot, it is similar to the previous representations, but in this case it shows the support of the rules:
Itemsets in Antecedent (LHS)
[1] "{PRD0904263}" "{PRD0903837}"
[3] "{PRD0902424}" "{PRD0903875}"
[5] "{PRD0902953}" "{PRD0904299}"
[7] "{PRD0903273}" "{PRD0901891}"
[9] "{PRD0903540}" "{PRD0900736}"
[11] "{PRD0904620}" "{PRD0900460}"
[13] "{PRD0902883}" "{PRD0900488}"
[15] "{PRD0900595}" "{PRD0902105}"
[17] "{PRD0902374}" "{PRD0902631}"
[19] "{PRD0901508}" "{PRD0901028}"
[21] "{PRD0901746}" "{PRD0903745}"
[23] "{PRD0901343}" "{PRD0904097}"
[25] "{PRD0903197}" "{PRD0902620}"
[27] "{PRD0901086}" "{PRD0900867}"
[29] "{PRD0901244}" "{PRD0901895}"
[31] "{PRD0904638}" "{PRD0904257}"
[33] "{PRD0901192}" "{PRD0904250,PRD0904976}"
[35] "{PRD0901474}" "{PRD0904142}"
[37] "{PRD0900705}" "{PRD0901413}"
[39] "{PRD0900481}" "{PRD0902117}"
[41] "{PRD0900462}" "{PRD0903722}"
[43] "{PRD0901114}" "{PRD0903083}"
[45] "{PRD0901667}" "{PRD0900830,PRD0903074}"
[47] "{PRD0900859}" "{PRD0901265,PRD0903074}"
[49] "{PRD0903776}"
Itemsets in Consequent (RHS)
[1] "{PRD0903052}" "{PRD0903678}" "{PRD0904358}" "{PRD0901265}" "{PRD0904976}"
[6] "{PRD0901887}" "{PRD0900830}" "{PRD0901383}" "{PRD0901228}" "{PRD0902929}"
[11] "{PRD0904962}" "{PRD0902242}" "{PRD0903607}" "{PRD0902924}" "{PRD0901819}"
Apart from that, we can create a matrix with 3 dimensions:
Itemsets in Antecedent (LHS)
[1] "{PRD0904263}" "{PRD0903837}"
[3] "{PRD0902424}" "{PRD0903875}"
[5] "{PRD0902953}" "{PRD0904299}"
[7] "{PRD0903273}" "{PRD0901891}"
[9] "{PRD0903540}" "{PRD0900736}"
[11] "{PRD0904620}" "{PRD0900460}"
[13] "{PRD0902883}" "{PRD0900488}"
[15] "{PRD0900595}" "{PRD0902105}"
[17] "{PRD0902374}" "{PRD0902631}"
[19] "{PRD0901508}" "{PRD0901028}"
[21] "{PRD0901746}" "{PRD0903745}"
[23] "{PRD0901343}" "{PRD0904097}"
[25] "{PRD0903197}" "{PRD0902620}"
[27] "{PRD0901086}" "{PRD0900867}"
[29] "{PRD0901244}" "{PRD0901895}"
[31] "{PRD0904638}" "{PRD0904257}"
[33] "{PRD0901192}" "{PRD0904250,PRD0904976}"
[35] "{PRD0901474}" "{PRD0904142}"
[37] "{PRD0900705}" "{PRD0901413}"
[39] "{PRD0900481}" "{PRD0902117}"
[41] "{PRD0900462}" "{PRD0903722}"
[43] "{PRD0901114}" "{PRD0903083}"
[45] "{PRD0901667}" "{PRD0900830,PRD0903074}"
[47] "{PRD0900859}" "{PRD0901265,PRD0903074}"
[49] "{PRD0903776}"
Itemsets in Consequent (RHS)
[1] "{PRD0903052}" "{PRD0903678}" "{PRD0904358}" "{PRD0901265}" "{PRD0904976}"
[6] "{PRD0901887}" "{PRD0900830}" "{PRD0901383}" "{PRD0901228}" "{PRD0902929}"
[11] "{PRD0904962}" "{PRD0902242}" "{PRD0903607}" "{PRD0902924}" "{PRD0901819}"
We also have the possibility to create a grouped matrix, again for the 50 rules:
Additionally, we can show dependencies with a parallel coordinates plot, in this case just for the 10 rules with the highest confidence. The darkest red arrow represents the rule with the highest lift which, as we saw before, links the products PRD0904263 and PRD0901819. We can also see that many arrows connect to the product PRD0903052 in the second position:
Instead of the above representation, we can display the same rules with the scheme below, obtaining the same results:
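A sketch of these visualizations with arulesViz (the exact plot methods used for each figure are assumptions based on the descriptions above):
plot(tenarbasket, method = "matrix", measure = "support")                 # matrix plot
plot(tenarbasket, method = "matrix", measure = c("support", "lift"))      # matrix with two measures
plot(tenarbasket, method = "grouped")                                     # grouped matrix plot
plot(head(sort(arbasket, by = "confidence"), 10), method = "paracoord")   # parallel coordinates
plot(head(sort(arbasket, by = "confidence"), 10), method = "graph")       # graph representation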
Until now we were selecting rules based on the parameters lift, support and confidence, but it is also possible to check only the rules that involve one specific product. This can be helpful if we only want information for one concrete product.
In this case we will analyze the product PRD0903052:
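A sketch of this filtering step (the object name is hypothetical):
rules052 <- subset(arbasket, items %in% "PRD0903052")   # rules that contain PRD0903052
summary(rules052)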
set of 175 rules
rule length distribution (lhs + rhs):sizes
2 3
161 14
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 2.00 2.00 2.08 2.00 3.00
summary of quality measures:
support confidence coverage lift
Min. :0.00150 Min. :0.400 Min. :0.0015 Min. :2.99
1st Qu.:0.00150 1st Qu.:0.444 1st Qu.:0.0025 1st Qu.:3.32
Median :0.00200 Median :0.500 Median :0.0035 Median :3.73
Mean :0.00217 Mean :0.574 Mean :0.0041 Mean :4.28
3rd Qu.:0.00250 3rd Qu.:0.667 3rd Qu.:0.0045 3rd Qu.:4.98
Max. :0.00800 Max. :1.000 Max. :0.0180 Max. :7.46
count
Min. : 3.00
1st Qu.: 3.00
Median : 4.00
Mean : 4.34
3rd Qu.: 5.00
Max. :16.00
mining info:
data ntransactions support confidence
basket 2000 0.0015 0.4
As we can see above, the product PRD0903052 appears in 175 rules
We can be sure that there are no redundant rules here, as all of them were already removed from the rule set arbasket.
lhs rhs support confidence coverage lift count
[1] {PRD0901508} => {PRD0903052} 0.0015 1 0.0015 7.4627 3
[2] {PRD0901028} => {PRD0903052} 0.0015 1 0.0015 7.4627 3
[3] {PRD0901746} => {PRD0903052} 0.0015 1 0.0015 7.4627 3
[4] {PRD0903745} => {PRD0903052} 0.0015 1 0.0015 7.4627 3
[5] {PRD0901343} => {PRD0903052} 0.0015 1 0.0015 7.4627 3
[6] {PRD0904097} => {PRD0903052} 0.0015 1 0.0015 7.4627 3
[7] {PRD0903197} => {PRD0903052} 0.0015 1 0.0015 7.4627 3
[8] {PRD0902620} => {PRD0903052} 0.0015 1 0.0015 7.4627 3
[9] {PRD0901086} => {PRD0903052} 0.0015 1 0.0015 7.4627 3
[ reached 'max' / getOption("max.print") -- omitted 11 rows ]
We can represent the 175 rules obtained for the product: PRD0903052:
As we can see, the rules with lower support tend to have larger confidence and larger lift.
4) Conclusions:
In comparison with other unsupervised machine learning methods, in association rules the first step of data preparation is really important, and it can be very time consuming. Most of the time we have continuous data, and in order to apply the algorithm the data must first be converted to discrete form.
In large datasets it is very difficult to interpret the tables or matrices directly, as they often contain millions of values. For this reason, graphical analysis is really important for interpreting the results in this case.