Applying Association rules on 2000 supermarket baskets
1) Project description:
The goal of this paper is to learn how to apply the Apriori association rules algorithm with R.
In this paper we will analyze a random dataset generated with the help of the website https://www.dunnhumby.com/. It contains 2000 different supermarket baskets/transactions from two different periods of time, together with additional information. For the purposes of this paper, the columns PROD_CODE (character, with the format PRD0 followed by six digits) and BASKET_ID (a 15-digit number) will be selected.
Association rules can be applied in many ways; one of the most popular applications is to shopping basket datasets, as the rules can bring valuable information to shops and answer questions such as: How should the catalog of products be designed? Where should each product be located in the shop? Which products should receive a promotion? Additionally, we can predict customer behavior.
2) Manipulation of the data:
- First we will load all the necessary packages and the data:
library(arules)
library(plyr)
library(RColorBrewer)
library(arulesViz)
library(Matrix)
library(plot3D)
library(rgl)
library(OceanView)
library(plot3Drgl)
library(reshape2)
library(treemap)
In association rules the manipulation of the data is really important: we need discrete data.
In this case we have continuous data, hence the first step will be to transform the data from continuous to discrete.
First of all we should import the two csv files, each one representing the transactions of one week:
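A sketch of this import step could look as follows (the file names are assumptions, since they are not given in the text):
week1 <- read.csv("basket_week1.csv", stringsAsFactors = FALSE)  # hypothetical file name
week2 <- read.csv("basket_week2.csv", stringsAsFactors = FALSE)  # hypothetical file name
transactions <- rbind(week1, week2)  # combine the two weeks into one data frame
head(transactions, 3)                # first rows of the combined data
dim(transactions)                    # number of rows and columns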
SHOP_WEEK SHOP_DATE SHOP_WEEKDAY SHOP_HOUR QUANTITY SPEND PROD_CODE
1 200626 20060823 4 15 1 0.30 PRD0900032
2 200626 20060821 2 13 1 0.31 PRD0900302
3 200626 20060821 2 13 3 3.45 PRD0900331
PROD_CODE_10 PROD_CODE_20 PROD_CODE_30 PROD_CODE_40 CUST_CODE
1 CL00163 DEP00055 G00016 D00003 CUST0000472158
2 CL00037 DEP00010 G00004 D00002 CUST0000890061
3 CL00163 DEP00055 G00016 D00003 CUST0000890061
CUST_PRICE_SENSITIVITY CUST_LIFESTAGE BASKET_ID BASKET_SIZE
1 MM YF 994102000458380 L
2 MM PE 994102000719678 L
3 MM PE 994102000719678 L
BASKET_PRICE_SENSITIVITY BASKET_TYPE BASKET_DOMINANT_MISSION STORE_CODE
1 MM Top Up Mixed STORE00001
2 MM Top Up Fresh STORE00001
3 MM Top Up Fresh STORE00001
STORE_FORMAT STORE_REGION
1 LS E02
2 LS E02
3 LS E02
[ reached 'max' / getOption("max.print") -- omitted 3 rows ]
[1] 12905 22
As we can see above, we have a dataset with 22 columns and 12905 rows.
SHOP_WEEK SHOP_DATE SHOP_WEEKDAY SHOP_HOUR
Min. :200626 Min. :20060821 Min. :1.0000 Min. : 8.000
1st Qu.:200626 1st Qu.:20060824 1st Qu.:2.0000 1st Qu.:13.000
Median :200627 Median :20060828 Median :4.0000 Median :15.000
QUANTITY SPEND PROD_CODE PROD_CODE_10
Min. : 1.000 Min. : 0.010 Length:12905 Length:12905
1st Qu.: 1.000 1st Qu.: 0.730 Class :character Class :character
Median : 1.000 Median : 1.210 Mode :character Mode :character
PROD_CODE_20 PROD_CODE_30 PROD_CODE_40 CUST_CODE
Length:12905 Length:12905 Length:12905 Length:12905
Class :character Class :character Class :character Class :character
Mode :character Mode :character Mode :character Mode :character
CUST_PRICE_SENSITIVITY CUST_LIFESTAGE BASKET_ID
Length:12905 Length:12905 Min. :9.941e+14
Class :character Class :character 1st Qu.:9.941e+14
Mode :character Mode :character Median :9.941e+14
BASKET_SIZE BASKET_PRICE_SENSITIVITY BASKET_TYPE
Length:12905 Length:12905 Length:12905
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
BASKET_DOMINANT_MISSION STORE_CODE STORE_FORMAT
Length:12905 Length:12905 Length:12905
Class :character Class :character Class :character
Mode :character Mode :character Mode :character
STORE_REGION
Length:12905
Class :character
Mode :character
[ reached getOption("max.print") -- omitted 3 rows ]
As was mentioned before, for the purposes of this paper we will use just two columns. The first one, BASKET_ID, is a 15-digit number that represents a unique value for each transaction; among the many possible variables in the dataset, we are interested in the baskets. The second feature, PROD_CODE, is a character value with the format PRD0 followed by six digits and represents a unique id for each product. The dataset transactions1 was created with these two variables:
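A minimal sketch of this selection step, assuming the combined data frame is called transactions:
transactions1 <- transactions[, c("BASKET_ID", "PROD_CODE")]  # keep only the two columns of interest
head(transactions1)
dim(transactions1)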
BASKET_ID PROD_CODE
1 994102000458380 PRD0900032
2 994102000719678 PRD0900302
3 994102000719678 PRD0900331
4 994102000458380 PRD0900684
5 994102000458380 PRD0900830
6 994102000458380 PRD0900927
[1] 12905 2
The dataset transactions1 was created; it has 2 columns (features) and 12905 rows. Each row associates one basket with one product, so the same basket can appear several times (once per product bought), but the same basket/product combination is not repeated.
The above result obtained with the function dim() is not enough to answer questions like: How many baskets/transactions are in the dataset? How many products are in the dataset? But we can answer these questions with the functions length() and unique():
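For example (a sketch):
cat("In the data set transactions1 there are:",
    length(unique(transactions1$BASKET_ID)), "unique baskets/transactions\n")
cat("In the data set transactions1 there are:",
    length(unique(transactions1$PROD_CODE)), "unique products\n")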
In the data set transactions1 there are: 2000 unique baskets/transactions
In the data set transactions1 there are: 3101 unique products
At this point, it would be interesting to create a treemap in order to see the most bought products along all the baskets:
[1] 3101 2
As we can see above there are too many products (3101), and it would be difficult to analyze anything in the treemap, hence it is better to add the rule that only products with a frequency higher than 20 will be represented, as we can see below:
[1] 64 2
64 products are selected.
The sum of the frequencies should be the same as the number of rows of the transactions1 dataset:
[1] 12905
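A sketch of the frequency table and treemap (the exact code is not shown in the text; the object and column names are assumptions):
freq <- as.data.frame(table(transactions1$PROD_CODE))  # frequency of each product
colnames(freq) <- c("PROD_CODE", "freq")
dim(freq)        # 3101 products
sum(freq$freq)   # should equal the number of rows of transactions1 (12905)
freq20 <- subset(freq, freq > 20)
dim(freq20)      # 64 products remain
treemap(freq20, index = "PROD_CODE", vSize = "freq",
        title = "Most frequently bought products")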
transactions1 contained continuous data; now we will begin to transform it to discrete data with the help of the function ddply():
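A sketch of this aggregation step, collapsing all products of a basket into one comma-separated string:
transactions2 <- ddply(transactions1, c("BASKET_ID"),
                       function(df) paste(df$PROD_CODE, collapse = ","))
colnames(transactions2) <- c("Basket number", "Product codes")
head(transactions2)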
Basket number
1 994102000001422
2 994102000002105
3 994102000003300
4 994102000003606
5 994102000003857
6 994102000004284
Product codes
1 PRD0900154,PRD0901819,PRD0904263
2 PRD0900833,PRD0901648,PRD0904773
3 PRD0901131,PRD0901265,PRD0901359,PRD0901705,PRD0902051,PRD0902648
4 PRD0902008,PRD0903033,PRD0904315,PRD0904746,PRD0904896
5 PRD0900907,PRD0901329
6 PRD0902495,PRD0902640,PRD0903052,PRD0903081,PRD0904358,PRD0904478,PRD0904481
We now have a new dataset with discrete data, transactions2, and we can proceed to save it as a csv file. To do this we will use write.table() instead of write.csv(), as the latter does not allow setting col.names to FALSE, which we need in order not to have the column names in the csv file.
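A sketch of this export (the file name is an assumption):
write.table(transactions2, "transactions2.csv", sep = ",",
            row.names = FALSE, col.names = FALSE, quote = FALSE)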
Once the file is created, we can proceed to read it as transactions, creating a new discrete dataset, basket, which will be used throughout the application of the Apriori association rules algorithm.
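A sketch of reading the csv file back as a transactions object (file name as above):
basket <- read.transactions("transactions2.csv", format = "basket", sep = ",")
summary(basket)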
transactions as itemMatrix in sparse format with
2000 rows (elements/itemsets/transactions) and
4875 columns (items) and a density of 0.0013235897435897
most frequent items:
PRD0903052 PRD0903678 PRD0904358 PRD0901265 PRD0901887 (Other)
268 201 165 85 65 12121
element (itemset/transaction) length distribution:
sizes
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
354 317 253 161 125 99 94 72 65 53 57 45 35 38 27 32 26 22 25 11
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 41 46 51
8 9 9 12 10 8 8 3 6 4 4 1 2 1 1 1 1 1
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0000 2.0000 4.0000 6.4525 9.0000 51.0000
includes extended item information - examples:
labels
1 994102000001422 PRD0900154
2 994102000002105 PRD0900833
3 994102000003300 PRD0901131
There are 2000 transactions and 4875 items. The density is the total number of products purchased divided by the number of possible entries in the matrix (i.e. the proportion of non-empty cells); in our case it has the value 0.0013235897435897.
While checking the summary of our transactions, we can find valuable information:
- Firstly, we are able to calculate how many products were purchased:
The total number of products purchased is: 12904
This should be one fewer than the number of rows of the original dataset: 12905 22
Consequently, we are able to check whether we made any mistake in the process of transforming our data from continuous to discrete.
Note that the difference should be exactly one, because the original data frame (transactions) still carries the column names, whereas the csv file has no column names since col.names = FALSE was set in the write.table() function.
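A sketch of this check, using the size() function from arules (the name of the original data frame follows the sketch above):
sum(size(basket))   # total number of products purchased across all baskets
dim(transactions)   # number of rows and columns of the original data frame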
- Also we can find the most commonly bought products:
PRD0903052 - 268 units
PRD0903678 - 201 units
PRD0904358 - 165 units
- Finally, we can see that the average transaction contains 6.4525 products and the maximum number of products in one transaction is 51.
Now we can create a bar chart to represent the frequencies of the different products. In this case we display the top 20 most frequently bought products as a relative measure (it could also be absolute).
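A sketch of this plot with itemFrequencyPlot() from arules:
itemFrequencyPlot(basket, topN = 20, type = "relative",
                  col = brewer.pal(8, "Pastel2"),
                  main = "Relative item frequency (top 20)")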
The graph shows the same information as summary(): the product PRD0903052 is the most frequently bought product in the data, and we can see that it appears with a frequency of more than 12%.
The previous graph was created without fixing a support threshold, but it is possible to do so. The support is the frequency of the pattern in the data; here it has been set to 0.03, meaning that a product must occur in at least 3 out of every 100 transactions. As we can see below, only 6 items meet this condition:
If we increase the support threshold we will have fewer items; we can check this with a support of 0.05:
In this case there are only 3 products that occur in at least 5 out of every 100 transactions. It is logical that the support threshold is negatively related to the number of items: a higher support threshold means fewer products.
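A sketch of these two plots:
itemFrequencyPlot(basket, support = 0.03, col = brewer.pal(8, "Pastel2"),
                  main = "Items with support >= 0.03")   # 6 items
itemFrequencyPlot(basket, support = 0.05, col = brewer.pal(8, "Pastel2"),
                  main = "Items with support >= 0.05")   # 3 items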
We can also obtain the same result as before, but as a table, with the help of the ECLAT algorithm:
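A sketch with the eclat() function from arules (the object name is hypothetical):
freqitems <- eclat(basket, parameter = list(supp = 0.03, maxlen = 15))
inspect(freqitems)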
Eclat
parameter specification:
tidLists support minlen maxlen target ext
FALSE 0.03 1 15 frequent itemsets TRUE
algorithmic control:
sparse sort verbose
7 -2 TRUE
Absolute minimum support count: 60
create itemset ...
set transactions ...[4875 item(s), 2000 transaction(s)] done [0.01s].
sorting and recoding items ... [6 item(s)] done [0.00s].
creating sparse bit matrix ... [6 row(s), 2000 column(s)] done [0.00s].
writing ... [6 set(s)] done [0.00s].
Creating S4 object ... done [0.00s].
items support transIdenticalToItemsets count
[1] {PRD0903052} 0.1340 268 268
[2] {PRD0904358} 0.0825 165 165
[3] {PRD0903678} 0.1005 201 201
[4] {PRD0901265} 0.0425 85 85
[5] {PRD0904976} 0.0310 62 62
[6] {PRD0901887} 0.0325 65 65
Additionally, we can create different kinds of cross tables:
As we can see in the above cross table, the measure count was used, which means it tells us how many times the products occur together.
- For example:
- The product PRD0903052 occurred 268 times (on the diagonal, where a product is crossed with itself, this is simply the count of that product).
- The products PRD0903052 and PRD0903678 occurred together in 45 baskets.
In this second table the measure support was used. It is similar to the previous result, but in this case the counts are relative to the total number of baskets (2000). As we saw before, the products PRD0903052 and PRD0903678 occurred together in 45 baskets, hence in this case we have 45 divided by 2000 (second row, first column).
The support of PRD0903052 and PRD0903678 products is equal to: 0.0225
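A sketch of these cross tables with crossTable() from arules (the object names are hypothetical):
ct_count   <- crossTable(basket, measure = "count",   sort = TRUE)
ct_support <- crossTable(basket, measure = "support", sort = TRUE)
ct_count[1:5, 1:5]      # how often pairs of products occur together
ct_support[1:5, 1:5]    # the same, relative to the 2000 baskets
ct_support["PRD0903678", "PRD0903052"]   # 45 / 2000 = 0.0225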
We can apply a statistical association test, the chi-squared test. With this test we are able to check whether or not there is an association between the products.
Significance level: 5%
H0: no association between the two variables (null hypothesis)
H1: association between the two variables (alternative hypothesis)
If p-value > 0.05: do not reject H0, there is no association between the two variables.
If p-value < 0.05: reject H0, there is association between the two variables.
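A sketch of the chi-squared cross table (assuming crossTable() with measure = "chiSquared" is what produced the table of p-values discussed below):
chibasket <- crossTable(basket, measure = "chiSquared")
dim(chibasket)        # 4875 x 4875
chibasket[1:3, 1:3]   # values for the first few product pairs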
[1] 4875 4875
As per the above results, it seems that, at least for the first rows shown, the p-value is always lower than 5%, hence there is association between those products. However, this applies only to the head of the table, so it is convenient to check the remaining values as well.
As there are more than 10 million values, it is better to create a plot instead of storing the data in a list:
In the crosstable 'chibasket' there are: 11875501 unique values
As we can see in the graph above, there is no p-value greater than 5%, hence we can consider that there is association between all the products.
3) Generating Rules:
Apriori algorithm:
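A sketch of the apriori() call that produced the output below (parameters taken from that output):
arbasket <- apriori(basket,
                    parameter = list(supp = 0.0015, conf = 0.4,
                                     minlen = 1, maxlen = 10))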
Apriori
Parameter specification:
confidence minval smax arem aval originalSupport maxtime support minlen
0.4 0.1 1 none FALSE TRUE 5 0.0015 1
maxlen target ext
10 rules TRUE
Algorithmic control:
filter tree heap memopt load sort verbose
0.1 TRUE TRUE FALSE TRUE 2 TRUE
Absolute minimum support count: 3
set item appearances ...[0 item(s)] done [0.00s].
set transactions ...[4875 item(s), 2000 transaction(s)] done [0.01s].
sorting and recoding items ... [1221 item(s)] done [0.00s].
creating transaction tree ... done [0.00s].
checking subsets of size 1 2 3 4 done [0.01s].
writing ... [410 rule(s)] done [0.01s].
creating S4 object ... done [0.00s].
The algorithm takes basket as the transaction object on which the mining is applied. The parameters were set to a minimum support of 0.0015 and a minimum confidence of 0.4.
As we can see above there are 410 rules, but some of them are redundant, so we will proceed to eliminate the redundant rules:
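A sketch of detecting and removing the redundant rules with is.redundant() from arules:
redundant <- is.redundant(arbasket)
sum(redundant)                      # number of redundant rules
arbasket <- arbasket[!redundant]    # keep only the non-redundant rules
arbasket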
[1] 132
We obtained 132 redundant rules, hence we proceed to delete them from our rule set.
set of 278 rules
Finally, our rule set arbasket contains 278 rules:
set of 278 rules
rule length distribution (lhs + rhs):sizes
2 3
263 15
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 2.00 2.00 2.05 2.00 3.00
summary of quality measures:
support confidence coverage lift
Min. :0.00150 Min. :0.400 Min. :0.00150 Min. : 2.99
1st Qu.:0.00150 1st Qu.:0.444 1st Qu.:0.00250 1st Qu.: 3.73
Median :0.00150 Median :0.500 Median :0.00350 Median : 4.71
Mean :0.00198 Mean :0.564 Mean :0.00377 Mean : 13.47
3rd Qu.:0.00200 3rd Qu.:0.625 3rd Qu.:0.00450 3rd Qu.: 7.93
Max. :0.00800 Max. :1.000 Max. :0.01800 Max. :181.82
count
Min. : 3.00
1st Qu.: 3.00
Median : 3.00
Mean : 3.96
3rd Qu.: 4.00
Max. :16.00
mining info:
data ntransactions support confidence
basket 2000 0.0015 0.4
We can check the summary in order to obtain more information:
- Total number of rules: 278
- Distribution of rule lengths: rules with 2 products are the most common (263), while rules with 3 products are the least common (15)
- We can also see the summary of the quality measures
- Finally, there is information about the data and parameters used for creating the rules
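The rules with the highest lift can be inspected along these lines (a sketch):
inspect(head(sort(arbasket, by = "lift"), 5))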
lhs rhs support confidence coverage lift count
[1] {PRD0904263} => {PRD0901819} 0.0015 1.00000 0.0015 181.82 3
[2] {PRD0903875} => {PRD0902924} 0.0015 0.75000 0.0020 136.36 3
[3] {PRD0903837} => {PRD0903607} 0.0015 0.75000 0.0020 125.00 3
[4] {PRD0900266} => {PRD0904489} 0.0015 0.42857 0.0035 107.14 3
[5] {PRD0900882} => {PRD0903607} 0.0015 0.60000 0.0025 100.00 3
The lift is the result of dividing the confidence of a rule by its expected confidence (the support of the right-hand side).
As we can see in the results above, we obtained large lift values. Since the values at the top are greater than 1, the occurrence of the first product (for example, PRD0904263) has a positive effect on the occurrence of the second product (PRD0901819).
We can also check the lowest lift values, in order to see whether there is any rule with a lift lower than or equal to 1, as that would have a different interpretation:
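A sketch, sorting by lift in increasing order:
inspect(head(sort(arbasket, by = "lift", decreasing = FALSE), 5))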
lhs rhs support confidence coverage lift count
[1] {PRD0901067} => {PRD0903052} 0.002 0.4 0.005 2.9851 4
[2] {PRD0901387} => {PRD0903052} 0.002 0.4 0.005 2.9851 4
[3] {PRD0902819} => {PRD0903052} 0.002 0.4 0.005 2.9851 4
[4] {PRD0904613} => {PRD0903052} 0.002 0.4 0.005 2.9851 4
[5] {PRD0904738} => {PRD0903052} 0.002 0.4 0.005 2.9851 4
There is no rule with a lift lower than or equal to 1; the minimum value obtained is 2.9851, hence the interpretation given for the top rules applies to all of them.
A customer who buys PRD0901067/PRD0901387/PRD0902819/PRD0904613/PRD0904738 is about 2.9851 times more likely to also buy PRD0903052 than a customer in general.
These results are as expected, because when we ran the chi-squared test we rejected the null hypothesis of independence between the products.
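The rules with the highest support can be listed in a similar way (presumably how the next table was obtained):
inspect(head(sort(arbasket, by = "support"), 5))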
lhs rhs support confidence coverage lift count
[1] {PRD0900302} => {PRD0903052} 0.008 0.44444 0.0180 3.3167 16
[2] {PRD0902929} => {PRD0903052} 0.007 0.45161 0.0155 3.3702 14
[3] {PRD0902728} => {PRD0903052} 0.006 0.40000 0.0150 2.9851 12
[4] {PRD0900121} => {PRD0903052} 0.005 0.43478 0.0115 3.2446 10
[5] {PRD0900173} => {PRD0903052} 0.005 0.45455 0.0110 3.3921 10
As in the previous interpretations, the product PRD0903052 is present in all of these rules. We were expecting this, as it is the most frequently bought product: we saw before that it appears in more than 12% of the 2000 baskets.
Many rules have a small support, but the confidence is always greater than or equal to 0.40.
- There are different ways to plot the results:
We can create a scatter plot with confidence and support (two dimensions) for the 126 rules with a confidence greater than 0.5:
We can do the same with confidence, lift and support (three dimensions, with lift represented by shading), again for the 126 rules with confidence greater than 0.5:
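A sketch of both scatter plots with arulesViz (the object name highconf is hypothetical):
highconf <- subset(arbasket, confidence > 0.5)   # 126 rules
plot(highconf, measure = c("support", "confidence"))
plot(highconf, measure = c("support", "confidence"), shading = "lift")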
It appears that rules with a higher lift also tend to have a higher confidence.
For the next graphs we will use a set of the 50 rules with the highest confidence (tenarbasket):
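A sketch of this selection:
tenarbasket <- head(sort(arbasket, by = "confidence"), 50)
tenarbasket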
set of 50 rules
We can create a grouped matrix plot, it is similar to the previous representations, but in this case it shows the support of the rules:
Itemsets in Antecedent (LHS)
[1] "{PRD0904263}" "{PRD0903837}"
[3] "{PRD0902424}" "{PRD0903875}"
[5] "{PRD0902953}" "{PRD0904299}"
[7] "{PRD0903273}" "{PRD0901891}"
[9] "{PRD0903540}" "{PRD0900736}"
[11] "{PRD0904620}" "{PRD0900460}"
[13] "{PRD0902883}" "{PRD0900488}"
[15] "{PRD0900595}" "{PRD0902105}"
[17] "{PRD0902374}" "{PRD0902631}"
[19] "{PRD0901508}" "{PRD0901028}"
[21] "{PRD0901746}" "{PRD0903745}"
[23] "{PRD0901343}" "{PRD0904097}"
[25] "{PRD0903197}" "{PRD0902620}"
[27] "{PRD0901086}" "{PRD0900867}"
[29] "{PRD0901244}" "{PRD0901895}"
[31] "{PRD0904638}" "{PRD0904257}"
[33] "{PRD0901192}" "{PRD0904250,PRD0904976}"
[35] "{PRD0901474}" "{PRD0904142}"
[37] "{PRD0900705}" "{PRD0901413}"
[39] "{PRD0900481}" "{PRD0902117}"
[41] "{PRD0900462}" "{PRD0903722}"
[43] "{PRD0901114}" "{PRD0903083}"
[45] "{PRD0901667}" "{PRD0900830,PRD0903074}"
[47] "{PRD0900859}" "{PRD0901265,PRD0903074}"
[49] "{PRD0903776}"
Itemsets in Consequent (RHS)
[1] "{PRD0903052}" "{PRD0903678}" "{PRD0904358}" "{PRD0901265}" "{PRD0904976}"
[6] "{PRD0901887}" "{PRD0900830}" "{PRD0901383}" "{PRD0901228}" "{PRD0902929}"
[11] "{PRD0904962}" "{PRD0902242}" "{PRD0903607}" "{PRD0902924}" "{PRD0901819}"
Apart from that, we can create a matrix with 3 dimensions:
Itemsets in Antecedent (LHS)
[1] "{PRD0904263}" "{PRD0903837}"
[3] "{PRD0902424}" "{PRD0903875}"
[5] "{PRD0902953}" "{PRD0904299}"
[7] "{PRD0903273}" "{PRD0901891}"
[9] "{PRD0903540}" "{PRD0900736}"
[11] "{PRD0904620}" "{PRD0900460}"
[13] "{PRD0902883}" "{PRD0900488}"
[15] "{PRD0900595}" "{PRD0902105}"
[17] "{PRD0902374}" "{PRD0902631}"
[19] "{PRD0901508}" "{PRD0901028}"
[21] "{PRD0901746}" "{PRD0903745}"
[23] "{PRD0901343}" "{PRD0904097}"
[25] "{PRD0903197}" "{PRD0902620}"
[27] "{PRD0901086}" "{PRD0900867}"
[29] "{PRD0901244}" "{PRD0901895}"
[31] "{PRD0904638}" "{PRD0904257}"
[33] "{PRD0901192}" "{PRD0904250,PRD0904976}"
[35] "{PRD0901474}" "{PRD0904142}"
[37] "{PRD0900705}" "{PRD0901413}"
[39] "{PRD0900481}" "{PRD0902117}"
[41] "{PRD0900462}" "{PRD0903722}"
[43] "{PRD0901114}" "{PRD0903083}"
[45] "{PRD0901667}" "{PRD0900830,PRD0903074}"
[47] "{PRD0900859}" "{PRD0901265,PRD0903074}"
[49] "{PRD0903776}"
Itemsets in Consequent (RHS)
[1] "{PRD0903052}" "{PRD0903678}" "{PRD0904358}" "{PRD0901265}" "{PRD0904976}"
[6] "{PRD0901887}" "{PRD0900830}" "{PRD0901383}" "{PRD0901228}" "{PRD0902929}"
[11] "{PRD0904962}" "{PRD0902242}" "{PRD0903607}" "{PRD0902924}" "{PRD0901819}"
We also have the possibility to create a grouped matrix, again for the 50 rules:
Additionally, we can show dependencies with a parallel coordinates plot, in this case just for the 10 rules with the highest confidence. The darkest red arrow represents the rule with the highest lift which, as we saw before, links the products PRD0904263 and PRD0901819. We can also see that many arrows connect to the product PRD0903052 in the second position:
Instead of the above representation, we can display the same rules with the scheme below, obtaining the same results:
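A sketch of these visualizations with arulesViz (the exact plot methods used for each figure are assumptions based on the descriptions above):
plot(tenarbasket, method = "matrix", measure = "support")                 # matrix plot
plot(tenarbasket, method = "matrix", measure = c("support", "lift"))      # matrix with two measures
plot(tenarbasket, method = "grouped")                                     # grouped matrix plot
plot(head(sort(arbasket, by = "confidence"), 10), method = "paracoord")   # parallel coordinates
plot(head(sort(arbasket, by = "confidence"), 10), method = "graph")       # graph representation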
Until now we were selecting rules based on the parameters lift, support and confidence, but it is also possible to check only the rules that involve one specific product. This can be helpful if we only want information for one concrete product.
In this case we will analyze the product PRD0903052:
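A sketch of this filtering step (the object name is hypothetical):
rules052 <- subset(arbasket, items %in% "PRD0903052")   # rules that contain PRD0903052
summary(rules052)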
set of 175 rules
rule length distribution (lhs + rhs):sizes
2 3
161 14
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.00 2.00 2.00 2.08 2.00 3.00
summary of quality measures:
support confidence coverage lift
Min. :0.00150 Min. :0.400 Min. :0.0015 Min. :2.99
1st Qu.:0.00150 1st Qu.:0.444 1st Qu.:0.0025 1st Qu.:3.32
Median :0.00200 Median :0.500 Median :0.0035 Median :3.73
Mean :0.00217 Mean :0.574 Mean :0.0041 Mean :4.28
3rd Qu.:0.00250 3rd Qu.:0.667 3rd Qu.:0.0045 3rd Qu.:4.98
Max. :0.00800 Max. :1.000 Max. :0.0180 Max. :7.46
count
Min. : 3.00
1st Qu.: 3.00
Median : 4.00
Mean : 4.34
3rd Qu.: 5.00
Max. :16.00
mining info:
data ntransactions support confidence
basket 2000 0.0015 0.4
As we can see above, the product PRD0903052 appears in 175 rules
We can be sure that there are no redundant rules here, as all of them were already removed from the rule set arbasket.
lhs rhs support confidence coverage lift count
[1] {PRD0901508} => {PRD0903052} 0.0015 1 0.0015 7.4627 3
[2] {PRD0901028} => {PRD0903052} 0.0015 1 0.0015 7.4627 3
[3] {PRD0901746} => {PRD0903052} 0.0015 1 0.0015 7.4627 3
[4] {PRD0903745} => {PRD0903052} 0.0015 1 0.0015 7.4627 3
[5] {PRD0901343} => {PRD0903052} 0.0015 1 0.0015 7.4627 3
[6] {PRD0904097} => {PRD0903052} 0.0015 1 0.0015 7.4627 3
[7] {PRD0903197} => {PRD0903052} 0.0015 1 0.0015 7.4627 3
[8] {PRD0902620} => {PRD0903052} 0.0015 1 0.0015 7.4627 3
[9] {PRD0901086} => {PRD0903052} 0.0015 1 0.0015 7.4627 3
[ reached 'max' / getOption("max.print") -- omitted 11 rows ]
We can represent the 175 rules obtained for the product: PRD0903052:
As we can see, the rules with lower support tend to have larger confidence and larger lift.
4) Conclusions:
In comparison with other unsupervised machine learning methods, in association rules the first step of data preparation is really important, and it can be very time consuming. Most of the time we have continuous data, and in order to apply the algorithm the data must first be converted to discrete form.
In large datasets it is very difficult to interpret the tables or matrices directly, as they often contain millions of values. For this reason, graphical analysis is really important for interpreting the results in this case.