Introduction

In this project we perform a market basket analysis of consumer purchases using a groceries data set downloaded from Kaggle. The data set has 38765 rows, each recording one item purchased at a grocery shop. We analyze these purchases with association rules generated by the Apriori algorithm, which the R package arules implements. Association rule mining is a data mining technique for finding links between variables in large data sets, and it is widely used to discover sales correlations in transactional data. Its most popular application is market basket analysis, whose purpose is to find which products customers tend to buy together.

First, we load the necessary packages.

library(arules)
library(arulesViz)
library(plyr)  # loaded before dplyr to avoid function-masking conflicts
library(dplyr)

Data Preparation

We start by reading the data set.

x <- read.csv("C:\\Users\\B.KURT\\Desktop\\usl\\Groceries_dataset.csv")



class(x)
## [1] "data.frame"

To understand the data set, we use the summary function.

summary(x)
##  Member_number      Date           itemDescription   
##  Min.   :1000   Length:38765       Length:38765      
##  1st Qu.:2002   Class :character   Class :character  
##  Median :3005   Mode  :character   Mode  :character  
##  Mean   :3004                                        
##  3rd Qu.:4007                                        
##  Max.   :5000

Next, we look at the first rows with the head function. Each row records a single item that a customer (Member_number) bought on a given date, so the data set contains many transactions from the market.

head(x)
##   Member_number       Date  itemDescription
## 1          1808 21-07-2015   tropical fruit
## 2          2552 05-01-2015       whole milk
## 3          2300 19-09-2015        pip fruit
## 4          1187 12-12-2015 other vegetables
## 5          3037 01-02-2015       whole milk
## 6          4941 14-02-2015       rolls/buns
sum(is.na(x)) # check whether there are any NA values
## [1] 0

After checking for NA values, we sort the rows by member number (1) and convert the member number to numeric (2). The item descriptions stay as character strings.

sorted <- x[order(x$Member_number),] #1
sorted$Member_number <- as.numeric(sorted$Member_number) #2
str(sorted)
## 'data.frame':    38765 obs. of  3 variables:
##  $ Member_number  : num  1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ...
##  $ Date           : chr  "27-05-2015" "24-07-2015" "15-03-2015" "25-11-2015" ...
##  $ itemDescription: chr  "soda" "canned beer" "sausage" "sausage" ...

Before converting the data to basket format, we group all items bought by the same customer on the same date into one row (1), then remove the member number and date columns and write the result to a CSV file (2).

itemList <- ddply(sorted, c("Member_number", "Date"), function(df1) paste(df1$itemDescription, collapse = ",")) #1
head(itemList)
##   Member_number       Date                                            V1
## 1          1000 15-03-2015 sausage,whole milk,semi-finished bread,yogurt
## 2          1000 24-06-2014                 whole milk,pastry,salty snack
## 3          1000 24-07-2015                   canned beer,misc. beverages
## 4          1000 25-11-2015                      sausage,hygiene articles
## 5          1000 27-05-2015                       soda,pickled vegetables
## 6          1001 02-05-2015                              frankfurter,curd
itemList$Member_number <- NULL
itemList$Date <- NULL
colnames(itemList) <- c("itemList")  

write.csv(itemList,"ItemList.csv", quote = FALSE, row.names = TRUE)#2
head(itemList)
##                                        itemList
## 1 sausage,whole milk,semi-finished bread,yogurt
## 2                 whole milk,pastry,salty snack
## 3                   canned beer,misc. beverages
## 4                      sausage,hygiene articles
## 5                       soda,pickled vegetables
## 6                              frankfurter,curd
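
The grouping step can also be done without plyr. The following is a minimal sketch on a few made-up rows, assuming base R's aggregate as a stand-in for ddply. The names toy and toy_baskets are illustrative and are not part of the analysis above.

```r
# Toy rows in the same shape as the groceries data (illustrative values only)
toy <- data.frame(
  Member_number   = c(1000, 1000, 1000, 1001),
  Date            = c("15-03-2015", "15-03-2015", "24-06-2014", "02-05-2015"),
  itemDescription = c("sausage", "whole milk", "pastry", "curd"),
  stringsAsFactors = FALSE
)

# One row per (member, date) pair, with the items pasted into a single string
toy_baskets <- aggregate(
  itemDescription ~ Member_number + Date,
  data = toy,
  FUN  = function(items) paste(items, collapse = ",")
)

print(toy_baskets)
```

Member 1000 has two rows on 15-03-2015, so those two items end up in the same basket, while the other rows stay as single-item baskets.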

How association rules work

“An association rule has two parts: an antecedent (if) and a consequent (then). An antecedent is an item found within the data. A consequent is an item found in combination with the antecedent. Association rules are created by searching data for frequent if-then patterns and using the criteria support and confidence to identify the most important relationships. Support is an indication of how frequently the items appear in the data. Confidence indicates the number of times the if-then statements are found true. A third metric, called lift, can be used to compare confidence with expected confidence, or how many times an if-then statement is expected to be found true. Association rules are calculated from itemsets, which are made up of two or more items. If rules are built from analyzing all the possible itemsets, there could be so many rules that the rules hold little meaning. With that, association rules are typically created from rules well-represented in data.” (https://searchbusinessanalytics.techtarget.com/definition/association-rules-in-data-mining)
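
To make the three metrics concrete, here is a small sketch that computes support, confidence, and lift by hand for one rule. It uses five made-up baskets and a hypothetical helper named support; it is only an illustration of the definitions, not how arules computes them internally.

```r
# Five made-up baskets (illustrative only)
baskets <- list(
  c("sausage", "yogurt", "whole milk"),
  c("sausage", "yogurt"),
  c("whole milk", "rolls/buns"),
  c("sausage", "whole milk"),
  c("yogurt", "whole milk")
)

# Share of baskets that contain every item in `items`
support <- function(items) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

lhs <- c("sausage", "yogurt")  # antecedent (if)
rhs <- "whole milk"            # consequent (then)

rule_support    <- support(c(lhs, rhs))           # how often lhs and rhs occur together
rule_confidence <- rule_support / support(lhs)    # how often rhs appears when lhs is present
rule_lift       <- rule_confidence / support(rhs) # confidence relative to the baseline rate of rhs

print(c(support = rule_support, confidence = rule_confidence, lift = rule_lift))
```

A lift above 1 means the consequent appears with the antecedent more often than its overall frequency would suggest, and a lift below 1 means it appears less often.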

burhansbasket <- read.transactions(file = "ItemList.csv", rm.duplicates = TRUE, format = "basket", sep = ",", cols = 1)
## distribution of transactions with duplicates:
## items
##   1   2   3   4 
## 662  39   5   1
print(burhansbasket)
## transactions in sparse format with
##  14964 transactions (rows) and
##  168 items (columns)

The output above shows 14964 transactions and 168 items. Next, we remove any quotation marks from the item labels before running the Apriori algorithm.

burhansbasket@itemInfo$labels <- gsub("\"","",burhansbasket@itemInfo$labels) 
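
As an alternative to writing and re-reading a CSV file, a transactions object can be built in memory by coercing a list of item vectors. The sketch below assumes the arules package is installed and uses made-up rows. The names toy and toy_transactions are illustrative, and this is not the approach taken above.

```r
library(arules)

# Toy rows in the same shape as the groceries data (illustrative values only)
toy <- data.frame(
  Member_number   = c(1000, 1000, 1000, 1001),
  Date            = c("15-03-2015", "15-03-2015", "24-06-2014", "02-05-2015"),
  itemDescription = c("sausage", "whole milk", "pastry", "curd"),
  stringsAsFactors = FALSE
)

# One item vector per (member, date) pair; drop = TRUE discards empty combinations
baskets <- split(toy$itemDescription,
                 list(toy$Member_number, toy$Date),
                 drop = TRUE)

# Remove repeated items within a basket before coercing
baskets <- lapply(baskets, unique)

toy_transactions <- as(baskets, "transactions")
print(toy_transactions)
```

Because the items never pass through a text file, there are no quotation marks or header lines to clean up afterwards.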

Creating Basket Rules

basket_rules1 <- apriori(burhansbasket, parameter = list(minlen=2, sup = 0.001, conf = 0.05, target="rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.05    0.1    1 none FALSE            TRUE       5   0.001      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 14 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[168 item(s), 14964 transaction(s)] done [0.01s].
## sorting and recoding items ... [149 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [450 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(basket_rules1)
## set of 450 rules
## 
## rule length distribution (lhs + rhs):sizes
##   2   3 
## 423  27 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    2.00    2.00    2.00    2.06    2.00    3.00 
## 
## summary of quality measures:
##     support           confidence         coverage             lift       
##  Min.   :0.001002   Min.   :0.05000   Min.   :0.005346   Min.   :0.5195  
##  1st Qu.:0.001270   1st Qu.:0.06397   1st Qu.:0.015972   1st Qu.:0.7673  
##  Median :0.001938   Median :0.08108   Median :0.023590   Median :0.8350  
##  Mean   :0.002760   Mean   :0.08759   Mean   :0.033723   Mean   :0.8859  
##  3rd Qu.:0.003341   3rd Qu.:0.10482   3rd Qu.:0.043705   3rd Qu.:0.9601  
##  Max.   :0.014836   Max.   :0.25581   Max.   :0.157912   Max.   :2.1831  
##      count      
##  Min.   : 15.0  
##  1st Qu.: 19.0  
##  Median : 29.0  
##  Mean   : 41.3  
##  3rd Qu.: 50.0  
##  Max.   :222.0  
## 
## mining info:
##           data ntransactions support confidence
##  burhansbasket         14964   0.001       0.05

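With a minimum support of 0.001 and a minimum confidence of 0.05, Apriori finds 450 rules among our customers' transactions. That is too many to review reliably, and most of them (423) contain only two items, so we change the hyperparameters: we require at least three items per rule (minlen = 3) and lower the confidence threshold to 0.01.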

basket_rules2 <- apriori(burhansbasket, parameter = list(minlen=3, sup = 0.001, conf = 0.01, target="rules"))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##        0.01    0.1    1 none FALSE            TRUE       5   0.001      3
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 14 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[168 item(s), 14964 transaction(s)] done [0.00s].
## sorting and recoding items ... [149 item(s)] done [0.00s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [27 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
summary(basket_rules2)
## set of 27 rules
## 
## rule length distribution (lhs + rhs):sizes
##  3 
## 27 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       3       3       3       3       3       3 
## 
## summary of quality measures:
##     support           confidence         coverage             lift       
##  Min.   :0.001002   Min.   :0.07177   Min.   :0.005346   Min.   :0.7054  
##  1st Qu.:0.001136   1st Qu.:0.08908   1st Qu.:0.008520   1st Qu.:0.7868  
##  Median :0.001136   Median :0.11724   Median :0.010559   Median :1.0825  
##  Mean   :0.001181   Mean   :0.12181   Mean   :0.010564   Mean   :1.0919  
##  3rd Qu.:0.001203   3rd Qu.:0.13612   3rd Qu.:0.012797   3rd Qu.:1.1915  
##  Max.   :0.001470   Max.   :0.25581   Max.   :0.014836   Max.   :2.1831  
##      count      
##  Min.   :15.00  
##  1st Qu.:17.00  
##  Median :17.00  
##  Mean   :17.67  
##  3rd Qu.:18.00  
##  Max.   :22.00  
## 
## mining info:
##           data ntransactions support confidence
##  burhansbasket         14964   0.001       0.01

Now there are 27 rules, a set small enough to examine what customers tend to buy together at the grocery store. Before drawing conclusions, we inspect and visualize the association rules.
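
One way to narrow a rule set further is to keep only rules with a lift above 1 and rank them by lift. The following is a minimal sketch on made-up transactions, assuming the arules package is installed. The toy data, the thresholds, and the names toy, rules_by_lift, and strong are illustrative.

```r
library(arules)

# Six made-up baskets (illustrative only)
toy <- as(list(
  c("sausage", "yogurt", "whole milk"),
  c("sausage", "yogurt", "whole milk"),
  c("sausage", "whole milk"),
  c("yogurt", "rolls/buns"),
  c("rolls/buns", "soda"),
  c("whole milk", "soda")
), "transactions")

# Thresholds chosen for the toy data, not for the groceries data
rules <- apriori(
  toy,
  parameter = list(minlen = 2, supp = 0.3, conf = 0.5, target = "rules"),
  control   = list(verbose = FALSE)
)

rules_by_lift <- sort(rules, by = "lift")   # highest lift first
strong <- subset(rules_by_lift, lift > 1)   # keep rules with lift above 1

inspect(strong)
```

The same two calls, sort and subset, could be applied to basket_rules2 to list its rules from highest to lowest lift.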

Visualizing

inspect(basket_rules2[1:20])
##      lhs                              rhs                support     confidence
## [1]  {sausage,yogurt}              => {whole milk}       0.001470195 0.25581395
## [2]  {sausage,whole milk}          => {yogurt}           0.001470195 0.16417910
## [3]  {whole milk,yogurt}           => {sausage}          0.001470195 0.13173653
## [4]  {sausage,soda}                => {whole milk}       0.001069233 0.17977528
## [5]  {sausage,whole milk}          => {soda}             0.001069233 0.11940299
## [6]  {soda,whole milk}             => {sausage}          0.001069233 0.09195402
## [7]  {rolls/buns,sausage}          => {whole milk}       0.001136060 0.21250000
## [8]  {sausage,whole milk}          => {rolls/buns}       0.001136060 0.12686567
## [9]  {rolls/buns,whole milk}       => {sausage}          0.001136060 0.08133971
## [10] {rolls/buns,yogurt}           => {whole milk}       0.001336541 0.17094017
## [11] {whole milk,yogurt}           => {rolls/buns}       0.001336541 0.11976048
## [12] {rolls/buns,whole milk}       => {yogurt}           0.001336541 0.09569378
## [13] {other vegetables,yogurt}     => {whole milk}       0.001136060 0.14049587
## [14] {whole milk,yogurt}           => {other vegetables} 0.001136060 0.10179641
## [15] {other vegetables,whole milk} => {yogurt}           0.001136060 0.07657658
## [16] {rolls/buns,soda}             => {other vegetables} 0.001136060 0.14049587
## [17] {other vegetables,soda}       => {rolls/buns}       0.001136060 0.11724138
## [18] {other vegetables,rolls/buns} => {soda}             0.001136060 0.10759494
## [19] {rolls/buns,soda}             => {whole milk}       0.001002406 0.12396694
## [20] {soda,whole milk}             => {rolls/buns}       0.001002406 0.08620690
##      coverage    lift      count
## [1]  0.005747126 1.6199746 22   
## [2]  0.008954825 1.9118880 22   
## [3]  0.011160118 2.1830624 22   
## [4]  0.005947608 1.1384500 16   
## [5]  0.008954825 1.2296946 16   
## [6]  0.011627907 1.5238095 16   
## [7]  0.005346164 1.3456835 17   
## [8]  0.008954825 1.1533523 17   
## [9]  0.013966854 1.3479152 17   
## [10] 0.007818765 1.0825005 20   
## [11] 0.011160118 1.0887581 20   
## [12] 0.013966854 1.1143671 20   
## [13] 0.008086073 0.8897081 17   
## [14] 0.011160118 0.8337610 17   
## [15] 0.014835605 0.8917447 17   
## [16] 0.008086073 1.1507281 17   
## [17] 0.009689922 1.0658566 17   
## [18] 0.010558674 1.1080872 17   
## [19] 0.008086073 0.7850365 15   
## [20] 0.011627907 0.7837181 15

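The table above lists the first 20 rules with their support, confidence, coverage, lift, and count. The following plots show the rules as a grouped matrix, a graph, and a parallel coordinates plot.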
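
The columns of the table are related to each other, which can be checked with simple arithmetic. The sketch below takes the printed values for rule [1], {sausage,yogurt} => {whole milk}, and recomputes its support and confidence. The implied support of whole milk is back-calculated from the reported lift rather than read from the data, so treat it as an approximation.

```r
# Values copied from the output above for rule [1]
n_transactions <- 14964        # from print(burhansbasket)
rule_count     <- 22           # transactions containing sausage, yogurt and whole milk
lhs_coverage   <- 0.005747126  # share of transactions containing {sausage, yogurt}
reported_lift  <- 1.6199746

support    <- rule_count / n_transactions  # count divided by number of transactions
confidence <- support / lhs_coverage       # support divided by coverage of the left-hand side

# Lift is confidence divided by the support of the right-hand side,
# so the implied support of {whole milk} is confidence / lift
rhs_support <- confidence / reported_lift

print(round(c(support = support, confidence = confidence, rhs_support = rhs_support), 4))
```

The recomputed support and confidence agree with the table to within rounding of the printed values.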

plot(basket_rules2, method = "grouped", control = list(k = 5))

plot(basket_rules2[1:20], method="graph")

plot(basket_rules2[1:20], method="paracoord")

Conclusion

Market basket analysis is used by retailers in particular to develop marketing strategies by analyzing customer purchasing habits. Association rule mining algorithms such as Apriori are very useful for finding simple associations between data elements. In our project we also used Apriori, and the results include rules that make sense. For example, “sausage”, “yogurt”, and “whole milk” appear to be connected: the rule {sausage,yogurt} => {whole milk} has a confidence of about 0.26 and a lift of about 1.62, and {whole milk,yogurt} => {sausage} has the highest lift, about 2.18, although both rest on only 22 of the 14964 transactions. If we designed an application for the market, it could recommend whole milk to a customer who buys sausage and yogurt. In practice, this kind of analysis is an important tool for data scientists.