In this project we perform a market basket analysis of consumer purchases. We use the Groceries data set downloaded from Kaggle, which has 38765 rows of purchases from a grocery shop. These purchases are analysed with association rules, which can be generated with the Apriori algorithm; in R, the arules package provides this. Association rule mining is a data mining technique for finding links between variables in large data sets, and it is widely used to discover sales correlations in transactional data. Its best-known application is market basket analysis, whose purpose is to analyze the relationships among the products that customers buy together.
First, we install the necessary packages.
Next, we read the data set to start the project.
## [1] "data.frame"
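The call that produced the output above is not shown in this report. A minimal sketch of reading such a file, assuming a CSV with the three columns that appear later in the report; the inline text stands in for the downloaded file, and the file name `groceries.csv` is hypothetical:

```r
# Inline sample standing in for the downloaded CSV (two rows copied from
# the head() output shown later in this report).
csv_text <- "Member_number,Date,itemDescription
1808,21-07-2015,tropical fruit
2552,05-01-2015,whole milk"

# With the real file this would be read.csv("groceries.csv")
# (the file name is an assumption, it is not given in the report).
x <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# read.csv returns a plain data frame.
class(x)
```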
To understand the data set, we use the summary function.
## Member_number Date itemDescription
## Min. :1000 Length:38765 Length:38765
## 1st Qu.:2002 Class :character Class :character
## Median :3005 Mode :character Mode :character
## Mean :3004
## 3rd Qu.:4007
## Max. :5000
Next, we look at the first rows with the head function. Each row of the data set records one item that a member bought from the market on a given date.
## Member_number Date itemDescription
## 1 1808 21-07-2015 tropical fruit
## 2 2552 05-01-2015 whole milk
## 3 2300 19-09-2015 pip fruit
## 4 1187 12-12-2015 other vegetables
## 5 3037 01-02-2015 whole milk
## 6 4941 14-02-2015 rolls/buns
## [1] 0
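The `## [1] 0` above is consistent with a count of missing values, although the call itself is not shown. One common way to obtain such a count, sketched on a small made-up data frame (the name `toy` and its values are hypothetical):

```r
# Small made-up data frame with one missing value.
toy <- data.frame(
  Member_number = c(1000, NA, 1001),
  itemDescription = c("soda", "curd", "pastry"),
  stringsAsFactors = FALSE
)

# Total number of NA cells across all columns.
n_missing <- sum(is.na(toy))

# Number of NA cells per column, useful for locating the gaps.
n_missing_by_col <- colSums(is.na(toy))
```

A result of 0 on the real data means no row has to be dropped or imputed before grouping.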
After checking for NA values (there are none), we sort the rows by member number (1) and convert the member number to numeric (2). The item description remains a character column.
sorted <- x[order(x$Member_number),] #1
sorted$Member_number <- as.numeric(sorted$Member_number) #2
str(sorted)
## 'data.frame': 38765 obs. of 3 variables:
## $ Member_number : num 1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ...
## $ Date : chr "27-05-2015" "24-07-2015" "15-03-2015" "25-11-2015" ...
## $ itemDescription: chr "soda" "canned beer" "sausage" "sausage" ...
Before converting the data to basket format, we group all items bought by the same customer on the same date into one row (1), then drop the member number and date columns and write the result to a CSV file (2).
itemList <- ddply(sorted, c("Member_number","Date"), function(df1)paste(df1$itemDescription,collapse = ",")) #1
head(itemList)
## Member_number Date V1
## 1 1000 15-03-2015 sausage,whole milk,semi-finished bread,yogurt
## 2 1000 24-06-2014 whole milk,pastry,salty snack
## 3 1000 24-07-2015 canned beer,misc. beverages
## 4 1000 25-11-2015 sausage,hygiene articles
## 5 1000 27-05-2015 soda,pickled vegetables
## 6 1001 02-05-2015 frankfurter,curd
itemList$Member_number <- NULL
itemList$Date <- NULL
colnames(itemList) <- c("itemList")
write.csv(itemList,"ItemList.csv", quote = FALSE, row.names = TRUE)#2
head(itemList)
## itemList
## 1 sausage,whole milk,semi-finished bread,yogurt
## 2 whole milk,pastry,salty snack
## 3 canned beer,misc. beverages
## 4 sausage,hygiene articles
## 5 soda,pickled vegetables
## 6 frankfurter,curd
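The grouping step above relies on ddply from the plyr package. As an illustration only, the same idea can be sketched in base R with aggregate() on a few rows trimmed from the output above; the data frame name `toy` is hypothetical and this is not the code used in the report:

```r
# A few purchases taken from the grouped output above (baskets trimmed).
toy <- data.frame(
  Member_number = c(1000, 1000, 1000, 1001, 1001),
  Date = c("15-03-2015", "15-03-2015", "24-06-2014",
           "02-05-2015", "02-05-2015"),
  itemDescription = c("sausage", "whole milk", "pastry",
                      "frankfurter", "curd"),
  stringsAsFactors = FALSE
)

# One row per (member, date) pair, with the items pasted into a single
# comma-separated string, which is what the basket format expects.
grouped <- aggregate(itemDescription ~ Member_number + Date,
                     data = toy, FUN = paste, collapse = ",")
grouped
```

Each resulting row is one basket: everything a member bought on one day.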
“An association rule has two parts: an antecedent (if) and a consequent (then). An antecedent is an item found within the data. A consequent is an item found in combination with the antecedent. Association rules are created by searching data for frequent if-then patterns and using the criteria support and confidence to identify the most important relationships. Support is an indication of how frequently the items appear in the data. Confidence indicates the number of times the if-then statements are found true. A third metric, called lift, can be used to compare confidence with expected confidence, or how many times an if-then statement is expected to be found true. Association rules are calculated from itemsets, which are made up of two or more items. If rules are built from analyzing all the possible itemsets, there could be so many rules that the rules hold little meaning. With that, association rules are typically created from rules well-represented in data.” (https://searchbusinessanalytics.techtarget.com/definition/association-rules-in-data-mining)
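To make the three metrics concrete, here is a minimal base R sketch on five made-up baskets. The baskets and the helper names `has_all` and `support` are hypothetical and are not part of arules; the formulas are the standard definitions of support, confidence and lift:

```r
# Five made-up baskets (not taken from the groceries data).
baskets <- list(
  c("sausage", "yogurt", "whole milk"),
  c("sausage", "yogurt"),
  c("whole milk", "soda"),
  c("whole milk", "yogurt"),
  c("soda")
)

# TRUE if a basket contains every item in `items`.
has_all <- function(basket, items) all(items %in% basket)

# Support: share of baskets that contain all of `items`.
support <- function(items) {
  mean(vapply(baskets, has_all, logical(1), items = items))
}

lhs <- c("sausage", "yogurt")  # antecedent (if)
rhs <- "whole milk"            # consequent (then)

sup_rule   <- support(c(lhs, rhs))      # 1 of 5 baskets
confidence <- sup_rule / support(lhs)   # share of lhs baskets that also hold rhs
lift       <- confidence / support(rhs) # confidence relative to the rhs base rate
```

A lift above 1 means the consequent appears with the antecedent more often than its overall frequency would suggest; a lift below 1 means less often.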
burhansbasket = read.transactions(file="ItemList.csv", rm.duplicates= TRUE, format="basket", sep=",", cols=1)
## distribution of transactions with duplicates:
## items
## 1 2 3 4
## 662 39 5 1
## transactions in sparse format with
## 14964 transactions (rows) and
## 168 items (columns)
The output above shows 14964 transactions and 168 items. Quotes were already omitted when the file was written (quote = FALSE), so the transactions can be passed directly to the Apriori algorithm.
basket_rules1 <- apriori(burhansbasket, parameter = list(minlen=2, sup = 0.001, conf = 0.05, target="rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.05 0.1 1 none FALSE TRUE 5 0.001 2
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 14
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[168 item(s), 14964 transaction(s)] done [0.01s].
## sorting and recoding items ... [149 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [450 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
## set of 450 rules
##
## rule length distribution (lhs + rhs):sizes
## 2 3
## 423 27
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 2.00 2.00 2.06 2.00 3.00
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001002 Min. :0.05000 Min. :0.005346 Min. :0.5195
## 1st Qu.:0.001270 1st Qu.:0.06397 1st Qu.:0.015972 1st Qu.:0.7673
## Median :0.001938 Median :0.08108 Median :0.023590 Median :0.8350
## Mean :0.002760 Mean :0.08759 Mean :0.033723 Mean :0.8859
## 3rd Qu.:0.003341 3rd Qu.:0.10482 3rd Qu.:0.043705 3rd Qu.:0.9601
## Max. :0.014836 Max. :0.25581 Max. :0.157912 Max. :2.1831
## count
## Min. : 15.0
## 1st Qu.: 19.0
## Median : 29.0
## Mean : 41.3
## 3rd Qu.: 50.0
## Max. :222.0
##
## mining info:
## data ntransactions support confidence
## burhansbasket 14964 0.001 0.05
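For a sense of scale, the `sup = 0.001` threshold can be turned into a number of transactions. This is a small arithmetic sketch using the figures reported above; the variable names are hypothetical:

```r
n_transactions <- 14964  # ntransactions from the mining info above
min_support    <- 0.001  # the sup value passed to apriori()

# Smallest number of transactions whose share reaches the threshold.
min_count <- ceiling(min_support * n_transactions)

# Support corresponding to that count, for comparison with the
# minimum support in the summary of quality measures.
min_count / n_transactions
```

This agrees with the summary above, where the smallest count among the rules is 15 and the smallest support is about 0.001002.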
Here we found 450 rules among our customers' transactions, which is too many to interpret with confidence, so we change the hyperparameters: we raise minlen from 2 to 3 and lower conf to 0.01.
basket_rules2 <- apriori(burhansbasket, parameter = list(minlen=3, sup = 0.001, conf = 0.01, target="rules"))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.01 0.1 1 none FALSE TRUE 5 0.001 3
## maxlen target ext
## 10 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 14
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[168 item(s), 14964 transaction(s)] done [0.00s].
## sorting and recoding items ... [149 item(s)] done [0.00s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 done [0.00s].
## writing ... [27 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
## set of 27 rules
##
## rule length distribution (lhs + rhs):sizes
## 3
## 27
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3 3 3 3 3 3
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001002 Min. :0.07177 Min. :0.005346 Min. :0.7054
## 1st Qu.:0.001136 1st Qu.:0.08908 1st Qu.:0.008520 1st Qu.:0.7868
## Median :0.001136 Median :0.11724 Median :0.010559 Median :1.0825
## Mean :0.001181 Mean :0.12181 Mean :0.010564 Mean :1.0919
## 3rd Qu.:0.001203 3rd Qu.:0.13612 3rd Qu.:0.012797 3rd Qu.:1.1915
## Max. :0.001470 Max. :0.25581 Max. :0.014836 Max. :2.1831
## count
## Min. :15.00
## 1st Qu.:17.00
## Median :17.00
## Mean :17.67
## 3rd Qu.:18.00
## Max. :22.00
##
## mining info:
## data ntransactions support confidence
## burhansbasket 14964 0.001 0.01
Now there are 27 rules, so we may be able to tell from these transactions what else a customer took along with a given purchase at the grocery store. To be sure, we inspect the association rules before making any decisions.
## lhs rhs support confidence
## [1] {sausage,yogurt} => {whole milk} 0.001470195 0.25581395
## [2] {sausage,whole milk} => {yogurt} 0.001470195 0.16417910
## [3] {whole milk,yogurt} => {sausage} 0.001470195 0.13173653
## [4] {sausage,soda} => {whole milk} 0.001069233 0.17977528
## [5] {sausage,whole milk} => {soda} 0.001069233 0.11940299
## [6] {soda,whole milk} => {sausage} 0.001069233 0.09195402
## [7] {rolls/buns,sausage} => {whole milk} 0.001136060 0.21250000
## [8] {sausage,whole milk} => {rolls/buns} 0.001136060 0.12686567
## [9] {rolls/buns,whole milk} => {sausage} 0.001136060 0.08133971
## [10] {rolls/buns,yogurt} => {whole milk} 0.001336541 0.17094017
## [11] {whole milk,yogurt} => {rolls/buns} 0.001336541 0.11976048
## [12] {rolls/buns,whole milk} => {yogurt} 0.001336541 0.09569378
## [13] {other vegetables,yogurt} => {whole milk} 0.001136060 0.14049587
## [14] {whole milk,yogurt} => {other vegetables} 0.001136060 0.10179641
## [15] {other vegetables,whole milk} => {yogurt} 0.001136060 0.07657658
## [16] {rolls/buns,soda} => {other vegetables} 0.001136060 0.14049587
## [17] {other vegetables,soda} => {rolls/buns} 0.001136060 0.11724138
## [18] {other vegetables,rolls/buns} => {soda} 0.001136060 0.10759494
## [19] {rolls/buns,soda} => {whole milk} 0.001002406 0.12396694
## [20] {soda,whole milk} => {rolls/buns} 0.001002406 0.08620690
## coverage lift count
## [1] 0.005747126 1.6199746 22
## [2] 0.008954825 1.9118880 22
## [3] 0.011160118 2.1830624 22
## [4] 0.005947608 1.1384500 16
## [5] 0.008954825 1.2296946 16
## [6] 0.011627907 1.5238095 16
## [7] 0.005346164 1.3456835 17
## [8] 0.008954825 1.1533523 17
## [9] 0.013966854 1.3479152 17
## [10] 0.007818765 1.0825005 20
## [11] 0.011160118 1.0887581 20
## [12] 0.013966854 1.1143671 20
## [13] 0.008086073 0.8897081 17
## [14] 0.011160118 0.8337610 17
## [15] 0.014835605 0.8917447 17
## [16] 0.008086073 1.1507281 17
## [17] 0.009689922 1.0658566 17
## [18] 0.010558674 1.1080872 17
## [19] 0.008086073 0.7850365 15
## [20] 0.011627907 0.7837181 15
The output above lists the first 20 of the 27 rules, with their support, confidence, coverage, lift and count.
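As a check on how the columns relate to each other, the figures for rule [1], {sausage,yogurt} => {whole milk}, can be recomputed from its count and coverage. This is a sketch using only numbers printed above; the variable names are hypothetical:

```r
n_transactions <- 14964       # ntransactions from the mining info
count_rule     <- 22          # count reported for rule [1]
coverage_lhs   <- 0.005747126 # coverage of rule [1], i.e. support of {sausage,yogurt}
lift_reported  <- 1.6199746   # lift reported for rule [1]

# Support of the whole rule: baskets containing all three items.
support_rule <- count_rule / n_transactions

# Confidence: share of {sausage,yogurt} baskets that also hold whole milk.
confidence <- support_rule / coverage_lhs

# Lift is confidence divided by the support of the consequent, so the
# reported lift implies this overall support for {whole milk}.
support_rhs <- confidence / lift_reported
```

The recomputed support (about 0.00147) and confidence (about 0.256) match the table, and the implied support of whole milk is about 0.158, so whole milk shows up roughly 1.6 times as often in baskets with sausage and yogurt as it does overall.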
Market basket analysis is used by retailers in particular to develop marketing strategies by analyzing customer purchasing habits. Association rule mining algorithms, such as Apriori, are very useful for finding simple associations between data elements. In our project we also used Apriori, and as a result we can say that some of the rules make sense. For example, “sausage”, “yogurt”, and “whole milk” appear to be connected: the three rules that combine them have the highest lift values in the list, between 1.6 and 2.2. If we designed an application for a market, then when someone bought sausage and yogurt we could recommend whole milk to them. In real life, this kind of project is very important for data scientists.