Objective: As a Data Scientist at FDMart Grocery, analyze the FDMart transaction database to identify interesting purchase patterns.
Overview: The dataset provided by FDMart Grocery covers 106 items on sale and a total of 64808 transactions (summary() below reports 64809 rows and 107 columns, one more of each, possibly because a header line in transactions.txt was read as data). The most frequent item is fresh vegetables, which is present in about 30% of all transactions in the dataset. The second most frequent item is fresh fruit, followed by cheese, soup and dried fruit, which occur at similar frequencies. The item frequency plot below shows that FDMart customers buy fresh items and foods such as cheese, dried fruit and juice more frequently than bottled or canned items.
# Load package arules
library(arules)
library(arulesViz)
library(grid)
# List datasets in package
#data()
#load dataset
transactions <- read.transactions("transactions.txt",format="single",sep=",",cols=c(1,2))
class(transactions)
[1] "transactions"
attr(,"package")
[1] "arules"
# summary showing basic statistics of the data set
summary(transactions)
transactions as itemMatrix in sparse format with
64809 rows (elements/itemsets/transactions) and
107 columns (items) and a density of 0.05353055
most frequent items:
Fresh Vegetables Fresh Fruit Cheese Soup Dried Fruit (Other)
20001 12641 9380 8209 8140 312840
element (itemset/transaction) length distribution:
sizes
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
4490 8628 8522 10010 8344 9013 6075 2247 997 1024 999 672 436 249 235 226 149
18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
96 80 94 91 77 85 91 92 123 162 207 226 216 174 152 124 115
35 36 37 38 39 40 41 42 43 44
79 63 62 28 26 14 8 6 1 1
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 3.000 5.000 5.728 6.000 44.000
includes extended item information - examples:
labels
1 Acetominifen
2 Anchovies
3 Aspirin
includes extended transaction information - examples:
transactionID
1 1
2 10
3 100
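The headline figures in the overview can be cross-checked directly from the summary output above. The following is a minimal sketch in base R that only re-uses the printed counts; the variable names are illustrative and not part of the original analysis.

```r
# Counts copied from the "most frequent items" line of summary(transactions)
item_counts <- c("Fresh Vegetables" = 20001, "Fresh Fruit" = 12641,
                 "Cheese" = 9380, "Soup" = 8209, "Dried Fruit" = 8140,
                 "(Other)" = 312840)
n_transactions <- 64809  # rows reported by summary()
n_items <- 107           # columns reported by summary()

# Share of transactions that contain each of the five named items
item_share <- item_counts[names(item_counts) != "(Other)"] / n_transactions
print(round(item_share, 3))

# Total item occurrences, mean basket size and matrix density
total_occurrences <- sum(item_counts)
mean_basket_size <- total_occurrences / n_transactions
density <- total_occurrences / (n_transactions * n_items)
print(c(mean_basket_size = mean_basket_size, density = density))
```

The mean basket size and density computed this way should agree with the Mean and density values printed by summary(), which is a quick way to confirm that the counts were read correctly.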
# plot frequencies of frequent items in the dataset
itemFrequencyPlot(transactions, support=0.1, cex.names=0.8)
Item frequency plot (items with a support of at least 0.1)
On average, each basket contains 5 to 6 items (mean 5.728, median 5). Baskets with fewer than 5 items are much more frequent than baskets with more than 15 items, so buyers generally come to purchase only a few items from the shop.
For rule mining, the minimum support is set to 0.01, which means that only itemsets appearing in at least 1 of every 100 transactions are kept. Anything rarer than that is ignored in this study.
Support shows the frequency of the pattern in the rule; it is the percentage of transactions that contain both A and B:
Support = P(A and B) = (# of transactions involving A and B) / (total number of transactions).
Confidence is the strength of implication of a rule; it is the percentage of transactions that contain B given that they contain A:
Confidence = P(B | A) = (# of transactions involving A and B) / (total number of transactions that have A).
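To make these definitions concrete, here is a small sketch in base R that computes support and confidence by hand, together with the lift score used in the rule tables of this report (confidence divided by the share of baskets containing the consequent). The five baskets and the helper function support() are hypothetical and are not taken from the FDMart data.

```r
# Five hypothetical baskets (illustrative only, not FDMart data)
baskets <- list(
  c("Wine", "Candles", "Sauces"),
  c("Wine", "Candles"),
  c("Beer", "Gum"),
  c("Candles", "Soup"),
  c("Wine", "Soup")
)

# Fraction of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Rule A => B with A = {Candles} and B = {Wine}
supp_ab <- support(c("Candles", "Wine"), baskets)  # P(A and B)
conf_ab <- supp_ab / support("Candles", baskets)   # P(B | A)
lift_ab <- conf_ab / support("Wine", baskets)      # P(B | A) / P(B)

print(c(support = supp_ab, confidence = conf_ab, lift = lift_ab))
```

In practice apriori() computes these measures for every candidate rule; the sketch only shows what the three columns in the inspect() tables mean.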
Correlation analysis: the lift score. Lift compares how often A and B occur together with how often they would occur together if they were independent, i.e. Lift = P(A and B) / (P(A) P(B)).
- Lift = 1: A and B are independent
- Lift > 1: A and B are positively correlated
- Lift < 1: A and B are negatively correlated
Running association rule mining with the Apriori algorithm (support = 0.01, confidence = 0.2) resulted in 9224 rules, with a mean of 3.484 items and a maximum of 5 items per rule.
Plot of all 9224 rules (support = 0.01, confidence = 0.2)
# Mine association rules using Apriori algorithm implemented in arules.
rules <- apriori(transactions, parameter = list(support= 0.01 , confidence= 0.2))
summary of rules:
#summary of rules
summary(rules)
inspect top 5 rules by highest lift
# Inspect rules
#inspect(rules)
#inspect top 5 rules by highest lift
inspect(head(sort(rules, by ="lift"),5))
The item sets above are picked from the plot and reflect some very strong correlations between items, such as cooking oil, rice, and pots and pans (lift = 28.18). In addition, chips, deodorizers, pancake mix and frozen chicken have a strong correlation with shrimp. In other words, people who buy cooking oil and rice are 75% likely to buy pots and pans, and buyers who buy chips and pancake mix are 75% likely to buy shrimp. These rules and their subsets provide some very interesting patterns, discussed in the next section.
# Visualization of rules
#Plotting rules
plot(rules)
# Scatter plot of support vs. lift, shaded by confidence (non-interactive)
sel <- plot(rules, measure=c("support", "lift"), shading="confidence", interactive=FALSE)
# Two key plot
plot(rules , shading="order", control=list(main="two-key plot"))
# 1.Purchase pattern related to beverages (Wine , Beer )
#Find subset of rules that has Wine on the right hand side
RulesBev1 <- subset(rules, subset = rhs %ain% "Wine")
summary(RulesBev1)
set of 16 rules
rule length distribution (lhs + rhs):sizes
2 3
9 7
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.000 2.000 2.438 3.000 3.000
summary of quality measures:
support confidence lift
Min. :0.01015 Min. :0.2081 Min. :2.032
1st Qu.:0.01030 1st Qu.:0.2407 1st Qu.:2.349
Median :0.01150 Median :0.2982 Median :2.910
Mean :0.01228 Mean :0.3740 Mean :3.650
3rd Qu.:0.01330 3rd Qu.:0.4790 3rd Qu.:4.675
Max. :0.01719 Max. :0.6748 Max. :6.586
mining info:
data ntransactions support confidence
transactions 64809 0.01 0.2
inspect(RulesBev1)
lhs rhs support confidence lift
[1] {Spices} => {Wine} 0.01015291 0.2222973 2.169709
[2] {Candles} => {Wine} 0.01181935 0.4633999 4.522964
[3] {Fresh Chicken} => {Wine} 0.01155704 0.4434577 4.328320
[4] {Sauces} => {Wine} 0.01641747 0.5256917 5.130957
[5] {Crackers} => {Wine} 0.01144903 0.2620982 2.558181
[6] {Gum} => {Wine} 0.01078554 0.2676110 2.611988
[7] {Sponges} => {Wine} 0.01038436 0.2836072 2.768118
[8] {Cooking Oil} => {Wine} 0.01718897 0.2388508 2.331277
[9] {Rice} => {Wine} 0.01276057 0.2139715 2.088446
[10] {Candles,Fresh Vegetables} => {Wine} 0.01029178 0.6233645 6.084281
[11] {Fresh Chicken,Fresh Vegetables} => {Wine} 0.01023006 0.6356663 6.204352
[12] {Fresh Vegetables,Sauces} => {Wine} 0.01492077 0.6748081 6.586391
[13] {Cooking Oil,Fresh Vegetables} => {Wine} 0.01272971 0.3668297 3.580402
[14] {Fresh Vegetables,Rice} => {Wine} 0.01030721 0.3127341 3.052407
[15] {Fresh Vegetables,Juice} => {Wine} 0.01024549 0.2412791 2.354978
[16] {Fresh Fruit,Fresh Vegetables} => {Wine} 0.01530652 0.2081410 2.031538
#Plotting RulesBev1
plot(RulesBev1, method="matrix", measure="lift", control=list(reorder=TRUE))
Itemsets in Antecedent (LHS)
[1] "{Fresh Fruit,Fresh Vegetables}" "{Rice}"
[3] "{Spices}" "{Cooking Oil}"
[5] "{Fresh Vegetables,Juice}" "{Crackers}"
[7] "{Gum}" "{Sponges}"
[9] "{Fresh Vegetables,Rice}" "{Cooking Oil,Fresh Vegetables}"
[11] "{Fresh Chicken}" "{Candles}"
[13] "{Sauces}" "{Candles,Fresh Vegetables}"
[15] "{Fresh Chicken,Fresh Vegetables}" "{Fresh Vegetables,Sauces}"
Itemsets in Consequent (RHS)
[1] "{Wine}"
Market Basket Analysis: 1.) Purchase patterns related to beverages (Wine, Beer etc.)
a.) In the matrix plot of antecedents and consequents for these 16 rules, wine is the consequent for items such as fresh vegetables, candles, sauces and fresh chicken.
b.) Mining the rules with Wine on the rhs shows hardly any association between wine and beer. All 16 rules have Wine on the rhs, but not a single one has Beer on the lhs, so no rule from beer to wine meets the support and confidence thresholds. Wine is instead combined with items like sauces, fresh vegetables and candles. Moreover, people who buy fresh food like fresh chicken and fresh vegetables are more likely to buy wine. These are probably buyers who cook on a daily basis.
#Find subset of rules that has Wine or Beer on the left hand side.
RulesBev2 <- subset(rules, subset = lhs %ain% "Wine"|lhs %ain% "Beer" )
summary(RulesBev2)
set of 36 rules
rule length distribution (lhs + rhs):sizes
2 3
22 14
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.000 2.000 2.389 3.000 3.000
summary of quality measures:
support confidence lift
Min. :0.01023 Min. :0.2033 Min. : 1.170
1st Qu.:0.01037 1st Qu.:0.2568 1st Qu.: 2.355
Median :0.01285 Median :0.2778 Median : 2.930
Mean :0.01389 Mean :0.3766 Mean : 3.663
3rd Qu.:0.01492 3rd Qu.:0.3765 3rd Qu.: 4.110
Max. :0.03989 Max. :0.9088 Max. :11.978
mining info:
data ntransactions support confidence
transactions 64809 0.01 0.2
inspect(RulesBev2)
lhs rhs support confidence lift
[1] {Beer} => {Gum} 0.01370180 0.2723091 6.756539
[2] {Beer} => {Sour Cream} 0.01100156 0.2186446 4.609674
[3] {Beer} => {Pizza} 0.01023006 0.2033119 2.373705
[4] {Beer} => {Deodorizers} 0.01293030 0.2569764 6.200440
[5] {Beer} => {Cottage Cheese} 0.01038436 0.2063784 3.723602
[6] {Beer} => {Jam} 0.01104785 0.2195646 3.256238
[7] {Beer} => {Jelly} 0.01083183 0.2152714 3.000973
[8] {Beer} => {Frozen Chicken} 0.01365551 0.2713891 4.035902
[9] {Beer} => {Chips} 0.01607801 0.3195339 3.304399
[10] {Beer} => {Eggs} 0.01433443 0.2848819 3.144229
[11] {Beer} => {Pancake Mix} 0.01277600 0.2539098 4.709686
[12] {Beer} => {Waffles} 0.01404126 0.2790555 3.402692
[13] {Beer} => {Paper Wipes} 0.01181935 0.2348973 2.132137
[14] {Beer} => {Canned Vegetables} 0.01530652 0.3042012 2.915554
[15] {Beer} => {Cereal} 0.01382524 0.2747623 3.068069
[16] {Beer} => {Sliced Bread} 0.01399497 0.2781355 2.655914
[17] {Beer} => {Juice} 0.01396411 0.2775222 2.540746
[18] {Beer} => {Cheese} 0.01533738 0.3048145 2.106047
[19] {Beer} => {Fresh Fruit} 0.01237482 0.2459368 1.260891
[20] {Beer} => {Fresh Vegetables} 0.01816106 0.3609322 1.169524
[21] {Wine} => {Fresh Fruit} 0.02632350 0.2569277 1.317240
[22] {Wine} => {Fresh Vegetables} 0.03988644 0.3893072 1.261468
[23] {Candles,Wine} => {Fresh Vegetables} 0.01029178 0.8707572 2.821504
[24] {Fresh Vegetables,Wine} => {Candles} 0.01029178 0.2580271 10.116441
[25] {Fresh Chicken,Wine} => {Fresh Vegetables} 0.01023006 0.8851802 2.868239
[26] {Fresh Vegetables,Wine} => {Fresh Chicken} 0.01023006 0.2564797 9.841440
[27] {Sauces,Wine} => {Fresh Vegetables} 0.01492077 0.9088346 2.944886
[28] {Fresh Vegetables,Wine} => {Sauces} 0.01492077 0.3740812 11.978177
[29] {Cooking Oil,Wine} => {Fresh Vegetables} 0.01272971 0.7405745 2.399675
[30] {Fresh Vegetables,Wine} => {Cooking Oil} 0.01272971 0.3191489 4.434761
[31] {Rice,Wine} => {Fresh Vegetables} 0.01030721 0.8077388 2.617306
[32] {Fresh Vegetables,Wine} => {Rice} 0.01030721 0.2584139 4.333130
[33] {Juice,Wine} => {Fresh Vegetables} 0.01024549 0.7272727 2.356573
[34] {Fresh Vegetables,Wine} => {Juice} 0.01024549 0.2568665 2.351641
[35] {Fresh Fruit,Wine} => {Fresh Vegetables} 0.01530652 0.5814771 1.884153
[36] {Fresh Vegetables,Wine} => {Fresh Fruit} 0.01530652 0.3837524 1.967456
c.) Creating a further subset of rules with Wine or Beer on the lhs gives 36 rules, all of them with 2 or 3 items. Beer and wine do not appear together in any of these rules. Note: Beer is mostly purchased with gum, pizza, frozen food items, eggs and chips, whereas wine is frequently purchased with fresh vegetables, fresh chicken and candles.
#generating rules for beer on RHS from transactional data using apriori algorithm
beerRule <- apriori(data=transactions, parameter=list(supp=0.01, conf=0.15, minlen=2),
                    appearance=list(default="lhs", rhs="Beer"),
                    control=list(verbose=FALSE))
#Sorting Beerrule by confidence in descending order
rules1<-sort(beerRule, decreasing=TRUE,by="confidence")
summary(rules1)
set of 13 rules
rule length distribution (lhs + rhs):sizes
2
13
Min. 1st Qu. Median Mean 3rd Qu. Max.
2 2 2 2 2 2
summary of quality measures:
support confidence lift
Min. :0.01006 Min. :0.1510 Min. :3.001
1st Qu.:0.01100 1st Qu.:0.1638 1st Qu.:3.256
Median :0.01293 Median :0.1712 Median :3.403
Mean :0.01267 Mean :0.2035 Mean :4.043
3rd Qu.:0.01383 3rd Qu.:0.2319 3rd Qu.:4.610
Max. :0.01608 Max. :0.3400 Max. :6.757
mining info:
data ntransactions support confidence
transactions 64809 0.01 0.15
inspect(rules1)
lhs rhs support confidence lift
[1] {Gum} => {Beer} 0.01370180 0.3399694 6.756539
[2] {Deodorizers} => {Beer} 0.01293030 0.3119881 6.200440
[3] {Pancake Mix} => {Beer} 0.01277600 0.2369777 4.709686
[4] {Sour Cream} => {Beer} 0.01100156 0.2319453 4.609674
[5] {Frozen Chicken} => {Beer} 0.01365551 0.2030748 4.035902
[6] {Cottage Cheese} => {Beer} 0.01038436 0.1873608 3.723602
[7] {Waffles} => {Beer} 0.01404126 0.1712135 3.402692
[8] {Rice} => {Beer} 0.01006033 0.1686934 3.352607
[9] {Chips} => {Beer} 0.01607801 0.1662678 3.304399
[10] {Jam} => {Beer} 0.01104785 0.1638444 3.256238
[11] {Eggs} => {Beer} 0.01433443 0.1582084 3.144229
[12] {Cereal} => {Beer} 0.01382524 0.1543763 3.068069
[13] {Jelly} => {Beer} 0.01083183 0.1510002 3.000973
Some Visualization for above subrules
# Visualization for 1st question subrules
# plot for subrules
plot(RulesBev1,method="graph",interactive=FALSE,shading=NA)
plot(RulesBev2,method="graph",interactive=FALSE,shading=NA)
plot(beerRule,method="graph",interactive=FALSE,shading=NA)
When looking for rules with wine or beer on the left hand side (that is, finding which other items people who buy wine or beer are most likely to buy), wine and beer only ever appear separately on the lhs, which indicates that wine and beer do not go together.
Results/Findings:
1.) No rule links wine and beer at the chosen thresholds. These two items very rarely go together.
2.) Beer is mostly purchased with gum, pizza, frozen food items, eggs and chips, whereas wine is frequently purchased with fresh vegetables, fresh chicken and candles. People buy wine with items used for making dinner and full meals.
3.) There is a positive relation between candles and wine too. People who buy candles buy wine 46% of the time, and 62% of the time when fresh vegetables are also in the basket.
4.) All rules involving beer that pass the thresholds contain only 2 items, i.e. beer is associated with single items rather than with larger combinations.
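The first finding rests on the absence of any rule linking wine and beer, so it helps to know how often the pair would be expected to occur by chance. The sketch below recovers the overall share of wine and beer baskets from two rules printed above (using support(A) = support(A and B) / confidence(A => B)) and multiplies them to get the joint support expected under independence. It is an illustrative check on the printed numbers only; the variable names are not part of the original analysis.

```r
# Values copied from the inspect(RulesBev2) table in this report
wine_veg <- c(support = 0.03988644, confidence = 0.3893072)  # {Wine} => {Fresh Vegetables}
beer_gum <- c(support = 0.01370180, confidence = 0.2723091)  # {Beer} => {Gum}

# confidence(A => B) = support(A and B) / support(A),
# so support(A) = support(A and B) / confidence(A => B)
supp_wine <- wine_veg[["support"]] / wine_veg[["confidence"]]
supp_beer <- beer_gum[["support"]] / beer_gum[["confidence"]]

# Joint support expected if beer and wine purchases were independent
expected_joint <- supp_wine * supp_beer

min_support <- 0.01  # threshold passed to apriori()
print(c(supp_wine = supp_wine, supp_beer = supp_beer,
        expected_joint = expected_joint))
print(expected_joint < min_support)
```

Wine appears in roughly 10% of baskets and beer in roughly 5%, so even independent purchases would put the pair in only about 0.5% of baskets, below the 1% minimum support. The missing rule therefore shows that the pair is rare, but it cannot by itself distinguish independence from active avoidance; counting the pair directly in the transactions would be needed for that.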
# 2.Pattern with respect to canned Vs fresh
#Subrules for Fresh Vegetables on the rhs
FreshRules <- subset(rules, subset = rhs %pin% "Fresh Vegetables")
summary(FreshRules)
set of 864 rules
rule length distribution (lhs + rhs):sizes
1 2 3 4 5
1 74 296 460 33
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 3.000 4.000 3.521 4.000 5.000
summary of quality measures:
support confidence lift
Min. :0.01001 Min. :0.2198 Min. :0.7122
1st Qu.:0.01102 1st Qu.:0.6880 1st Qu.:2.2292
Median :0.01129 Median :0.8294 Median :2.6874
Mean :0.01323 Mean :0.7751 Mean :2.5116
3rd Qu.:0.01269 3rd Qu.:0.9089 3rd Qu.:2.9450
Max. :0.30861 Max. :0.9742 Max. :3.1566
mining info:
data ntransactions support confidence
transactions 64809 0.01 0.2
inspect(FreshRules[1:20])
lhs rhs support confidence lift
[1] {} => {Fresh Vegetables} 0.30861454 0.3086145 1.0000000
[2] {Canned Fruit} => {Fresh Vegetables} 0.01418013 0.4240886 1.3741692
[3] {Deli Salads} => {Fresh Vegetables} 0.01220509 0.2878457 0.9327030
[4] {Personal Hygiene} => {Fresh Vegetables} 0.01692666 0.3269747 1.0594921
[5] {Plastic Utensils} => {Fresh Vegetables} 0.01127930 0.2515485 0.8150896
[6] {Spices} => {Fresh Vegetables} 0.01904057 0.4168919 1.3508498
[7] {Popcorn} => {Fresh Vegetables} 0.01089355 0.2406271 0.7797012
[8] {Aspirin} => {Fresh Vegetables} 0.01185021 0.3221477 1.0438512
[9] {Candles} => {Fresh Vegetables} 0.01651005 0.6473079 2.0974641
[10] {Fresh Chicken} => {Fresh Vegetables} 0.01609344 0.6175252 2.0009594
[11] {Pots and Pans} => {Fresh Vegetables} 0.01660263 0.6237681 2.0211883
[12] {Tofu} => {Fresh Vegetables} 0.01154161 0.5006693 1.6223129
[13] {Fashion Magazines} => {Fresh Vegetables} 0.01274514 0.5481088 1.7760304
[14] {Popsicles} => {Fresh Vegetables} 0.01555340 0.2593927 0.8405070
[15] {Hard Candy} => {Fresh Vegetables} 0.01538367 0.4253413 1.3782283
[16] {Sauces} => {Fresh Vegetables} 0.02211113 0.7080040 2.2941367
[17] {Oysters} => {Fresh Vegetables} 0.01018377 0.4153556 1.3458717
[18] {Dips} => {Fresh Vegetables} 0.01606258 0.2440225 0.7907032
[19] {Sugar} => {Fresh Vegetables} 0.01968862 0.5085692 1.6479105
[20] {Tools} => {Fresh Vegetables} 0.01612430 0.4477292 1.4507716
2.) Canned vs Fresh
a.) Another very important category of items is canned and fresh food, which mainly comprises fresh vegetables, fresh fruit, canned vegetables and canned fruit. Looking for rules that predict fresh vegetables and fresh fruit, we found 864 rules with fresh vegetables and 133 rules with fresh fruit on the right hand side.
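Rule [1] in the fresh vegetables table has an empty lhs, so its confidence (0.3086) is simply the share of all baskets that contain fresh vegetables. Every other rule's lift is its confidence divided by that baseline, which gives a quick way to read the table: confidence above 0.3086 means the lhs item makes fresh vegetables more likely than average, and confidence below it means less likely. A minimal sketch using a few rows copied from the output above (the selection of rows is arbitrary):

```r
# Baseline: share of all transactions containing Fresh Vegetables (rule [1], empty lhs)
baseline <- 0.3086145

# Confidence values copied from inspect(FreshRules[1:20])
confidence <- c("Canned Fruit" = 0.4240886, "Deli Salads" = 0.2878457,
                "Plastic Utensils" = 0.2515485, "Candles" = 0.6473079,
                "Sauces" = 0.7080040)

# Lift of {item} => {Fresh Vegetables} is confidence relative to the baseline
lift <- confidence / baseline
direction <- ifelse(lift > 1, "more likely than average", "less likely than average")
print(data.frame(lift = round(lift, 3), direction))
```

The lift values computed this way should match the lift column printed by inspect() up to rounding.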
# Subrules for Fresh Fruit on the rhs
FreshRules1 <- subset(rules, subset = rhs %pin% "Fresh Fruit")
summary(FreshRules1)
set of 133 rules
rule length distribution (lhs + rhs):sizes
2 3 4
49 83 1
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.000 3.000 2.639 3.000 4.000
summary of quality measures:
support confidence lift
Min. :0.01003 Min. :0.2009 Min. :1.030
1st Qu.:0.01086 1st Qu.:0.3236 1st Qu.:1.659
Median :0.01202 Median :0.4008 Median :2.055
Mean :0.01463 Mean :0.4431 Mean :2.272
3rd Qu.:0.01629 3rd Qu.:0.5385 3rd Qu.:2.761
Max. :0.07354 Max. :0.8639 Max. :4.429
mining info:
data ntransactions support confidence
transactions 64809 0.01 0.2
inspect(FreshRules1[1:20])
lhs rhs support confidence lift
[1] {Spices} => {Fresh Fruit} 0.01405669 0.3077703 1.577904
[2] {Candles} => {Fresh Fruit} 0.01245197 0.4882033 2.502964
[3] {Fresh Chicken} => {Fresh Fruit} 0.01265256 0.4854944 2.489076
[4] {Fashion Magazines} => {Fresh Fruit} 0.01016834 0.4372926 2.241951
[5] {Hard Candy} => {Fresh Fruit} 0.01158790 0.3203925 1.642617
[6] {Sauces} => {Fresh Fruit} 0.01228224 0.3932806 2.016306
[7] {Tools} => {Fresh Fruit} 0.01092441 0.3033419 1.555200
[8] {Pasta} => {Fresh Fruit} 0.02163280 0.3554767 1.822489
[9] {Bologna} => {Fresh Fruit} 0.01041522 0.2085909 1.069422
[10] {TV Dinner} => {Fresh Fruit} 0.01604715 0.3209877 1.645668
[11] {Conditioner} => {Fresh Fruit} 0.01526023 0.5079610 2.604259
[12] {Mouthwash} => {Fresh Fruit} 0.01632489 0.3866959 1.982547
[13] {Coffee} => {Fresh Fruit} 0.01444244 0.2355310 1.207541
[14] {Shrimp} => {Fresh Fruit} 0.01077011 0.3541350 1.815611
[15] {Lightbulbs} => {Fresh Fruit} 0.01853138 0.2870459 1.471652
[16] {Peanut Butter} => {Fresh Fruit} 0.01317718 0.2475362 1.269091
[17] {Cleaners} => {Fresh Fruit} 0.01785246 0.3346833 1.715884
[18] {Cooking Oil} => {Fresh Fruit} 0.01502878 0.2088336 1.070667
[19] {Yogurt} => {Fresh Fruit} 0.01706553 0.3645353 1.868932
[20] {Muffins} => {Fresh Fruit} 0.01629403 0.2158626 1.106704
Results/Findings:
1.) Fresh fruit and fresh vegetables are positively correlated, and people who buy these two items also buy items like pasta, rice, juice and cheese.
lhs                                      rhs     support    confidence lift
{Fresh Fruit,Fresh Vegetables,Pasta} => {Rice}  0.01047694 0.6935649  11.629818
{Fresh Fruit,Fresh Vegetables,Rice}  => {Pasta} 0.01047694 0.5843373   9.602008
#subrule for both Fresh Fruit and Fresh Vegetable on the lhs
FreshRules2 <- subset(rules, subset = lhs %ain% c("Fresh Fruit", "Fresh Vegetables"))
summary(FreshRules2)
set of 7 rules
rule length distribution (lhs + rhs):sizes
3 4
5 2
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.000 3.000 3.000 3.286 3.500 4.000
summary of quality measures:
support confidence lift
Min. :0.01048 Min. :0.2048 Min. : 1.506
1st Qu.:0.01277 1st Qu.:0.2068 1st Qu.: 1.953
Median :0.01511 Median :0.2180 Median : 3.375
Mean :0.01434 Mean :0.3369 Mean : 4.873
3rd Qu.:0.01567 3rd Qu.:0.4141 3rd Qu.: 6.845
Max. :0.01793 Max. :0.6936 Max. :11.630
mining info:
data ntransactions support confidence
transactions 64809 0.01 0.2
inspect(FreshRules2)
lhs rhs support confidence lift
[1] {Fresh Fruit,Fresh Vegetables} => {Pasta} 0.01510593 0.2054133 3.375414
[2] {Fresh Fruit,Fresh Vegetables} => {Wine} 0.01530652 0.2081410 2.031538
[3] {Fresh Fruit,Fresh Vegetables} => {Rice} 0.01792961 0.2438103 4.088254
[4] {Fresh Fruit,Fresh Vegetables} => {Juice} 0.01505964 0.2047839 1.874818
[5] {Fresh Fruit,Fresh Vegetables} => {Cheese} 0.01603172 0.2180025 1.506239
[6] {Fresh Fruit,Fresh Vegetables,Pasta} => {Rice} 0.01047694 0.6935649 11.629818
[7] {Fresh Fruit,Fresh Vegetables,Rice} => {Pasta} 0.01047694 0.5843373 9.602008
2.) Canned fruit is not a frequent item; buyers sometimes buy it with fresh vegetables, but the chances are low. Subrules with both fresh vegetables and canned vegetables on the lhs resulted in 203 rules in which canned vegetables and fresh vegetables occur together with items such as deli meats, shrimp, rice and pasta.
#Subrule for fresh Vegetable and Canned Vegetables on lhs.
cannedRules <- subset(rules, subset = lhs %ain% c("Fresh Vegetables", "Canned Vegetables"))
summary(cannedRules)
set of 203 rules
rule length distribution (lhs + rhs):sizes
3 4
21 182
Min. 1st Qu. Median Mean 3rd Qu. Max.
3.000 4.000 4.000 3.897 4.000 4.000
summary of quality measures:
support confidence lift
Min. :0.01021 Min. :0.2571 Min. : 1.836
1st Qu.:0.01095 1st Qu.:0.7353 1st Qu.: 7.094
Median :0.01111 Median :0.7533 Median :10.252
Mean :0.01134 Mean :0.7122 Mean :10.160
3rd Qu.:0.01128 3rd Qu.:0.7705 3rd Qu.:12.981
Max. :0.01649 Max. :0.8193 Max. :19.019
mining info:
data ntransactions support confidence
transactions 64809 0.01 0.2
inspect(head(cannedRules))
lhs rhs support confidence lift
[1] {Canned Vegetables,Fresh Vegetables} => {Shrimp} 0.01027635 0.2586408 8.504439
[2] {Canned Vegetables,Fresh Vegetables} => {Peanut Butter} 0.01095527 0.2757282 5.179613
[3] {Canned Vegetables,Fresh Vegetables} => {Sour Cream} 0.01451959 0.3654369 7.704489
[4] {Canned Vegetables,Fresh Vegetables} => {Shampoo} 0.01021463 0.2570874 4.228826
[5] {Canned Vegetables,Fresh Vegetables} => {Rice} 0.01425728 0.3588350 6.017008
[6] {Canned Vegetables,Fresh Vegetables} => {Deli Meats} 0.01063124 0.2675728 3.576227
#visualization for 2nd question
#plotting first 20 subrules with high lift for fresh vegetables on rhs
subrules2 <- head(sort(FreshRules, by="lift"), 20)
plot(subrules2, method="graph")
#plotting subrule for fresh fruit on rhs
plot(FreshRules1,method="graph",interactive=FALSE,shading=NA)
#plotting subrule for fresh fruit and fresh vegetables on lhs
#plot(FreshRules2,method="graph",interactive=FALSE,shading=NA)
#Plot for comparison of fresh vegetables and canned vegetables
subrules3 <- head(sort(cannedRules, by="lift"), 10)
plot(subrules3,method="graph",interactive=FALSE,shading=NA)
3.) Fresh fruit and fresh vegetables have a strong positive correlation, and people buy them very frequently with pasta and rice.
4.) Canned vegetables and fresh vegetables are positively correlated. People buy these mostly with items that are used for cooking lunch and dinner, e.g. oil, pasta, rice, cheese, jelly, sour cream and wine.
5.) Canned fruit is not purchased frequently, and its sale is independent of fresh fruit.
3.) Small vs large transactions
# 3. Small and Large transaction
#Subrule for short rules: at most 2 items in lhs and rhs combined
rulesSmallSize <- subset(rules, subset = size(rules) <= 2)
#summary for rulesSmallSize
summary(rulesSmallSize)
set of 787 rules
rule length distribution (lhs + rhs):sizes
1 2
1 786
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 2.000 2.000 1.999 2.000 2.000
summary of quality measures:
support confidence lift
Min. :0.01001 Min. :0.2001 Min. : 0.7122
1st Qu.:0.01235 1st Qu.:0.2341 1st Qu.: 2.2477
Median :0.01575 Median :0.2775 Median : 3.1529
Mean :0.01782 Mean :0.3033 Mean : 3.6580
3rd Qu.:0.02092 3rd Qu.:0.3527 3rd Qu.: 4.7097
Max. :0.30861 Max. :0.7080 Max. :15.5760
mining info:
data ntransactions support confidence
transactions 64809 0.01 0.2
inspect(head(rulesSmallSize))
lhs rhs support confidence lift
[1] {} => {Fresh Vegetables} 0.30861454 0.3086145 1.0000000
[2] {Canned Fruit} => {Fresh Vegetables} 0.01418013 0.4240886 1.3741692
[3] {Deli Salads} => {Fresh Vegetables} 0.01220509 0.2878457 0.9327030
[4] {Personal Hygiene} => {Fresh Vegetables} 0.01692666 0.3269747 1.0594921
[5] {Plastic Utensils} => {Fresh Vegetables} 0.01127930 0.2515485 0.8150896
[6] {Spices} => {Wine} 0.01015291 0.2222973 2.1697087
Restricting the rules to those with at most 2 items gives 787 rules. A few item pairs with strong positive correlation are as follows.
lhs                rhs                support    confidence lift
{Candles}       => {Fresh Chicken}    0.01035366 0.4059286  15.5757381
{Fresh Chicken} => {Candles}          0.01035366 0.3972765  15.5757381
{Candles}       => {Sauces}           0.01027651 0.4029038  12.9008845
{Sauces}        => {Candles}          0.01027651 0.3290514  12.9008845
Some items that are negatively correlated are:
lhs                rhs                support    confidence lift
{French Fries}  => {Fresh Vegetables} 0.01151092 0.2197996  0.7122032
{Donuts}        => {Fresh Vegetables} 0.01265276 0.2388581  0.7739572
Results/Findings:
1.) This suggests that people who purchase unhealthy and sugary food buy fresh vegetables less often than average (lift below 1). Fresh vegetables are mostly purchased with sauces, candles and fresh chicken, as discussed in the previous part.
#Subrule for long rules: 5 or more items in lhs and rhs combined
rulesLargeSize <- subset(rules, subset = size(rules) >= 5 )
summary(rulesLargeSize)
set of 400 rules
rule length distribution (lhs + rhs):sizes
5
400
Min. 1st Qu. Median Mean 3rd Qu. Max.
5 5 5 5 5 5
summary of quality measures:
support confidence lift
Min. :0.01001 Min. :0.6788 Min. : 2.236
1st Qu.:0.01027 1st Qu.:0.7999 1st Qu.: 7.634
Median :0.01054 Median :0.8173 Median : 8.375
Mean :0.01063 Mean :0.8162 Mean : 9.841
3rd Qu.:0.01083 3rd Qu.:0.8345 3rd Qu.:12.220
Max. :0.01258 Max. :0.8884 Max. :19.659
mining info:
data ntransactions support confidence
transactions 64809 0.01 0.2
#inspect(rulesLargeSize)
inspect(head(sort(rulesLargeSize, by ="lift"),5))
lhs rhs support confidence lift
[1] {Cottage Cheese,
Fresh Vegetables,
Frozen Chicken,
Sliced Bread} => {Deodorizers} 0.01038436 0.8147700 19.65913
[2] {Fresh Vegetables,
Frozen Chicken,
Juice,
Sliced Bread} => {Deodorizers} 0.01058495 0.8108747 19.56514
[3] {Fresh Vegetables,
Frozen Chicken,
Pancake Mix,
Sliced Bread} => {Deodorizers} 0.01019920 0.8100490 19.54522
[4] {Cereal,
Fresh Vegetables,
Frozen Chicken,
Sliced Bread} => {Deodorizers} 0.01036893 0.8076923 19.48836
[5] {Frozen Chicken,
Juice,
Pancake Mix,
Sliced Bread} => {Deodorizers} 0.01018377 0.8068460 19.46794
2.) There are 400 rules with 5 or more items. In all of these rules the items are positively correlated (every lift is above 2).
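Note that size(rules) counts the items in a rule (lhs plus rhs), not the number of items in a customer's basket, so the two subsets above compare short and long rules. If the goal were to compare small and large baskets themselves, one possible approach is to count the items per transaction first and split the transactions on that count. The sketch below does this in base R on a tiny hypothetical table laid out like transactions.txt (one transactionID, item pair per row); the data and variable names are illustrative assumptions only.

```r
# Hypothetical transactions in "single" layout: one (transactionID, item) pair per row
tx <- data.frame(
  transactionID = c(1, 1, 2, 3, 3, 3, 3, 3, 4),
  item = c("Beer", "Gum", "Wine", "Cereal", "Milk", "Jam", "Juice",
           "Sliced Bread", "Soup"),
  stringsAsFactors = FALSE
)

# Basket size = number of distinct items per transaction
basket_size <- tapply(tx$item, tx$transactionID, function(x) length(unique(x)))

small_ids <- names(basket_size)[basket_size <= 2]  # baskets with at most 2 items
large_ids <- names(basket_size)[basket_size >= 5]  # baskets with at least 5 items

# Split the original rows by basket size
small_baskets <- tx[as.character(tx$transactionID) %in% small_ids, ]
large_baskets <- tx[as.character(tx$transactionID) %in% large_ids, ]
print(basket_size)
```

Each subset could then be mined separately, which would answer the small-versus-large basket question directly rather than through rule length.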
# Visualization for question 3
#plotting rulesSmallSize for small item basket
plot(rulesSmallSize, method="paracoord")
number of rows of result is not a multiple of vector length (arg 2)
# Interactive plot for rulesSmallSize
#sel <- plot(rulesSmallSize, measure=c("support", "lift"), shading="confidence", interactive=TRUE)
# plotting large itemset
plot(rulesLargeSize, method="paracoord")
#Interactive plot rulesLargeSize
#sel <- plot(rulesLargeSize, measure=c("support", "lift"), shading="confidence", interactive=TRUE)
3.) Long rules mostly contain items that are used to cook full meals, dinner and breakfast. Fresh vegetables, frozen chicken, juice and sliced bread are found to be positively correlated with deodorizers.
4.) Dairy (milk) Vs cereals.
# 4. One more interesting pattern: Milk and Cereal
# Subsets. find subset of rules that has Milk on the Rhs and Cereal on lhs
Rulesinterest1 <- subset(rules, subset = rhs %pin% "Milk" & lhs %ain% "Cereal")
#Summary of Rulesinterest1
summary(Rulesinterest1)
set of 24 rules
rule length distribution (lhs + rhs):sizes
2 3 4 5
1 7 15 1
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 3.000 4.000 3.667 4.000 5.000
summary of quality measures:
support confidence lift
Min. :0.01003 Min. :0.2190 Min. :2.451
1st Qu.:0.01141 1st Qu.:0.5485 1st Qu.:6.139
Median :0.01152 Median :0.6173 Median :6.910
Mean :0.01234 Mean :0.5899 Mean :6.603
3rd Qu.:0.01359 3rd Qu.:0.6768 3rd Qu.:7.576
Max. :0.01961 Max. :0.7088 Max. :7.934
mining info:
data ntransactions support confidence
transactions 64809 0.01 0.2
inspect(Rulesinterest1)
lhs rhs support confidence lift
[1] {Cereal} => {Milk} 0.01961147 0.2189869 2.451178
[2] {Cereal,Cottage Cheese} => {Milk} 0.01365551 0.5466337 6.118616
[3] {Cereal,Jam} => {Milk} 0.01394868 0.5854922 6.553569
[4] {Cereal,Jelly} => {Milk} 0.01371723 0.5746606 6.432328
[5] {Cereal,Waffles} => {Milk} 0.01356293 0.5369578 6.010311
[6] {Cereal,Sliced Bread} => {Milk} 0.01407212 0.4532803 5.073686
[7] {Cereal,Juice} => {Milk} 0.01396411 0.4408183 4.934196
[8] {Cereal,Fresh Vegetables} => {Milk} 0.01138731 0.3152499 3.528675
[9] {Cereal,Cottage Cheese,Jam} => {Milk} 0.01146446 0.6956929 7.787074
[10] {Cereal,Cottage Cheese,Jelly} => {Milk} 0.01126387 0.6932574 7.759813
[11] {Cereal,Cottage Cheese,Waffles} => {Milk} 0.01120215 0.6836158 7.651893
[12] {Cereal,Cottage Cheese,Sliced Bread} => {Milk} 0.01155704 0.6035455 6.755645
[13] {Cereal,Cottage Cheese,Juice} => {Milk} 0.01158790 0.6100731 6.828710
[14] {Cereal,Jam,Jelly} => {Milk} 0.01144903 0.6745455 7.550366
[15] {Cereal,Jam,Waffles} => {Milk} 0.01147989 0.6901670 7.725221
[16] {Cereal,Jam,Sliced Bread} => {Milk} 0.01177306 0.6758193 7.564624
[17] {Cereal,Jam,Juice} => {Milk} 0.01181935 0.6666667 7.462176
[18] {Cereal,Jelly,Waffles} => {Milk} 0.01114043 0.6798493 7.609733
[19] {Cereal,Jelly,Sliced Bread} => {Milk} 0.01141817 0.6654676 7.448755
[20] {Cereal,Jelly,Juice} => {Milk} 0.01147989 0.6549296 7.330800
[21] {Cereal,Sliced Bread,Waffles} => {Milk} 0.01140274 0.6158333 6.893185
[22] {Cereal,Juice,Waffles} => {Milk} 0.01144903 0.6188490 6.926941
[23] {Cereal,Juice,Sliced Bread} => {Milk} 0.01174220 0.5490620 6.145797
[24] {Cereal,Jam,Juice,Sliced Bread} => {Milk} 0.01002947 0.7088332 7.934157
Milk is a frequently purchased item, and one item that very often occurs with it is cereal. Analysing the 24 rules with Milk on the rhs and Cereal on the lhs shows that the two items are positively correlated. Buyers who buy cereal with jelly, jam or sliced bread also buy milk, and vice versa.
#Subsets. find subset of rules that has Milk on the lhs and Cereal on rhs
Rulesinterest2 <- subset(rules, subset = lhs %ain% "Milk" & rhs %ain% "Cereal")
summary(Rulesinterest2)
set of 24 rules
rule length distribution (lhs + rhs):sizes
2 3 4 5
1 7 15 1
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 3.000 4.000 3.667 4.000 5.000
summary of quality measures:
support confidence lift
Min. :0.01003 Min. :0.2195 Min. :2.451
1st Qu.:0.01141 1st Qu.:0.6986 1st Qu.:7.801
Median :0.01152 Median :0.8246 Median :9.208
Mean :0.01234 Mean :0.7369 Mean :8.228
3rd Qu.:0.01359 3rd Qu.:0.8365 3rd Qu.:9.340
Max. :0.01961 Max. :0.8519 Max. :9.513
mining info:
data ntransactions support confidence
transactions 64809 0.01 0.2
inspect(Rulesinterest2)
lhs rhs support confidence lift
[1] {Milk} => {Cereal} 0.01961147 0.2195164 2.451178
[2] {Cottage Cheese,Milk} => {Cereal} 0.01365551 0.7577055 8.460740
[3] {Jam,Milk} => {Cereal} 0.01394868 0.7062500 7.886174
[4] {Jelly,Milk} => {Cereal} 0.01371723 0.7186742 8.024906
[5] {Milk,Waffles} => {Cereal} 0.01356293 0.6756341 7.544309
[6] {Milk,Sliced Bread} => {Cereal} 0.01407212 0.5595092 6.247628
[7] {Juice,Milk} => {Cereal} 0.01396411 0.5441972 6.076650
[8] {Fresh Vegetables,Milk} => {Cereal} 0.01138731 0.3744292 4.180976
[9] {Cottage Cheese,Jam,Milk} => {Cereal} 0.01146446 0.8320269 9.290632
[10] {Cottage Cheese,Jelly,Milk} => {Cereal} 0.01126387 0.8381171 9.358637
[11] {Cottage Cheese,Milk,Waffles} => {Cereal} 0.01120215 0.8268793 9.233153
[12] {Cottage Cheese,Milk,Sliced Bread} => {Cereal} 0.01155704 0.8359375 9.334299
[13] {Cottage Cheese,Juice,Milk} => {Cereal} 0.01158790 0.8400447 9.380162
[14] {Jam,Jelly,Milk} => {Cereal} 0.01144903 0.8318386 9.288530
[15] {Jam,Milk,Waffles} => {Cereal} 0.01147989 0.8175824 9.129342
[16] {Jam,Milk,Sliced Bread} => {Cereal} 0.01177306 0.8293478 9.260717
[17] {Jam,Juice,Milk} => {Cereal} 0.01181935 0.8426843 9.409636
[18] {Jelly,Milk,Waffles} => {Cereal} 0.01114043 0.8223235 9.182281
[19] {Jelly,Milk,Sliced Bread} => {Cereal} 0.01141817 0.8390023 9.368521
[20] {Jelly,Juice,Milk} => {Cereal} 0.01147989 0.8406780 9.387233
[21] {Milk,Sliced Bread,Waffles} => {Cereal} 0.01140274 0.8211111 9.168744
[22] {Juice,Milk,Waffles} => {Cereal} 0.01144903 0.8281250 9.247063
[23] {Juice,Milk,Sliced Bread} => {Cereal} 0.01174220 0.6320598 7.057747
[24] {Jam,Juice,Milk,Sliced Bread} => {Cereal} 0.01002947 0.8519004 9.512545
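The two tables above list the same 24 item combinations in both directions, with identical support but different confidence and lift except for the first rule. For the plain pair, lift is symmetric: {Cereal} => {Milk} and {Milk} => {Cereal} must have the same lift, while confidence depends on which item is on the lhs. The sketch below checks this using only the numbers printed in rule [1] of each table; the variable names are illustrative.

```r
# Rule [1] of each table: values copied from the inspect() output above
supp_pair <- 0.01961147           # support of {Cereal, Milk}
conf_cereal_to_milk <- 0.2189869  # {Cereal} => {Milk}
conf_milk_to_cereal <- 0.2195164  # {Milk} => {Cereal}

# Marginal supports implied by confidence = support(pair) / support(lhs)
supp_cereal <- supp_pair / conf_cereal_to_milk
supp_milk <- supp_pair / conf_milk_to_cereal

# Lift computed from either direction: confidence / support(rhs)
lift_cereal_to_milk <- conf_cereal_to_milk / supp_milk
lift_milk_to_cereal <- conf_milk_to_cereal / supp_cereal

print(c(supp_cereal = supp_cereal, supp_milk = supp_milk,
        lift_cereal_to_milk = lift_cereal_to_milk,
        lift_milk_to_cereal = lift_milk_to_cereal))
```

Both directions reproduce the lift printed for rule [1], and the implied marginal supports show that cereal and milk each appear in roughly 9% of baskets.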
#Visualization for question 4;
#Plot for Rulesinterest1 that has Milk on the rhs and cereal on the lhs
plot(Rulesinterest1, method="paracoord")
#Plot for Rulesinterest2 that has milk on the lhs and cereal on the rhs
plot(Rulesinterest2, method="graph")
Results/Findings:
1.) People who buy cereal are likely to buy milk: confidence is 22% for cereal alone and rises to 71% when other breakfast items such as jam, juice and sliced bread are in the basket.
2.) Buyers who buy milk are also likely to buy cereal, with up to 85% confidence when milk is combined with other breakfast items.
3.) Cereal and milk are positively correlated and mostly purchased together with other items that are used in making breakfast.
Recommendations: The output of the analysis reflects how frequently items co-occur in transactions. This is a function both of the strength of association between the items and of the way the FDMart owner has presented them in different aisles. Closely related items might co-occur not because they are “naturally” connected, but because the way they are placed on the shelves motivates people to buy them together.
Item pairs that frequently sell together should be placed close together within broader categories. For example, our data shows that wine and candles are very often sold together. Since good wines are costly and bring high profits to the business, wine can be placed next to fragrant candles. In this way the grocery owner can couple a popular item like candles with a high-margin item like wine.
In addition, the grocery store can offer discount coupons for items that are frequently bundled together, like beer and gum. Giving some discounts and placing these two items next to each other will motivate buyers who only visited the shop to purchase gum to buy beer as well. This could drive a significant uplift in profit.
Baskets contain 5 to 6 items on average. Coupling high-price items with low-price products can significantly increase the total number of items purchased in a single transaction. Moreover, items that are very likely to be sold together, for example milk and cereal, can be coupled with sliced bread. Since milk and cereal are very frequently bought together and milk also goes with sliced bread, we can connect bread and cereal in order to increase the sale of bread.
There are so many possibilities as you go on with the analysis with data mining.