suppressWarnings(suppressMessages(library(data.table)))
suppressWarnings(suppressMessages(library(knitr)))
suppressWarnings(suppressMessages(library(arules)))
suppressWarnings(suppressMessages(library(arulesViz)))
Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with the items that were purchased. The receipt is a record of what went into a customer’s basket, hence the name ‘Market Basket Analysis’. That is exactly what the Groceries Data Set contains: a collection of receipts, with each line representing one receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached. Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift, and your top 10 rules by lift.
groc <- read.transactions("https://raw.githubusercontent.com/gpsingh12/Data624/master/Recommender-System/GroceryDataSet.csv", sep=",")
summary(groc)
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda
##             2513             1903             1809             1715
##           yogurt          (Other)
##             1372            34055
##
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15
## 2159 1643 1299 1005  855  645  545  438  350  246  182  117   78   77   55
##   16   17   18   19   20   21   22   23   24   26   27   28   29   32
##   46   29   14   14    9   11    4    6    1    1    1    1    3    1
##
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   2.000   3.000   4.409   6.000  32.000
##
## includes extended item information - examples:
##             labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3   baby cosmetics
head(sort(itemFrequency(groc), decreasing =TRUE), n=10)
##       whole milk other vegetables       rolls/buns             soda
##       0.25551601       0.19349263       0.18393493       0.17437722
##           yogurt    bottled water  root vegetables   tropical fruit
##       0.13950178       0.11052364       0.10899847       0.10493137
##    shopping bags          sausage
##       0.09852567       0.09395018
itemFrequencyPlot(groc, topN=10)
We will use the transaction data to find associations between the items in the dataset. The Apriori algorithm will be used to build association rules between the items. Before calling this function, we will get familiar with its parameters, support and confidence.
Support: the frequency with which an item (or set of items) occurs in the dataset. Consider 10 transactions, of which 5 contain apples, 3 contain eggs and 2 contain pens. The absolute support of apples is the count, 5. The relative support is that count divided by the total number of transactions (i.e., support(apples) = 5/10 = 0.5). In our case we set the minimum support to 0.001. The total number of transactions (nrow(groc)) is 9835, and apriori reports this threshold below as an absolute minimum support count of 9.
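As a minimal sketch of this definition, relative support can be computed by hand in base R. The toy baskets and the helper name `support_of` are made up for illustration and are not part of the Groceries data.

```r
# Toy transactions (hypothetical data, not taken from the Groceries set)
baskets <- list(
  c("apples", "eggs"),
  c("apples", "pens"),
  c("eggs"),
  c("apples", "eggs", "pens")
)

# Relative support of an itemset: the share of transactions
# that contain every item in the set
support_of <- function(items, transactions) {
  mean(vapply(transactions, function(tr) all(items %in% tr), logical(1)))
}

support_of("apples", baskets)             # apples appear in 3 of 4 baskets
support_of(c("apples", "eggs"), baskets)  # both appear together in 2 of 4
```

For single items in the real data, `itemFrequency(groc)` reports this same relative support.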
Confidence: how likely item Y is to be purchased when item X is purchased, written {X -> Y}. It measures the association between two items. For example, a person who buys milk may be more likely to buy bread as well. It is the proportion of transactions containing X in which Y also appears, that is, the support of X and Y together divided by the support of X.
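The sketch below works through confidence on a toy example, and also lift, which the assignment asks us to report. Lift is the confidence of {X -> Y} divided by the support of Y, so a value above 1 means X and Y appear together more often than they would if they were independent. The baskets and the helper names (`support_of`, `confidence_of`, `lift_of`) are hypothetical.

```r
# Toy transactions (hypothetical data, not taken from the Groceries set)
baskets <- list(
  c("milk", "bread"),
  c("milk", "bread", "eggs"),
  c("milk"),
  c("bread"),
  c("eggs")
)

# Share of transactions containing every item in `items`
support_of <- function(items, transactions) {
  mean(vapply(transactions, function(tr) all(items %in% tr), logical(1)))
}

# confidence(X -> Y) = support(X and Y) / support(X)
confidence_of <- function(lhs, rhs, transactions) {
  support_of(c(lhs, rhs), transactions) / support_of(lhs, transactions)
}

# lift(X -> Y) = confidence(X -> Y) / support(Y)
lift_of <- function(lhs, rhs, transactions) {
  confidence_of(lhs, rhs, transactions) / support_of(rhs, transactions)
}

confidence_of("milk", "bread", baskets)  # 2 of the 3 milk baskets hold bread
lift_of("milk", "bread", baskets)        # (2/3) / (3/5)
```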
minlen is set to 2, since we do not want rules that contain only a single item (that is, rules with an empty left-hand side).
https://www.quora.com/What-is-support-and-confidence-in-data-mining
We now run the apriori algorithm with these settings.
rule_groc <- apriori(groc, parameter = list(support = 0.001, confidence = 0.6, minlen = 2))
## Apriori
##
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.6    0.1    1 none FALSE            TRUE       5   0.001      2
##  maxlen target   ext
##      10  rules FALSE
##
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.02s].
## checking subsets of size 1 2 3 4 5 6 done [0.00s].
## writing ... [2918 rule(s)] done [0.00s].
## creating S4 object ... done [0.01s].
Inspect rules - Top 10 rules by lift
inspect(sort(rule_groc, by = "lift")[1:10])
##      lhs                        rhs                      support confidence      lift count
## [1]  {Instant food products,
##       soda}                  => {hamburger meat}     0.001220132  0.6315789 18.995654    12
## [2]  {popcorn,
##       soda}                  => {salty snack}        0.001220132  0.6315789 16.697793    12
## [3]  {ham,
##       processed cheese}      => {white bread}        0.001931876  0.6333333 15.045491    19
## [4]  {other vegetables,
##       tropical fruit,
##       white bread,
##       yogurt}                => {butter}             0.001016777  0.6666667 12.030581    10
## [5]  {hamburger meat,
##       whipped/sour cream,
##       yogurt}                => {butter}             0.001016777  0.6250000 11.278670    10
## [6]  {domestic eggs,
##       other vegetables,
##       tropical fruit,
##       whole milk,
##       yogurt}                => {butter}             0.001016777  0.6250000 11.278670    10
## [7]  {liquor,
##       red/blush wine}        => {bottled beer}       0.001931876  0.9047619 11.235269    19
## [8]  {butter,
##       other vegetables,
##       sugar}                 => {whipped/sour cream} 0.001016777  0.7142857  9.964539    10
## [9]  {butter,
##       hard cheese,
##       whole milk}            => {whipped/sour cream} 0.001423488  0.6666667  9.300236    14
## [10] {butter,
##       fruit/vegetable juice,
##       other vegetables,
##       tropical fruit}        => {whipped/sour cream} 0.001016777  0.6666667  9.300236    10
We selected our top 10 rules based on lift. The list shows that instant food products and soda are frequently sold together with hamburger meat (lift of about 19). Similarly, the association of salty snack with popcorn and soda is very strong (lift of about 16.7). Rule 7 shows that the confidence of bottled beer given liquor and red/blush wine is about 90%: a customer who buys both liquor and red/blush wine is very likely to buy bottled beer as well. All ten rules have low support, covering only 10 to 19 transactions each. The rules can also be sorted by confidence to find the strongest rules by that measure.
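Sorting by confidence, as mentioned above, could look like the following sketch. To stay runnable on its own it uses the `Groceries` data set bundled with arules as a stand-in for `groc`; that substitution is an assumption. With the objects already created in this report, only the `sort` call on `rule_groc` would be needed.

```r
library(arules)

# Assumption: the Groceries data shipped with arules stands in for `groc`
data("Groceries")

# Same thresholds as in the report; verbose = FALSE hides the progress log
rules <- apriori(
  Groceries,
  parameter = list(support = 0.001, confidence = 0.6, minlen = 2),
  control = list(verbose = FALSE)
)

# Rank by confidence instead of lift and keep the first ten rules
top_conf <- head(sort(rules, by = "confidence"), n = 10)
inspect(top_conf)
```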
# Graph-based plot of the first 10 rules in rule_groc
suppressWarnings(plot(rule_groc[1:10], method = "graph"))
The graph shows how the items in these rules are associated with one another. By adjusting the confidence threshold, we can carry out further analysis.
Further Analysis: Depending on specific requirements, the parameters of the apriori algorithm can be set to obtain the required results. For example, if sales of milk are high, we can use the algorithm to find other items that are associated with milk and have a higher chance of being sold along with it.
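One way to explore the milk example is to restrict the right-hand side of the rules to whole milk through the `appearance` argument of `apriori`. This is only a sketch: it reuses the thresholds from this report and, as an assumption, the `Groceries` data bundled with arules in place of `groc`.

```r
library(arules)

# Assumption: the Groceries data shipped with arules stands in for `groc`
data("Groceries")

# Only mine rules of the form {other items} => {whole milk}
milk_rules <- apriori(
  Groceries,
  parameter = list(support = 0.001, confidence = 0.6, minlen = 2),
  appearance = list(default = "lhs", rhs = "whole milk"),
  control = list(verbose = FALSE)
)

# Item combinations that most strongly point to a milk purchase
inspect(head(sort(milk_rules, by = "lift"), n = 5))
```

Swapping `lhs` and `rhs` in the `appearance` list would instead look for items that tend to follow a milk purchase, although with these thresholds few or no such rules may qualify.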