Problem 1
Imagine 10000 receipts sitting on your table. Each receipt represents a transaction with items that were purchased. The receipt is a representation of stuff that went into a customer’s basket - and therefore ‘Market Basket Analysis’. That is exactly what the Groceries Data Set contains: a collection of receipts with each line representing 1 receipt and the items purchased. Each line is called a transaction and each column in a row represents an item. The data set is attached. Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift and your top 10 rules by lift.
Extra credit: do a simple cluster analysis on the data as well. Use whichever packages you like. Due May 5 before midnight.
I’ll be using the arules package, which provides the infrastructure for representing, manipulating and analyzing transaction data and patterns (frequent itemsets and association rules). It also provides C implementations of the association mining algorithms Apriori and Eclat. We’ll begin by loading our data and using the read.transactions function to convert our CSV into a format that can be analyzed with the arules package.
library(arules)
grocery.data <- read.transactions("GroceryDataSet.csv", format = "basket", header = FALSE, sep = ",")
For the format parameter we can choose either ‘basket’, where each line in the data file represents one transaction and the items (item labels) are separated by the character specified by sep, or ‘single’, where each line corresponds to a single item and contains at least an id for the transaction and the item. From the instructions, we know that each line represents a transaction and each column is an item, so ‘basket’ is the right choice here.
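To make the difference between the two formats concrete, here is a small sketch that writes the same three made-up transactions to temporary files in both layouts and reads them back. The file contents and object names are illustrative assumptions, not part of the assignment data.

```r
library(arules)

# Basket format: one transaction per line, items separated by commas.
basket_file <- tempfile(fileext = ".csv")
writeLines(c("milk,bread", "bread,butter,jam", "milk"), basket_file)
basket <- read.transactions(basket_file, format = "basket", sep = ",")

# Single format: one (transaction id, item) pair per line.
single_file <- tempfile(fileext = ".csv")
writeLines(c("1,milk", "1,bread", "2,bread", "2,butter", "2,jam", "3,milk"),
           single_file)
# cols gives the column positions of the transaction id and the item.
single <- read.transactions(single_file, format = "single", sep = ",",
                            cols = c(1, 2))

# Both objects should describe the same three transactions.
inspect(basket)
inspect(single)
```

Either layout ends up as the same kind of transactions object, so everything downstream (item frequencies, apriori) works the same way.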
Now that we have our data loaded, we’ll take a look at the frequency of each item in the dataset using the itemFrequencyPlot() function.
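The plotting call is not shown in this write-up, so here is a minimal sketch of what it could look like. It assumes the Groceries data bundled with arules as a stand-in for GroceryDataSet.csv, and the choice of topN = 20 is mine, not necessarily what was used for the original plot.

```r
library(arules)

# Assumption: the bundled Groceries data stands in for GroceryDataSet.csv.
data("Groceries")

# Relative frequency (support) of every single item, most frequent first.
item_freq <- sort(itemFrequency(Groceries), decreasing = TRUE)
head(item_freq, 5)

# Bar chart of the 20 most frequent items.
itemFrequencyPlot(Groceries, topN = 20, type = "relative")
```

Sorting the itemFrequency() vector first is handy when you want the exact numbers behind the bars rather than reading them off the chart.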
A few things I noticed: whole milk appeared the most, and other vegetables, rolls/buns, soda and yogurt round out the top 5 most frequent items. I wondered why whipped and sour cream are grouped together as a single item. Now I’ll use the apriori function to mine association rules from our dataset.
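The apriori call itself is not shown, only its log. A minimal sketch consistent with the parameter specification printed in that log (support 0.001, confidence 0.8, maxlen 3) might look like the following. It assumes the Groceries data bundled with arules as a stand-in for GroceryDataSet.csv, and the object name rules is my own.

```r
library(arules)

# Assumption: the bundled Groceries data stands in for GroceryDataSet.csv.
data("Groceries")

# Mine rules with at least 0.1% support and 80% confidence,
# limited to at most 3 items per rule (lhs + rhs).
rules <- apriori(Groceries,
                 parameter = list(support = 0.001,
                                  confidence = 0.8,
                                  maxlen = 3))

# Overview of rule lengths and quality measures.
summary(rules)
```

A support threshold this low is needed because even the most popular item combinations are rare among thousands of receipts; with 9835 transactions, 0.001 corresponds to the absolute minimum count of 9 reported in the log.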
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 3 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [29 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
## set of 29 rules
##
## rule length distribution (lhs + rhs):sizes
## 3
## 29
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3 3 3 3 3 3
##
## summary of quality measures:
## support confidence lift count
## Min. :0.001017 Min. :0.8000 Min. : 3.131 Min. :10.00
## 1st Qu.:0.001118 1st Qu.:0.8125 1st Qu.: 3.261 1st Qu.:11.00
## Median :0.001220 Median :0.8462 Median : 3.613 Median :12.00
## Mean :0.001473 Mean :0.8613 Mean : 4.000 Mean :14.48
## 3rd Qu.:0.001729 3rd Qu.:0.9091 3rd Qu.: 4.199 3rd Qu.:17.00
## Max. :0.002542 Max. :1.0000 Max. :11.235 Max. :25.00
##
## mining info:
## data ntransactions support confidence
## grocery.data 9835 0.001 0.8
Using the is.redundant function, we will remove redundant rules and then use the inspect function to see how the remaining rules rank. We will look at the output ranked by confidence (the share of transactions containing the left-hand side that also contain the right-hand side), lift (how many times more often the two sides occur together than we would expect if they were independent), and support (the share of all transactions that contain every item in the rule).
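As a quick sanity check on how the three measures relate, the arithmetic can be worked through by hand for the rule {liquor, red/blush wine} => {bottled beer}, using only the numbers reported in the output (count 19, confidence 0.9047619, lift 11.235269, 9835 transactions). The variable names are my own.

```r
# Figures reported for {liquor, red/blush wine} => {bottled beer}.
n_transactions <- 9835       # total transactions
count_both     <- 19         # transactions containing lhs and rhs together
confidence     <- 0.9047619  # reported confidence
lift           <- 11.235269  # reported lift

# Support: share of all transactions that contain every item in the rule.
support <- count_both / n_transactions

# Confidence = count(lhs and rhs) / count(lhs), so the lhs count follows.
count_lhs <- count_both / confidence

# Lift = confidence / support(rhs), so the rhs support follows as well.
support_rhs <- confidence / lift

round(c(support = support, count_lhs = count_lhs, support_rhs = support_rhs), 6)
```

Read this way, bottled beer shows up in roughly 8% of all baskets but in about 90% of the roughly 21 baskets that contain both liquor and red/blush wine, which is where the lift of about 11 comes from.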
## set of 29 rules
##
## rule length distribution (lhs + rhs):sizes
## 3
## 29
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3 3 3 3 3 3
##
## summary of quality measures:
## support confidence lift count
## Min. :0.001017 Min. :0.8000 Min. : 3.131 Min. :10.00
## 1st Qu.:0.001118 1st Qu.:0.8125 1st Qu.: 3.261 1st Qu.:11.00
## Median :0.001220 Median :0.8462 Median : 3.613 Median :12.00
## Mean :0.001473 Mean :0.8613 Mean : 4.000 Mean :14.48
## 3rd Qu.:0.001729 3rd Qu.:0.9091 3rd Qu.: 4.199 3rd Qu.:17.00
## Max. :0.002542 Max. :1.0000 Max. :11.235 Max. :25.00
##
## mining info:
## data ntransactions support confidence
## grocery.data 9835 0.001 0.8
## lhs rhs support confidence lift count
## [1] {liquor,
## red/blush wine} => {bottled beer} 0.001931876 0.9047619 11.235269 19
## [2] {cereals,
## curd} => {whole milk} 0.001016777 0.9090909 3.557863 10
## [3] {cereals,
## yogurt} => {whole milk} 0.001728521 0.8095238 3.168192 17
## [4] {butter,
## jam} => {whole milk} 0.001016777 0.8333333 3.261374 10
## [5] {bottled beer,
## soups} => {whole milk} 0.001118454 0.9166667 3.587512 11
## [6] {house keeping products,
## napkins} => {whole milk} 0.001321810 0.8125000 3.179840 13
## [7] {house keeping products,
## whipped/sour cream} => {whole milk} 0.001220132 0.9230769 3.612599 12
## [8] {pastry,
## sweet spreads} => {whole milk} 0.001016777 0.9090909 3.557863 10
## [9] {curd,
## turkey} => {other vegetables} 0.001220132 0.8000000 4.134524 12
## [10] {rice,
## sugar} => {whole milk} 0.001220132 1.0000000 3.913649 12
## [11] {butter,
## rice} => {whole milk} 0.001525165 0.8333333 3.261374 15
## [12] {domestic eggs,
## rice} => {whole milk} 0.001118454 0.8461538 3.311549 11
## [13] {bottled water,
## rice} => {whole milk} 0.001220132 0.9230769 3.612599 12
## [14] {rice,
## yogurt} => {other vegetables} 0.001931876 0.8260870 4.269346 19
## [15] {mustard,
## oil} => {whole milk} 0.001220132 0.8571429 3.354556 12
## [16] {canned fish,
## hygiene articles} => {whole milk} 0.001118454 1.0000000 3.913649 11
## [17] {fruit/vegetable juice,
## herbs} => {other vegetables} 0.001220132 0.8000000 4.134524 12
## [18] {herbs,
## shopping bags} => {other vegetables} 0.001931876 0.8260870 4.269346 19
## [19] {herbs,
## tropical fruit} => {whole milk} 0.002338587 0.8214286 3.214783 23
## [20] {herbs,
## rolls/buns} => {whole milk} 0.002440264 0.8000000 3.130919 24
## [21] {chocolate,
## pickled vegetables} => {whole milk} 0.001220132 0.8571429 3.354556 12
## [22] {grapes,
## onions} => {other vegetables} 0.001118454 0.9166667 4.737476 11
## [23] {margarine,
## meat} => {other vegetables} 0.001728521 0.8500000 4.392932 17
## [24] {hard cheese,
## oil} => {other vegetables} 0.001118454 0.9166667 4.737476 11
## [25] {butter milk,
## onions} => {other vegetables} 0.001321810 0.8125000 4.199126 13
## [26] {butter milk,
## pork} => {other vegetables} 0.001830198 0.8571429 4.429848 18
## [27] {onions,
## waffles} => {other vegetables} 0.001220132 0.8000000 4.134524 12
## [28] {curd,
## hamburger meat} => {whole milk} 0.002541942 0.8064516 3.156169 25
## [29] {bottled beer,
## hamburger meat} => {whole milk} 0.001728521 0.8095238 3.168192 17
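When comparing the summaries of the two rule sets, I noticed that there was no difference. I realized this was because we had set the maxlen parameter to 3: every one of the 29 rules contains exactly three items, so the set has no shorter, more general rule that could make any of them redundant.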
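The pruning step can be sketched as follows, again assuming the bundled Groceries data as a stand-in for GroceryDataSet.csv and using object names of my own. Comparing the rule counts before and after, together with the rule sizes, shows directly whether anything was dropped.

```r
library(arules)

# Assumption: the bundled Groceries data stands in for GroceryDataSet.csv.
data("Groceries")
rules <- apriori(Groceries,
                 parameter = list(support = 0.001, confidence = 0.8, maxlen = 3),
                 control = list(verbose = FALSE))

# A rule is flagged as redundant when a more general rule in the same set
# (same rhs, fewer lhs items) has equal or higher confidence.
redundant <- is.redundant(rules)
pruned <- rules[!redundant]

# Number of rules before and after pruning, and how long the rules are.
c(before = length(rules), after = length(pruned))
table(size(rules))
```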
## lhs rhs support confidence lift count
## [1] {liquor,
## red/blush wine} => {bottled beer} 0.001931876 0.9047619 11.235269 19
## [2] {cereals,
## curd} => {whole milk} 0.001016777 0.9090909 3.557863 10
## [3] {cereals,
## yogurt} => {whole milk} 0.001728521 0.8095238 3.168192 17
## [4] {butter,
## jam} => {whole milk} 0.001016777 0.8333333 3.261374 10
## [5] {bottled beer,
## soups} => {whole milk} 0.001118454 0.9166667 3.587512 11
## [6] {house keeping products,
## napkins} => {whole milk} 0.001321810 0.8125000 3.179840 13
## [7] {house keeping products,
## whipped/sour cream} => {whole milk} 0.001220132 0.9230769 3.612599 12
## [8] {pastry,
## sweet spreads} => {whole milk} 0.001016777 0.9090909 3.557863 10
## [9] {curd,
## turkey} => {other vegetables} 0.001220132 0.8000000 4.134524 12
## [10] {rice,
## sugar} => {whole milk} 0.001220132 1.0000000 3.913649 12
## [11] {butter,
## rice} => {whole milk} 0.001525165 0.8333333 3.261374 15
## [12] {domestic eggs,
## rice} => {whole milk} 0.001118454 0.8461538 3.311549 11
## [13] {bottled water,
## rice} => {whole milk} 0.001220132 0.9230769 3.612599 12
## [14] {rice,
## yogurt} => {other vegetables} 0.001931876 0.8260870 4.269346 19
## [15] {mustard,
## oil} => {whole milk} 0.001220132 0.8571429 3.354556 12
## [16] {canned fish,
## hygiene articles} => {whole milk} 0.001118454 1.0000000 3.913649 11
## [17] {fruit/vegetable juice,
## herbs} => {other vegetables} 0.001220132 0.8000000 4.134524 12
## [18] {herbs,
## shopping bags} => {other vegetables} 0.001931876 0.8260870 4.269346 19
## [19] {herbs,
## tropical fruit} => {whole milk} 0.002338587 0.8214286 3.214783 23
## [20] {herbs,
## rolls/buns} => {whole milk} 0.002440264 0.8000000 3.130919 24
## [21] {chocolate,
## pickled vegetables} => {whole milk} 0.001220132 0.8571429 3.354556 12
## [22] {grapes,
## onions} => {other vegetables} 0.001118454 0.9166667 4.737476 11
## [23] {margarine,
## meat} => {other vegetables} 0.001728521 0.8500000 4.392932 17
## [24] {hard cheese,
## oil} => {other vegetables} 0.001118454 0.9166667 4.737476 11
## [25] {butter milk,
## onions} => {other vegetables} 0.001321810 0.8125000 4.199126 13
## [26] {butter milk,
## pork} => {other vegetables} 0.001830198 0.8571429 4.429848 18
## [27] {onions,
## waffles} => {other vegetables} 0.001220132 0.8000000 4.134524 12
## [28] {curd,
## hamburger meat} => {whole milk} 0.002541942 0.8064516 3.156169 25
## [29] {bottled beer,
## hamburger meat} => {whole milk} 0.001728521 0.8095238 3.168192 17
## lhs rhs support confidence lift count
## [1] {liquor,
## red/blush wine} => {bottled beer} 0.001931876 0.9047619 11.235269 19
## [2] {cereals,
## curd} => {whole milk} 0.001016777 0.9090909 3.557863 10
## [3] {cereals,
## yogurt} => {whole milk} 0.001728521 0.8095238 3.168192 17
## [4] {butter,
## jam} => {whole milk} 0.001016777 0.8333333 3.261374 10
## [5] {bottled beer,
## soups} => {whole milk} 0.001118454 0.9166667 3.587512 11
## [6] {house keeping products,
## napkins} => {whole milk} 0.001321810 0.8125000 3.179840 13
## [7] {house keeping products,
## whipped/sour cream} => {whole milk} 0.001220132 0.9230769 3.612599 12
## [8] {pastry,
## sweet spreads} => {whole milk} 0.001016777 0.9090909 3.557863 10
## [9] {curd,
## turkey} => {other vegetables} 0.001220132 0.8000000 4.134524 12
## [10] {rice,
## sugar} => {whole milk} 0.001220132 1.0000000 3.913649 12
## [11] {butter,
## rice} => {whole milk} 0.001525165 0.8333333 3.261374 15
## [12] {domestic eggs,
## rice} => {whole milk} 0.001118454 0.8461538 3.311549 11
## [13] {bottled water,
## rice} => {whole milk} 0.001220132 0.9230769 3.612599 12
## [14] {rice,
## yogurt} => {other vegetables} 0.001931876 0.8260870 4.269346 19
## [15] {mustard,
## oil} => {whole milk} 0.001220132 0.8571429 3.354556 12
## [16] {canned fish,
## hygiene articles} => {whole milk} 0.001118454 1.0000000 3.913649 11
## [17] {fruit/vegetable juice,
## herbs} => {other vegetables} 0.001220132 0.8000000 4.134524 12
## [18] {herbs,
## shopping bags} => {other vegetables} 0.001931876 0.8260870 4.269346 19
## [19] {herbs,
## tropical fruit} => {whole milk} 0.002338587 0.8214286 3.214783 23
## [20] {herbs,
## rolls/buns} => {whole milk} 0.002440264 0.8000000 3.130919 24
## [21] {chocolate,
## pickled vegetables} => {whole milk} 0.001220132 0.8571429 3.354556 12
## [22] {grapes,
## onions} => {other vegetables} 0.001118454 0.9166667 4.737476 11
## [23] {margarine,
## meat} => {other vegetables} 0.001728521 0.8500000 4.392932 17
## [24] {hard cheese,
## oil} => {other vegetables} 0.001118454 0.9166667 4.737476 11
## [25] {butter milk,
## onions} => {other vegetables} 0.001321810 0.8125000 4.199126 13
## [26] {butter milk,
## pork} => {other vegetables} 0.001830198 0.8571429 4.429848 18
## [27] {onions,
## waffles} => {other vegetables} 0.001220132 0.8000000 4.134524 12
## [28] {curd,
## hamburger meat} => {whole milk} 0.002541942 0.8064516 3.156169 25
## [29] {bottled beer,
## hamburger meat} => {whole milk} 0.001728521 0.8095238 3.168192 17
## support confidence lift count
## 1 0.001931876 0.9047619 11.235269 19
## 2 0.001016777 0.9090909 3.557863 10
## 3 0.001728521 0.8095238 3.168192 17
## 4 0.001016777 0.8333333 3.261374 10
## 5 0.001118454 0.9166667 3.587512 11
## 6 0.001321810 0.8125000 3.179840 13
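Since the assignment asks for the top 10 rules by lift, here is a hedged sketch of one way to pull them out. As before, it assumes the bundled Groceries data as a stand-in for GroceryDataSet.csv, and the names rules and top10 are my own.

```r
library(arules)

# Assumption: the bundled Groceries data stands in for GroceryDataSet.csv.
data("Groceries")
rules <- apriori(Groceries,
                 parameter = list(support = 0.001, confidence = 0.8, maxlen = 3),
                 control = list(verbose = FALSE))

# sort() orders rules in decreasing order of the chosen measure by default,
# so the first 10 after sorting are the 10 rules with the highest lift.
top10 <- head(sort(rules, by = "lift"), n = 10)
inspect(top10)

# The same rules as a plain data frame, convenient for a report table.
top10_df <- as(top10, "data.frame")
top10_df[, c("rules", "support", "confidence", "lift")]
```

Note that sort() returns a new object and leaves rules unchanged, so the sorted result has to be assigned or passed straight to inspect() for the ordering to show up in the output. Swapping by = "lift" for "confidence" or "support" gives the other two rankings.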
I’m going to go ahead and plot to see the associations.