Homework 13: Imagine 10,000 receipts sitting on your table. Each receipt represents a transaction with the items that were purchased. The receipt is a record of what went into a customer’s basket, hence the name ‘Market Basket Analysis’. That is exactly what the Groceries data set contains: a collection of receipts, with each line representing one receipt and the items purchased. Each line is called a transaction, and each column in a row represents an item. The data set is attached. Your assignment is to use R to mine the data for association rules. You should report support, confidence and lift, and your top 10 rules by lift.
The first step is to read in the data and do a preliminary analysis.
## transactions as itemMatrix in sparse format with
## 9835 rows (elements/itemsets/transactions) and
## 169 columns (items) and a density of 0.02609146
##
## most frequent items:
##       whole milk other vegetables       rolls/buns             soda
##             2513             1903             1809             1715
##           yogurt          (Other)
##             1372            34055
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 2159 1643 1299 1005 855 645 545 438 350 246 182 117 78 77 55
## 16 17 18 19 20 21 22 23 24 26 27 28 29 32
## 46 29 14 14 9 11 4 6 1 1 1 1 3 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 3.000 4.409 6.000 32.000
##
## includes extended item information - examples:
## labels
## 1 abrasive cleaner
## 2 artif. sweetener
## 3 baby cosmetics
Reading the CSV with the read.transactions function from the arules package and summarizing the result shows what the transactions look like: the density of the item matrix, the most frequent items, the distribution of transaction sizes, and so on. I’m curious to know what ‘baby cosmetics’ are.
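A minimal sketch of this step, assuming the arules package is installed. The file name of the attached CSV is not given here, so the sketch writes a hypothetical three-receipt file to a temporary path and reads that instead; the object name `toy` is also made up for illustration.

```r
library(arules)

# Hypothetical stand-in for the attached CSV: one receipt per line,
# items separated by commas
path <- tempfile(fileext = ".csv")
writeLines(c("whole milk,yogurt",
             "rolls/buns,soda,whole milk",
             "soda"), path)

# format = "basket" means each line is one transaction
toy <- read.transactions(path, format = "basket", sep = ",")

# Same kind of summary as shown above: dimensions, density, frequent items
summary(toy)
```

For the real data, `path` would point at the attached file instead of the temporary one.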
# Create an item frequency plot for the top 20 items
# Load the RColorBrewer color palettes, installing the package first if it is missing
if (!require("RColorBrewer")) {
  install.packages("RColorBrewer")
  library(RColorBrewer)
}
## Loading required package: RColorBrewer
I used itemFrequencyPlot to create an item frequency bar plot, which shows how often the most common items appear in the transactions.
The next step is to mine the rules using the Apriori algorithm. The function apriori() is from the arules package.
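A sketch of how such a plot can be drawn. It assumes that the `Groceries` data set bundled with arules is the same data as the attached CSV (the dimensions in the summary above match); the palette and the plot title are illustrative choices, not necessarily the ones used for this report.

```r
library(arules)
library(RColorBrewer)

# Assumption: the built-in Groceries data stands in for the attached CSV
data("Groceries")

# Absolute counts of the five most frequent items, to compare with the summary
top5 <- head(sort(itemFrequency(Groceries, type = "absolute"),
                  decreasing = TRUE), 5)
top5

# Bar plot of the 20 most frequent items
itemFrequencyPlot(Groceries, topN = 20, type = "absolute",
                  col = brewer.pal(8, "Pastel2"),
                  main = "Absolute item frequency")
```

Switching `type` to `"relative"` plots the share of transactions containing each item instead of the raw count.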
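A sketch of a call that is consistent with the parameter listing printed below (support 0.001, confidence 0.8, rule length between 2 and 10). It assumes the built-in `Groceries` data matches the attached file; the exact call used for this report is not shown, so treat the argument list as a reconstruction.

```r
library(arules)

# Assumption: the built-in Groceries data stands in for the attached CSV
data("Groceries")

# Thresholds copied from the "Parameter specification" output
association.rules <- apriori(
  Groceries,
  parameter = list(support = 0.001, confidence = 0.8,
                   minlen = 2, maxlen = 10)
)

# Rule length distribution and quality measures
summary(association.rules)
```

A support of 0.001 on 9835 transactions corresponds to the absolute minimum support count of 9 reported in the output.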
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 2
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.02s].
## writing ... [410 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
## set of 410 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5 6
## 29 229 140 12
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 4.000 4.329 5.000 6.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.001017 Min. :0.8000 Min. : 3.131 Min. :10.00
## 1st Qu.:0.001017 1st Qu.:0.8333 1st Qu.: 3.312 1st Qu.:10.00
## Median :0.001220 Median :0.8462 Median : 3.588 Median :12.00
## Mean :0.001247 Mean :0.8663 Mean : 3.951 Mean :12.27
## 3rd Qu.:0.001322 3rd Qu.:0.9091 3rd Qu.: 4.341 3rd Qu.:13.00
## Max. :0.003152 Max. :1.0000 Max. :11.235 Max. :31.00
##
## mining info:
## data ntransactions support confidence
## data 9835 0.001 0.8
The algorithm generated 410 rules. I will take a look at the first 10 of them; note that they appear in the order apriori() returned them, not sorted by confidence or lift.
## lhs rhs support confidence lift count
## [1] {liquor,
## red/blush wine} => {bottled beer} 0.001931876 0.9047619 11.235269 19
## [2] {cereals,
## curd} => {whole milk} 0.001016777 0.9090909 3.557863 10
## [3] {cereals,
## yogurt} => {whole milk} 0.001728521 0.8095238 3.168192 17
## [4] {butter,
## jam} => {whole milk} 0.001016777 0.8333333 3.261374 10
## [5] {bottled beer,
## soups} => {whole milk} 0.001118454 0.9166667 3.587512 11
## [6] {house keeping products,
## napkins} => {whole milk} 0.001321810 0.8125000 3.179840 13
## [7] {house keeping products,
## whipped/sour cream} => {whole milk} 0.001220132 0.9230769 3.612599 12
## [8] {pastry,
## sweet spreads} => {whole milk} 0.001016777 0.9090909 3.557863 10
## [9] {curd,
## turkey} => {other vegetables} 0.001220132 0.8000000 4.134524 12
## [10] {rice,
## sugar} => {whole milk} 0.001220132 1.0000000 3.913649 12
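The assignment asks for the top 10 rules by lift. A sketch of how the rules can be ranked that way, assuming the built-in `Groceries` data and the same thresholds as above; `top.by.lift` is a name chosen for illustration.

```r
library(arules)

# Assumption: the built-in Groceries data stands in for the attached CSV
data("Groceries")

association.rules <- apriori(
  Groceries,
  parameter = list(support = 0.001, confidence = 0.8,
                   minlen = 2, maxlen = 10),
  control = list(verbose = FALSE)
)

# Order the rules by decreasing lift and keep the first ten
top.by.lift <- head(sort(association.rules, by = "lift"), 10)
inspect(top.by.lift)
```

Passing `by = "confidence"` or `by = "support"` to `sort()` ranks the rules on those measures instead.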
To make sure that there are no redundant rules, I will check association.rules for rules that are in a subset relationship with another rule and remove those redundancies.
subset.rules <- which(colSums(is.subset(association.rules, association.rules)) > 1)
length(subset.rules)
## [1] 91
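The line above only finds the flagged rules; dropping them takes one more indexing step. A sketch, again assuming the built-in `Groceries` data. `pruned.rules` is a hypothetical name, and `sparse = FALSE` is added here so that `colSums()` works on an ordinary logical matrix.

```r
library(arules)

# Assumption: the built-in Groceries data stands in for the attached CSV
data("Groceries")

association.rules <- apriori(
  Groceries,
  parameter = list(support = 0.001, confidence = 0.8,
                   minlen = 2, maxlen = 10),
  control = list(verbose = FALSE)
)

# Entry [i, j] is TRUE when the items of rule i are all contained in rule j.
# Every rule contains itself, so a column sum above 1 flags a rule whose
# items contain all the items of at least one other rule.
contained <- is.subset(association.rules, association.rules, sparse = FALSE)
subset.rules <- which(colSums(contained) > 1)

# Negative indexing drops the flagged rules
pruned.rules <- association.rules[-subset.rules]
length(pruned.rules)
```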
Now, I will rerun the algorithm and see if there is any change in the rules.
## lhs rhs support confidence lift count
## [1] {liquor,
## red/blush wine} => {bottled beer} 0.001931876 0.9047619 11.235269 19
## [2] {cereals,
## curd} => {whole milk} 0.001016777 0.9090909 3.557863 10
## [3] {cereals,
## yogurt} => {whole milk} 0.001728521 0.8095238 3.168192 17
## [4] {butter,
## jam} => {whole milk} 0.001016777 0.8333333 3.261374 10
## [5] {bottled beer,
## soups} => {whole milk} 0.001118454 0.9166667 3.587512 11
## [6] {house keeping products,
## napkins} => {whole milk} 0.001321810 0.8125000 3.179840 13
## [7] {house keeping products,
## whipped/sour cream} => {whole milk} 0.001220132 0.9230769 3.612599 12
## [8] {pastry,
## sweet spreads} => {whole milk} 0.001016777 0.9090909 3.557863 10
## [9] {curd,
## turkey} => {other vegetables} 0.001220132 0.8000000 4.134524 12
## [10] {rice,
## sugar} => {whole milk} 0.001220132 1.0000000 3.913649 12
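The three measures in the table can be checked by hand. The sketch below uses only numbers printed in this report (9835 transactions, 2513 baskets with whole milk, and a count of 12 and confidence of 1 for rule [10]) and the standard definitions: support is the rule count divided by the number of transactions, and lift is the confidence divided by the support of the rhs item.

```r
# Figures copied from the output above
n.transactions <- 9835   # total number of receipts
count.rule     <- 12     # baskets with rice, sugar and whole milk (rule [10])
count.milk     <- 2513   # baskets containing whole milk
confidence     <- 1.0    # confidence of {rice, sugar} => {whole milk}

# support = share of all baskets that contain every item in the rule
support <- count.rule / n.transactions

# lift = confidence relative to how common the rhs item is overall
lift <- confidence / (count.milk / n.transactions)

round(support, 9)
round(lift, 6)
```

A lift of about 3.9 means that baskets with rice and sugar contain whole milk roughly 3.9 times as often as baskets in general.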
It looks like there is no change in the top 10 rules. Several of the rules are associated with whole milk, so I would like to target whole milk as both the lhs and the rhs to see if there are any similarities.
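A sketch of how an item can be pinned to one side of the rules with the `appearance` argument of apriori(). It assumes the built-in `Groceries` data; the object names `milk.rhs.rules` and `milk.lhs.rules` are hypothetical.

```r
library(arules)

# Assumption: the built-in Groceries data stands in for the attached CSV
data("Groceries")

# Whole milk fixed as the rhs; any other item may appear on the lhs
milk.rhs.rules <- apriori(
  Groceries,
  parameter = list(support = 0.001, confidence = 0.8),
  appearance = list(default = "lhs", rhs = "whole milk"),
  control = list(verbose = FALSE)
)

# Whole milk as the only item allowed on the lhs; any other item on the rhs
milk.lhs.rules <- apriori(
  Groceries,
  parameter = list(support = 0.001, confidence = 0.8),
  appearance = list(lhs = "whole milk", default = "rhs"),
  control = list(verbose = FALSE)
)

length(milk.rhs.rules)
length(milk.lhs.rules)
```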
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4 5 6 done [0.01s].
## writing ... [252 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
## set of 252 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4 5 6
## 18 146 81 7
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 4.000 4.000 4.306 5.000 6.000
##
## summary of quality measures:
## support confidence lift count
## Min. :0.001017 Min. :0.8000 Min. :3.131 Min. :10.00
## 1st Qu.:0.001017 1st Qu.:0.8333 1st Qu.:3.261 1st Qu.:10.00
## Median :0.001220 Median :0.8481 Median :3.319 Median :12.00
## Mean :0.001256 Mean :0.8689 Mean :3.401 Mean :12.36
## 3rd Qu.:0.001322 3rd Qu.:0.9091 3rd Qu.:3.558 3rd Qu.:13.00
## Max. :0.002847 Max. :1.0000 Max. :3.914 Max. :28.00
##
## mining info:
## data ntransactions support confidence
## data 9835 0.001 0.8
## lhs rhs support confidence lift count
## [1] {cereals,
## curd} => {whole milk} 0.001016777 0.9090909 3.557863 10
## [2] {cereals,
## yogurt} => {whole milk} 0.001728521 0.8095238 3.168192 17
## [3] {butter,
## jam} => {whole milk} 0.001016777 0.8333333 3.261374 10
## [4] {bottled beer,
## soups} => {whole milk} 0.001118454 0.9166667 3.587512 11
## [5] {house keeping products,
## napkins} => {whole milk} 0.001321810 0.8125000 3.179840 13
## [6] {house keeping products,
## whipped/sour cream} => {whole milk} 0.001220132 0.9230769 3.612599 12
## [7] {pastry,
## sweet spreads} => {whole milk} 0.001016777 0.9090909 3.557863 10
## [8] {rice,
## sugar} => {whole milk} 0.001220132 1.0000000 3.913649 12
## [9] {butter,
## rice} => {whole milk} 0.001525165 0.8333333 3.261374 15
## [10] {domestic eggs,
## rice} => {whole milk} 0.001118454 0.8461538 3.311549 11
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 9
##
## set item appearances ...[1 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.00s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 done [0.00s].
## writing ... [0 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
## set of 0 rules
It appears that there are only rules when ‘whole milk’ is on the right-hand side.
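The empty result for whole milk on the left-hand side follows from the item counts alone. apriori() produces rules with a single item on the rhs, so a rule {whole milk} => {X} at confidence 0.8 would need X in at least 80% of the whole milk baskets. The sketch below does that arithmetic using the counts from the summary at the top of this report.

```r
# Counts taken from the "most frequent items" summary
count.milk <- 2513   # baskets containing whole milk
count.next <- 1903   # other vegetables, the next most frequent item
min.conf   <- 0.8    # confidence threshold used for mining

# Baskets that would have to contain both whole milk and the rhs item
needed <- ceiling(min.conf * count.milk)

# Upper bound on confidence, reached only if every basket with
# other vegetables also contained whole milk
best.possible.confidence <- count.next / count.milk

needed
round(best.possible.confidence, 3)
```

Since no other item appears in as many baskets as `needed`, no rule with only whole milk on the lhs can reach the 0.8 threshold, whatever the support setting.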
Now, let’s graph some of these rules. I will subset the data to only look at the top 10 rules.
Now, I will do the same with the milk association rules.
The graph plots show the rules as vertices and edges, where the vertices are labeled with item names. This is one way to visualize the association rules.
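A sketch of how such a graph can be produced, assuming the arulesViz package is installed alongside arules and that the built-in `Groceries` data matches the attached file. Picking the ten rules by confidence is an assumption made for this illustration, and `top10.rules` is a hypothetical name.

```r
library(arules)
library(arulesViz)

# Assumption: the built-in Groceries data stands in for the attached CSV
data("Groceries")

association.rules <- apriori(
  Groceries,
  parameter = list(support = 0.001, confidence = 0.8,
                   minlen = 2, maxlen = 10),
  control = list(verbose = FALSE)
)

# Keep the ten rules with the highest confidence
top10.rules <- head(association.rules, n = 10, by = "confidence")

# Items and rules become vertices; edges connect each rule to its items
plot(top10.rules, method = "graph")
```

The same `plot()` call applied to a rule set mined with whole milk as the rhs would give the second graph.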
Reference: https://www.datacamp.com/community/tutorials/market-basket-analysis-r