First, install the package arules.
Second, load the data set using the following command:
grd <- read.transactions("http://fimi.ua.ac.be/data/retail.dat", format = "basket")
Third, run the following commands and interpret the results:
itemFrequencyPlot(grd, support = .1) # run with support .2, .3, & .5
summary(grd)
inspect(grd) # you will have to stop the listing manually
Create the rules object using apriori:
grdar <- apriori(grd, parameter = list(supp = .05, conf = .5))
inspect(grdar)
require(arules)
## Loading required package: arules
## Warning: package 'arules' was built under R version 3.4.4
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
grd <- read.transactions("http://fimi.ua.ac.be/data/retail.dat", format = "basket")
itemFrequencyPlot(grd, support = .1)
itemFrequencyPlot(grd, support = .2)
itemFrequencyPlot(grd, support = .3)
itemFrequencyPlot(grd, support = .5)
summary(grd)
## transactions as itemMatrix in sparse format with
## 88162 rows (elements/itemsets/transactions) and
## 16470 columns (items) and a density of 0.0006257289
##
## most frequent items:
## 39 48 38 32 41 (Other)
## 50675 42135 15596 15167 14945 770058
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 3016 5516 6919 7210 6814 6163 5746 5143 4660 4086 3751 3285 2866 2620 2310
## 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
## 2115 1874 1645 1469 1290 1205 981 887 819 684 586 582 472 480 355
## 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45
## 310 303 272 234 194 136 153 123 115 112 76 66 71 60 50
## 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 44 37 37 33 22 24 21 21 10 11 10 9 11 4 9
## 61 62 63 64 65 66 67 68 71 73 74 76
## 7 4 5 2 2 5 3 3 1 1 1 1
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 4.00 8.00 10.31 14.00 76.00
##
## includes extended item information - examples:
## labels
## 1 0
## 2 1
## 3 10
grdar <- apriori(grd, parameter = list(supp = .05, conf = .5))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.5 0.1 1 none FALSE TRUE 5 0.05 1
## maxlen target ext
## 10 rules FALSE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 4408
##
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[16470 item(s), 88162 transaction(s)] done [0.22s].
## sorting and recoding items ... [6 item(s)] done [0.02s].
## creating transaction tree ... done [0.05s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object ... done [0.02s].
inspect(grdar)
## lhs rhs support confidence lift count
## [1] {} => {39} 0.57479413 0.5747941 1.0000000 50675
## [2] {38} => {48} 0.09010685 0.5093614 1.0657723 7944
## [3] {38} => {39} 0.11734080 0.6633111 1.1539977 10345
## [4] {32} => {48} 0.09112770 0.5297026 1.1083338 8034
## [5] {32} => {39} 0.09590300 0.5574603 0.9698434 8455
## [6] {41} => {48} 0.10228897 0.6034125 1.2625621 9018
## [7] {41} => {39} 0.12946621 0.7637337 1.3287082 11414
## [8] {48} => {39} 0.33055058 0.6916340 1.2032726 29142
## [9] {39} => {48} 0.33055058 0.5750765 1.2032726 29142
## [10] {38,48} => {39} 0.06921349 0.7681269 1.3363513 6102
## [11] {38,39} => {48} 0.06921349 0.5898502 1.2341847 6102
## [12] {32,48} => {39} 0.06127356 0.6723923 1.1697968 5402
## [13] {32,39} => {48} 0.06127356 0.6389119 1.3368399 5402
## [14] {41,48} => {39} 0.08355074 0.8168108 1.4210493 7366
## [15] {39,41} => {48} 0.08355074 0.6453478 1.3503063 7366
Now that you have rules,…
1. Find a few interesting rules
a. Rule #1 tells us that 57% of the transactions contain item #39.
b. Rules #3, #7, #8, #10, #12, and #14 all have a lift above 1 and a confidence between 0.66 and 0.82. They tell us that when a transaction contains item #38, item #41, item #48, both #38 and #48, both #32 and #48, or both #41 and #48, the probability that it also contains item #39 is high.
The more frequent an item is in the transactions, the more often it tends to appear in the rules once a minimum support and a minimum confidence are set. For instance, item #39 has the highest frequency (57%) among all items, and it appears in 12 of the 15 rules.
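The count of rules per item can be checked with a few lines of base R. This is a minimal sketch: the list rule_items is transcribed by hand from the inspect(grdar) listing above (left-hand and right-hand sides combined), and its name is only illustrative.

```r
# Items in each of the 15 rules (lhs and rhs combined),
# transcribed from the inspect(grdar) output above.
rule_items <- list(
  c("39"),                                   # rule 1
  c("38", "48"), c("38", "39"),              # rules 2-3
  c("32", "48"), c("32", "39"),              # rules 4-5
  c("41", "48"), c("41", "39"),              # rules 6-7
  c("48", "39"), c("39", "48"),              # rules 8-9
  c("38", "48", "39"), c("38", "39", "48"),  # rules 10-11
  c("32", "48", "39"), c("32", "39", "48"),  # rules 12-13
  c("41", "48", "39"), c("39", "41", "48")   # rules 14-15
)

# Number of rules in which each item appears, most frequent first
rules_per_item <- sort(table(unlist(rule_items)), decreasing = TRUE)
rules_per_item
```

With the rules object itself, the same subset should be obtainable with arules' subset(grdar, items %in% "39"), which avoids the manual transcription.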
A lift greater than 1 indicates a positive association between the left-hand side item(s) and the right-hand side item(s), and the larger the lift, the stronger the association.
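To see where a lift value comes from, the three measures for rule [14] ({41,48} => {39}) can be recomputed from counts that already appear in the output above. This is a sketch of the arithmetic only; the variable names are illustrative.

```r
# Counts read from the summary(grd) and inspect(grdar) output above
n_transactions <- 88162  # total number of transactions
count_lhs_rhs  <- 7366   # transactions containing 41, 48 and 39 (count of rule 14)
count_lhs      <- 9018   # transactions containing 41 and 48 (count of rule 6)
count_rhs      <- 50675  # transactions containing 39 (count of rule 1)

# support: share of all transactions that contain lhs and rhs together
support <- count_lhs_rhs / n_transactions
# confidence: share of lhs transactions that also contain rhs
confidence <- count_lhs_rhs / count_lhs
# lift: confidence relative to the baseline frequency of rhs
lift <- confidence / (count_rhs / n_transactions)

round(c(support = support, confidence = confidence, lift = lift), 4)
```

The results should agree with the support, confidence, and lift columns reported for rule [14]. A lift of 1 would mean that knowing the left-hand side tells us nothing about item #39.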
3. Show all your steps (especially in data conversion) using knitR
Next, tell what you would like to do next with the retail data
1. Is there a hypothesis you would like to test?
Since item #39 appears in 57% of the transactions, it would be a good idea to run a chi-squared test to check whether the quantity of sales depends on item #39. If it does, I could increase the quantity of sales by attracting people to buy item #39.
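One way to make this hypothesis concrete, assuming "quantity of sales" is read as the number of other items in a basket, is a chi-squared test of independence between binned basket size and the presence of item #39. The sketch below relies on base R's chisq.test. The helper basket_independence_test and its size bins are illustrative assumptions, not part of arules.

```r
# Hypothetical helper: chi-squared test of independence between
# "basket contains the item" and a binned count of the other items.
#   contains_item: logical vector, one entry per transaction
#   basket_size:   number of items in each transaction
#   breaks:        bin edges for basket size (an arbitrary choice)
basket_independence_test <- function(contains_item, basket_size,
                                     breaks = c(-Inf, 2, 5, 10, Inf)) {
  stopifnot(length(contains_item) == length(basket_size))
  # Exclude the item itself so it does not inflate its own basket size
  other_items <- basket_size - as.integer(contains_item)
  size_bin <- cut(other_items, breaks = breaks)
  chisq.test(table(contains_item, size_bin))
}

# Usage with the transactions object (requires arules and grd):
# basket_independence_test(grd %in% "39", size(grd))
```

With 88,162 transactions even a small dependence will be statistically significant, so the table of observed against expected counts is likely to be more informative than the p-value alone.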
2. Is there data from another source you would like to add?
It would be helpful to add the unmasked data if possible, for example what these anonymized items really represent. The business context would also allow us to ask more meaningful questions of the data. Without this background information, the data set only lets us practice the association analysis packages in R; it keeps us from generating business insights or using the rules in the real world.
3. Is there a predictive model you would like to build?
The dependent variable would be a binary variable indicating whether a specific transaction contains item #39. The independent variables could be whether the transaction contains each of the other items, plus some possible interactions between these variables. The ideal model could be either a decision tree or a logistic regression.
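The logistic regression option could be sketched as follows. The helpers item_indicators and fit_item_model are hypothetical names introduced for illustration, and the predictor items in the usage comment are simply the other frequent items from the summary. The sketch assumes the transactions are available as a list of character vectors, which arules can produce with as(grd, "list").

```r
# Hypothetical helper: one 0/1 column per item of interest.
#   baskets: list of character vectors (the item labels of each transaction)
#   items:   item labels to turn into indicator columns
item_indicators <- function(baskets, items) {
  cols <- lapply(items, function(it) {
    as.integer(vapply(baskets, function(b) it %in% b, logical(1),
                      USE.NAMES = FALSE))
  })
  names(cols) <- paste0("item", items)
  as.data.frame(cols)
}

# Hypothetical helper: logistic regression of "contains target"
# on "contains each predictor item" (main effects only).
fit_item_model <- function(baskets, target, predictors) {
  df <- item_indicators(baskets, c(target, predictors))
  model_formula <- reformulate(paste0("item", predictors),
                               response = paste0("item", target))
  glm(model_formula, data = df, family = binomial)
}

# Usage with the transactions object (requires arules and grd):
# fit <- fit_item_model(as(grd, "list"), target = "39",
#                       predictors = c("48", "41", "38", "32"))
# summary(fit)
```

Interactions could be added afterwards with update(), for example update(fit, . ~ . + item48:item41).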
4. Anything else?
5. Tell me what would be interesting to you to do next
It would be interesting to lower the support and confidence thresholds and check whether any rule has a higher lift. With lower thresholds, more rules could be identified whose right-hand side item has a low frequency on its own but a high confidence once the left-hand side items are present. Such rules have a higher lift, which indicates a stronger association between the left-hand side and right-hand side items.
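A minimal sketch of that next step, using arules' apriori, sort, and head. The helper name top_rules_by_lift and the default thresholds (support 0.01, confidence 0.3) are assumptions chosen for illustration, not values taken from the analysis above.

```r
library(arules)

# Hypothetical helper: mine rules at the given thresholds and
# return the n rules with the highest lift.
top_rules_by_lift <- function(trans, supp = 0.01, conf = 0.3, n = 10) {
  rules <- apriori(trans,
                   parameter = list(supp = supp, conf = conf),
                   control = list(verbose = FALSE))
  # sort() orders rules by the chosen quality measure, highest first
  head(sort(rules, by = "lift"), n)
}

# Usage with the transactions object from above:
# inspect(top_rules_by_lift(grd))
```

Lower thresholds can produce far more rules than the 15 found above, so sorting and keeping only the top few keeps the listing readable.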
Tell me about a project you would like to do with Association Analysis. It can be a project at work. Or, suppose you could download data from data.gov on healthcare, or education, or whatever. What would you like to do Association Analysis on if you could?
ANSWER: If traffic accident data were available, I would like to run an association analysis to examine the links between the drivers that caused the accidents.