Problem

Part 1

  1. First, install the package arules

  2. Second, load the data set using the following command

library(arules)
## Warning: package 'arules' was built under R version 3.4.4
## Loading required package: Matrix
## 
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
## 
##     abbreviate, write
grd <- read.transactions("http://fimi.ua.ac.be/data/retail.dat", format="basket")
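The `basket` format read above expects one transaction per line, with item labels separated by whitespace. As a minimal sketch of that conversion step, the snippet below writes a three-line toy file and reads it the same way. The file contents and the names `basket_file` and `toy` are made up for illustration and are not taken from the retail data.

```r
library(arules)

# Hypothetical toy file: one transaction per line, items separated by spaces
basket_file <- tempfile(fileext = ".dat")
writeLines(c("39 48 38", "39 41", "48"), basket_file)

# Same call pattern as for the retail file, with an explicit separator
toy <- read.transactions(basket_file, format = "basket", sep = " ")

dim(toy)      # number of transactions, number of distinct items
size(toy)     # number of items in each transaction
inspect(toy)  # safe to print here because the toy set is tiny
```

Each distinct label becomes one column of the sparse item matrix, which is why the retail data ends up with thousands of columns and a very low density.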
  3. Third, run the following commands and interpret the results, then rerun with support .2, .3, and .5
itemFrequencyPlot(grd,support=.1) 

itemFrequencyPlot(grd,support=.2)

itemFrequencyPlot(grd,support=.3)

itemFrequencyPlot(grd,support=.5)

summary(grd)
## transactions as itemMatrix in sparse format with
##  88162 rows (elements/itemsets/transactions) and
##  16470 columns (items) and a density of 0.0006257289 
## 
## most frequent items:
##      39      48      38      32      41 (Other) 
##   50675   42135   15596   15167   14945  770058 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 3016 5516 6919 7210 6814 6163 5746 5143 4660 4086 3751 3285 2866 2620 2310 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30 
## 2115 1874 1645 1469 1290 1205  981  887  819  684  586  582  472  480  355 
##   31   32   33   34   35   36   37   38   39   40   41   42   43   44   45 
##  310  303  272  234  194  136  153  123  115  112   76   66   71   60   50 
##   46   47   48   49   50   51   52   53   54   55   56   57   58   59   60 
##   44   37   37   33   22   24   21   21   10   11   10    9   11    4    9 
##   61   62   63   64   65   66   67   68   71   73   74   76 
##    7    4    5    2    2    5    3    3    1    1    1    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    4.00    8.00   10.31   14.00   76.00 
## 
## includes extended item information - examples:
##   labels
## 1      0
## 2      1
## 3     10
#inspect(grd) #you will have to stop the listing manually
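The figures in the summary output can be cross-checked by hand, which also helps interpret the item frequency plots. The sketch below uses only the counts printed by `summary(grd)`; the variable names are illustrative. Because the five listed items are the most frequent ones, no other item can have a relative support above that of item 41, so the item sets at thresholds .2, .3, and .5 are complete, while the set at .1 may include further items not listed in the summary.

```r
# Counts copied from the summary(grd) output
n_trans     <- 88162
n_items     <- 16470
top_counts  <- c("39" = 50675, "48" = 42135, "38" = 15596, "32" = 15167, "41" = 14945)
other_count <- 770058

# Density = filled cells / all cells of the transaction-by-item matrix
total_entries <- sum(top_counts) + other_count
density       <- total_entries / (n_trans * n_items)
mean_size     <- total_entries / n_trans  # average number of items per transaction

# Relative support of the five most frequent items
item_support <- top_counts / n_trans
print(round(item_support, 3))

# Which of these items clear each threshold passed to itemFrequencyPlot()
thresholds <- c(0.1, 0.2, 0.3, 0.5)
shown <- lapply(thresholds, function(s) names(item_support)[item_support >= s])
names(shown) <- as.character(thresholds)
print(shown)
```

Only item 39 clears the .5 threshold, and only items 39 and 48 clear .2 and .3, so the last three plots differ very little.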
  4. Create the rules object using apriori
grdar <- apriori(grd,parameter=list(supp=.05,conf=.5))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.05      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 4408 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[16470 item(s), 88162 transaction(s)] done [0.14s].
## sorting and recoding items ... [6 item(s)] done [0.00s].
## creating transaction tree ... done [0.03s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object  ... done [0.02s].
#inspect(grdar)
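Since the rule listing is commented out above, here is a minimal sketch of how rules can be sorted by lift and printed. It runs on a handful of hypothetical toy baskets rather than the retail file. The item labels reuse IDs from the retail data only for readability, and the thresholds are chosen to suit six transactions.

```r
library(arules)

# Hypothetical toy baskets; NOT the retail data
toy_baskets <- list(
  c("39", "48"),
  c("39", "48", "38"),
  c("39", "41"),
  c("48", "38"),
  c("39", "48", "41"),
  c("32")
)
toy_trans <- as(toy_baskets, "transactions")

# Mine rules; verbose = FALSE suppresses the progress log
toy_rules <- apriori(
  toy_trans,
  parameter = list(supp = 0.3, conf = 0.6),
  control   = list(verbose = FALSE)
)

# Sort by lift so the strongest associations come first, then list them
toy_rules_by_lift <- sort(toy_rules, by = "lift")
inspect(toy_rules_by_lift)

# Quality measures as a plain data frame, convenient for filtering
toy_quality <- quality(toy_rules_by_lift)
```

For the real rules object the same pattern would be `inspect(sort(grdar, by = "lift"))`, which prints all 15 rules without the need to stop the listing manually.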

Part 2

Now that you have rules:

  1. Find a few interesting rules
    • Rule #1 shows that about 57% of all transactions (50675 of 88162) include item #39.
    • Rules #3, #7, #8, #10, #12, and #14, with their high lift values, suggest that a transaction that includes item #38, #41, or #48, or the pairs #38 and #48, #32 and #48, or #41 and #48, is more likely to include item #39. Only rule #5 has a lower lift than the others.
  2. Tell me something you learned from interpreting the rules
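    • A lift greater than 1 means the left-hand side item(s) and the right-hand side item occur together more often than would be expected if they were independent; a lift close to 1 means the rule adds little beyond the right-hand side item's overall frequency.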
  3. Show all your steps (especially in data conversion) using knitR - done in this project
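To make the lift interpretation concrete, the sketch below computes support, confidence, and lift from raw counts. The helper `rule_measures` is a hypothetical name introduced here, not an arules function. The first call assumes rule #1 has an empty left-hand side, which is how apriori reports an item's overall frequency when `minlen` is 1. The second calculation back-derives the confidence of {38} => {48} from the rounded lift of 1.07, so it is only approximate.

```r
# Counts copied from the summary(grd) output
n_trans  <- 88162
count_39 <- 50675
count_48 <- 42135

# Hypothetical helper: rule measures for LHS => RHS from transaction counts
rule_measures <- function(n_lhs, n_rhs, n_both, n) {
  support    <- n_both / n               # share of transactions with LHS and RHS
  confidence <- n_both / n_lhs           # share of LHS transactions that also have RHS
  lift       <- confidence / (n_rhs / n) # confidence relative to RHS base rate
  c(support = support, confidence = confidence, lift = lift)
}

# {} => {39}: an empty LHS matches every transaction, so lift is exactly 1
empty_lhs <- rule_measures(n_lhs = n_trans, n_rhs = count_39,
                           n_both = count_39, n = n_trans)
print(round(empty_lhs, 4))

# {38} => {48}: confidence implied by the reported (rounded) lift of 1.07
implied_conf <- 1.07 * count_48 / n_trans
print(round(implied_conf, 3))
```

The implied confidence sits just above the `conf=.5` cutoff, which is consistent with the rule appearing in the output while having a lift barely above 1.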

Part 3

Next, tell what you would like to do next with the retail data

  1. Is there a hypothesis you would like to test?

The lift of rule #2, {38} => {48}, is only 1.07, barely above the value of 1 expected under independence. A chi-squared test could check whether two events are independent: whether a transaction includes item #38, and whether it includes item #48.
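A minimal sketch of that test in base R follows. The item counts for #38 and #48 come from the summary output, but the joint count is an assumption: 7976 is the value implied by the rounded lift of 1.07, not a number read from the data. In practice it should be taken from the rules object or computed from `grd` directly.

```r
n_trans    <- 88162
count_38   <- 15596
count_48   <- 42135
count_both <- 7976  # ASSUMPTION: back-derived from lift 1.07, not an observed count

# 2x2 contingency table: rows = item 38 present?, columns = item 48 present?
contingency <- matrix(
  c(count_both,            count_38 - count_both,
    count_48 - count_both, n_trans - count_38 - count_48 + count_both),
  nrow = 2, byrow = TRUE,
  dimnames = list(has_38 = c("yes", "no"), has_48 = c("yes", "no"))
)
print(contingency)

# Pearson chi-squared test of independence (continuity-corrected for 2x2 tables)
indep_test <- chisq.test(contingency)
print(indep_test)

# Expected counts under independence, for comparison with the table above
print(round(indep_test$expected, 1))
```

With close to ninety thousand transactions, even a modest gap between observed and expected counts can be statistically significant, so the test result should be read alongside the lift rather than in place of it.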

  2. Is there data from another source you would like to add?
  3. Is there a predictive model you would like to build?

The target variable would be whether a transaction includes item #39, which is a binary variable. Input variables could be indicators for whether the transaction includes each of the other items, possibly with some interactions between them. The statistical model could be either a logistic regression or a decision tree.

  4. Anything else?
  5. Tell me what would be interesting to you to do next
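The predictive-model idea above, a binary target for item #39 with other items as inputs, could be sketched as a logistic regression. The data below are simulated purely for illustration: the column names, sample size, and coefficients are assumptions, and in a real analysis the indicator columns would be derived from `grd`.

```r
set.seed(1)

# Simulated item indicators (hypothetical, NOT the retail data)
n <- 500
has38 <- rbinom(n, 1, 0.18)
has48 <- rbinom(n, 1, 0.48)

# Simulated target: item 39 is made more likely when 38 or 48 is present
p39   <- plogis(-0.2 + 0.8 * has38 + 0.9 * has48)
has39 <- rbinom(n, 1, p39)
baskets <- data.frame(has39, has38, has48)

# Logistic regression with both main effects and their interaction
fit <- glm(has39 ~ has38 * has48, data = baskets, family = binomial)
print(summary(fit)$coefficients)

# Predicted probability of item 39 for a basket containing both 38 and 48
p_both <- predict(fit, newdata = data.frame(has38 = 1, has48 = 1),
                  type = "response")
print(p_both)
```

A decision tree would use the same indicator columns as inputs; the logistic model is shown here only because it needs no extra packages.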

Tell me about a project you would like to do with Association Analysis

  • It can be a project at work
  • Or, suppose you could download data from data.gov on healthcare, or education, or whatever
  • What would you like to do Association Analysis on if you could?

Use Association Analysis to study lung cancer deaths among non-smokers.