Part 1

First, install the package arules

Second, load the data set using the following command:

grd <- read.transactions("http://fimi.ua.ac.be/data/retail.dat", format="basket")

Third, run the following commands and interpret the results:

itemFrequencyPlot(grd,support=.1) #run with support .2, .3, & .5
summary(grd)
inspect(grd) #you will have to stop the listing manually

Create the rules object using apriori:

grdar <- apriori(grd,parameter=list(supp=.05,conf=.5))
inspect(grdar)

require(arules)

## Loading required package: arules

## Warning: package 'arules' was built under R version 3.4.4

## Loading required package: Matrix

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

grd <- read.transactions("http://fimi.ua.ac.be/data/retail.dat", format = "basket")
itemFrequencyPlot(grd, support = .1)

itemFrequencyPlot(grd, support = .2)
itemFrequencyPlot(grd, support = .3)

itemFrequencyPlot(grd, support = .5)

summary(grd)

## transactions as itemMatrix in sparse format with
##  88162 rows (elements/itemsets/transactions) and
##  16470 columns (items) and a density of 0.0006257289 
## 
## most frequent items:
##      39      48      38      32      41 (Other) 
##   50675   42135   15596   15167   14945  770058 
## 
## element (itemset/transaction) length distribution:
## sizes
##    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15 
## 3016 5516 6919 7210 6814 6163 5746 5143 4660 4086 3751 3285 2866 2620 2310 
##   16   17   18   19   20   21   22   23   24   25   26   27   28   29   30 
## 2115 1874 1645 1469 1290 1205  981  887  819  684  586  582  472  480  355 
##   31   32   33   34   35   36   37   38   39   40   41   42   43   44   45 
##  310  303  272  234  194  136  153  123  115  112   76   66   71   60   50 
##   46   47   48   49   50   51   52   53   54   55   56   57   58   59   60 
##   44   37   37   33   22   24   21   21   10   11   10    9   11    4    9 
##   61   62   63   64   65   66   67   68   71   73   74   76 
##    7    4    5    2    2    5    3    3    1    1    1    1 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    4.00    8.00   10.31   14.00   76.00 
## 
## includes extended item information - examples:
##   labels
## 1      0
## 2      1
## 3     10

grdar <- apriori(grd, parameter = list(supp = .05, conf = .5))

## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.5    0.1    1 none FALSE            TRUE       5    0.05      1
##  maxlen target   ext
##      10  rules FALSE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 4408 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[16470 item(s), 88162 transaction(s)] done [0.22s].
## sorting and recoding items ... [6 item(s)] done [0.02s].
## creating transaction tree ... done [0.05s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [15 rule(s)] done [0.00s].
## creating S4 object  ... done [0.02s].

inspect(grdar)

##      lhs        rhs  support    confidence lift      count
## [1]  {}      => {39} 0.57479413 0.5747941  1.0000000 50675
## [2]  {38}    => {48} 0.09010685 0.5093614  1.0657723  7944
## [3]  {38}    => {39} 0.11734080 0.6633111  1.1539977 10345
## [4]  {32}    => {48} 0.09112770 0.5297026  1.1083338  8034
## [5]  {32}    => {39} 0.09590300 0.5574603  0.9698434  8455
## [6]  {41}    => {48} 0.10228897 0.6034125  1.2625621  9018
## [7]  {41}    => {39} 0.12946621 0.7637337  1.3287082 11414
## [8]  {48}    => {39} 0.33055058 0.6916340  1.2032726 29142
## [9]  {39}    => {48} 0.33055058 0.5750765  1.2032726 29142
## [10] {38,48} => {39} 0.06921349 0.7681269  1.3363513  6102
## [11] {38,39} => {48} 0.06921349 0.5898502  1.2341847  6102
## [12] {32,48} => {39} 0.06127356 0.6723923  1.1697968  5402
## [13] {32,39} => {48} 0.06127356 0.6389119  1.3368399  5402
## [14] {41,48} => {39} 0.08355074 0.8168108  1.4210493  7366
## [15] {39,41} => {48} 0.08355074 0.6453478  1.3503063  7366
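As a quick check on how the columns relate, the sketch below recomputes the support and lift of rule [7] from numbers copied out of the output above. It assumes the usual definition, lift = confidence / support of the right-hand side; the variable names are illustrative only.

```r
# Values copied from the summary(grd) and inspect(grdar) output above.
n_transactions <- 88162      # total number of transactions
count_39       <- 50675      # transactions containing item 39
count_41_39    <- 11414      # count for rule [7], {41} => {39}
conf_41_39     <- 0.7637337  # confidence for rule [7]

# Support is a share of all transactions.
support_39   <- count_39 / n_transactions
support_rule <- count_41_39 / n_transactions

# Lift compares the rule's confidence with the baseline frequency of the rhs.
lift_41_39 <- conf_41_39 / support_39

print(round(c(support_39 = support_39,
              support_rule = support_rule,
              lift = lift_41_39), 4))
```

A lift of 1 would mean that buying item 41 does not change the chance of buying item 39, which is why rule [1], with an empty left-hand side, has a lift of exactly 1.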

Part 2

Now that you have rules,…

1. Find a few interesting rules

a. Rule #1 tells us that about 57% of the transactions contain item #39.

b. Rules #3, #7, #8, #10, #12, and #14 all have item #39 on the right-hand side with a lift above 1. They tell us that when a transaction contains item #38, #41, or #48, or one of the pairs #38 and #48, #32 and #48, or #41 and #48, it is more likely than average to also contain item #39 (confidence between 66% and 82%, against a baseline of 57%).

2. Tell me something you learned from interpreting the rules

The more frequently an item appears in the transactions, the more often it shows up in the rules once a minimum support and a minimum confidence are set. For instance, item #39 has the highest frequency (57%) among all items, and it appears in 12 of the 15 rules.
A lift value greater than 1 indicates a positive association between the left-hand side item(s) and the right-hand side item(s), and the larger the lift, the stronger the association.
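The two observations above can be checked directly against the rule table. The sketch below retypes the lhs, rhs, and lift columns from the inspect(grdar) output into a plain data frame (so it runs without arules or the data download) and counts the rules that involve item 39. The data frame and its column names are only a convenience for this check.

```r
# lhs, rhs and lift copied from the inspect(grdar) output; "" is the empty lhs.
rules <- data.frame(
  lhs  = c("", "38", "38", "32", "32", "41", "41", "48", "39",
           "38,48", "38,39", "32,48", "32,39", "41,48", "39,41"),
  rhs  = c("39", "48", "39", "48", "39", "48", "39", "39", "48",
           "39", "48", "39", "48", "39", "48"),
  lift = c(1.0000000, 1.0657723, 1.1539977, 1.1083338, 0.9698434,
           1.2625621, 1.3287082, 1.2032726, 1.2032726, 1.3363513,
           1.2341847, 1.1697968, 1.3368399, 1.4210493, 1.3503063),
  stringsAsFactors = FALSE
)

# A rule involves item 39 if 39 is the rhs or one of the lhs items.
lhs_items <- strsplit(rules$lhs, ",")
has_39 <- rules$rhs == "39" |
  vapply(lhs_items, function(items) "39" %in% items, logical(1))

n_with_39 <- sum(has_39)             # number of rules involving item 39
best_rule <- which.max(rules$lift)   # row number of the highest-lift rule
below_one <- which(rules$lift < 1)   # rules with a negative association

print(n_with_39)
print(rules[best_rule, ])
print(rules[below_one, ])
```

With the real rules object, sort(grdar, by = "lift") from arules gives the same ordering without retyping anything.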

3. Show all your steps (especially in data conversion) using knitR

Part 3

Next, tell what you would like to do next with the retail data

1. Is there a hypothesis you would like to test?

Since item 39 occurs in 57% of the transactions, it would be a good idea to conduct a Chi-squared test to check whether the quantity of sales depends on item 39. If it does, then I could increase my quantity of sales by attracting people to buy item 39.
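To show what such a test looks like in R, here is a minimal sketch of a Chi-squared test of independence. Note that it swaps in a simpler question than the one above: it tests whether items 39 and 48 occur independently, because the 2x2 table for that can be built entirely from counts already printed in the output. Testing quantity of sales against item 39 would need the per-transaction basket sizes instead.

```r
# Counts copied from the summary(grd) and inspect(grdar) output.
n_total <- 88162   # all transactions
n_39    <- 50675   # transactions containing item 39
n_48    <- 42135   # transactions containing item 48
n_both  <- 29142   # transactions containing both (rules [8] and [9])

# Fill in the remaining cells of the 2x2 contingency table.
only_39 <- n_39 - n_both
only_48 <- n_48 - n_both
neither <- n_total - n_39 - n_48 + n_both

tab <- matrix(
  c(n_both, only_48, only_39, neither),
  nrow = 2,
  dimnames = list(item39 = c("yes", "no"), item48 = c("yes", "no"))
)
print(tab)

# Chi-squared test of independence between the two items.
result <- chisq.test(tab)
print(result)
```

A small p-value here would only say that the two items are not bought independently, which agrees with the lift above 1 for rules [8] and [9]. It says nothing about cause.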

2. Is there data from another source you would like to add?

It would be helpful to add the unmasked data if possible, for example, what these anonymized items really represent. The business context would also let us ask more meaningful questions of the data. Without that background information, this data set only allows us to practice the Association Analysis packages in R; it limits us from generating business insights or using the rules in the real world.

3. Is there a predictive model you would like to build?

The dependent variable would be a binary variable indicating whether a specific transaction contains item #39. Independent variables could be whether the transaction contains other items, and some possible interactions between these variables. The ideal model could be either a decision tree or a logistic regression.

4. Anything else?

5. Tell me what would be interesting to you to do next

Something that may be interesting to me would be lowering the support and confidence thresholds, to check whether there is any rule with a higher lift. If the thresholds are lowered, we could identify more rules whose right-hand side item has a low overall frequency but a high confidence once the left-hand side item is present. Such rules would have a higher lift, which indicates a stronger association between the right-hand side items and the left-hand side items.
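The effect of lowering the threshold can be sketched on a tiny example. The baskets below are hypothetical and invented purely for illustration (they are not from the retail data), and the threshold values are arbitrary; the sketch assumes the arules package is installed.

```r
library(arules)

# Hypothetical toy baskets, invented for illustration only.
# Item "a" is very common; items "d" and "e" are rare but always appear together.
baskets <- list(
  c("a", "b"), c("a", "b"), c("a", "c"), c("a"), c("a", "b", "c"),
  c("a"), c("d", "e"), c("a", "d", "e"), c("a", "b"), c("a", "c")
)
toy <- as(baskets, "transactions")

# Same confidence threshold, two different support thresholds.
strict <- apriori(toy,
                  parameter = list(supp = 0.3, conf = 0.5, minlen = 2),
                  control = list(verbose = FALSE))
loose <- apriori(toy,
                 parameter = list(supp = 0.1, conf = 0.5, minlen = 2),
                 control = list(verbose = FALSE))

# Compare how many rules each setting finds and the best lift in each.
print(c(strict = length(strict), loose = length(loose)))
inspect(head(sort(loose, by = "lift"), 3))
```

The lower support threshold can only add rules, so the best lift under the loose setting is never lower than under the strict one. On the retail data the same comparison would use grd in place of toy, at the cost of a longer run time and many more rules to read through.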

Tell me about a project you would like to do with Association Analysis. It can be a project at work. Or, suppose you could download data from data.gov on healthcare, or education, or whatever. What would you like to do Association Analysis on if you could?

ANSWER: If traffic accident data were available, I would like to run an association analysis to look for links among the drivers that caused the accidents.

Assignment #5

Chunqi Xu

2018-07-16
