Association Rules and Visualization

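This is a practice exercise in market-basket analysis for personalized recommendation: association rules are used to mine valuable relationships between items from a large volume of data.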

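First install the package ("arulesViz"). We use the Groceries data set as an example: supermarket shopping data in which each row is one customer's purchase record, stored in the transactions format.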
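The loading code itself is not shown in this write-up; the messages that follow are what attaching the package typically prints. A minimal sketch of that step, assuming the package is installed from CRAN and that Groceries is loaded with data():

```r
# Assumption: one-time install from CRAN (also installs arules as a dependency)
# install.packages("arulesViz")

library(arulesViz)   # attaches arules along with the plotting methods
data(Groceries)      # example transactions data set shipped with arules
```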

## Loading required package: Matrix
## 
## Attaching package: 'arules'
## 
## The following objects are masked from 'package:base':
## 
##     %in%, write
## 
## Loading required package: grid
## 
## Attaching package: 'arulesViz'
## 
## The following object is masked from 'package:base':
## 
##     abbreviate

Convert the data to a "data.frame":

df.Gro = as(Groceries, "data.frame")
head(df.Gro)

##                                                                   items
## 1              {citrus fruit,semi-finished bread,margarine,ready soups}
## 2                                        {tropical fruit,yogurt,coffee}
## 3                                                          {whole milk}
## 4                         {pip fruit,yogurt,cream cheese ,meat spreads}
## 5 {other vegetables,whole milk,condensed milk,long life bakery product}
## 6                      {whole milk,butter,yogurt,rice,abrasive cleaner}

Next, we mine association rules with the Apriori algorithm. For its basic concepts and core steps, see: http://blog.csdn.net/lizhengnanhua/article/details/9061755
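As a rough illustration of the level-wise idea behind Apriori, here is a simplified sketch in base R. The toy baskets and the function name apriori_sketch are hypothetical, it only finds frequent itemsets (it does not generate rules), and it skips the subset-pruning step of the textbook algorithm, so it should not be read as what arules does internally.

```r
# Hypothetical toy baskets, not taken from Groceries
baskets <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("milk"),
  c("bread"),
  c("milk", "butter")
)

apriori_sketch <- function(baskets, min_support) {
  # Fraction of baskets containing every item of the itemset
  supp <- function(items) {
    mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
  }
  items <- sort(unique(unlist(baskets)))
  # Level 1: frequent single items
  level <- Filter(function(s) supp(s) >= min_support, as.list(items))
  frequent <- level
  while (length(level) > 1) {
    k <- length(level[[1]]) + 1
    cands <- list()
    # Join step: unions of two frequent itemsets that have exactly k items
    for (i in seq_len(length(level) - 1)) {
      for (j in (i + 1):length(level)) {
        u <- sort(union(level[[i]], level[[j]]))
        if (length(u) == k) cands[[paste(u, collapse = ",")]] <- u
      }
    }
    # Keep only candidates that are themselves frequent
    level <- unname(Filter(function(s) supp(s) >= min_support, cands))
    frequent <- c(frequent, level)
  }
  frequent
}

frequent <- apriori_sketch(baskets, min_support = 0.3)
print(frequent)
```

The point of the level-wise search is that an itemset can only be frequent if all of its subsets are, so larger candidates are built only from itemsets that already passed the support threshold.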

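We set the minimum support to 0.1% and the minimum confidence to 50%.

To illustrate what the two measures mean: a support of 3% for {milk, bread} would mean that 3% of customers buy milk and bread together.

A confidence of 40% for milk => bread would mean that 40% of the customers who buy milk also buy bread.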
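The two definitions can be made concrete on a handful of baskets. The following is a minimal base R sketch; the baskets and the helper names support and confidence are made up for illustration and are not part of arules.

```r
# Hypothetical toy baskets, not taken from Groceries
baskets <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("milk"),
  c("bread"),
  c("milk", "butter")
)

# Support: fraction of baskets that contain every item in `items`
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

# Confidence of lhs => rhs: support of both sides divided by support of lhs
confidence <- function(lhs, rhs, baskets) {
  support(c(lhs, rhs), baskets) / support(lhs, baskets)
}

supp_mb <- support(c("milk", "bread"), baskets)   # 2 of 5 baskets
conf_mb <- confidence("milk", "bread", baskets)   # 2 of the 4 milk baskets
print(c(support = supp_mb, confidence = conf_mb))
```

Here milk and bread appear together in 2 of 5 baskets (support 0.4), and 2 of the 4 baskets that contain milk also contain bread (confidence 0.5).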

rules <- apriori(Groceries, parameter = list(support = 0.001, confidence = 0.5))

## 
## parameter specification:
##  confidence minval smax arem  aval originalSupport support minlen maxlen
##         0.5    0.1    1 none FALSE            TRUE   0.001      1     10
##  target   ext
##   rules FALSE
## 
## algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09)        (c) 1996-2004   Christian Borgelt
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.02s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 done [0.05s].
## writing ... [5668 rule(s)] done [0.01s].
## creating S4 object  ... done [0.02s].

rules

## set of 5668 rules

Sort the rules by lift and inspect the top 10. Many other interest measures exist, such as all-confidence, cosine, conviction, Jaccard, leverage, and collective strength.

inspect(head(sort(rules, by = "lift"), 10))

##    lhs                        rhs               support confidence  lift
## 1  {Instant food products,                                              
##     soda}                  => {hamburger meat} 0.001220     0.6316 19.00
## 2  {soda,                                                               
##     popcorn}               => {salty snack}    0.001220     0.6316 16.70
## 3  {flour,                                                              
##     baking powder}         => {sugar}          0.001017     0.5556 16.41
## 4  {ham,                                                                
##     processed cheese}      => {white bread}    0.001932     0.6333 15.05
## 5  {whole milk,                                                         
##     Instant food products} => {hamburger meat} 0.001525     0.5000 15.04
## 6  {other vegetables,                                                   
##     curd,                                                               
##     yogurt,                                                             
##     whipped/sour cream}    => {cream cheese }  0.001017     0.5882 14.83
## 7  {processed cheese,                                                   
##     domestic eggs}         => {white bread}    0.001118     0.5238 12.44
## 8  {tropical fruit,                                                     
##     other vegetables,                                                   
##     yogurt,                                                             
##     white bread}           => {butter}         0.001017     0.6667 12.03
## 9  {hamburger meat,                                                     
##     yogurt,                                                             
##     whipped/sour cream}    => {butter}         0.001017     0.6250 11.28
## 10 {tropical fruit,                                                     
##     other vegetables,                                                   
##     whole milk,                                                         
##     yogurt,                                                             
##     domestic eggs}         => {butter}         0.001017     0.6250 11.28
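The lift column can be sanity-checked by hand. Lift is the confidence of a rule divided by the support of its right-hand side, so confidence / lift should give the same right-hand-side support for every rule that predicts the same item. A small base R check, using only the rounded values printed above for the two {hamburger meat} rules (rules 1 and 5):

```r
# Values copied from the printed output for rules 1 and 5 (rhs = {hamburger meat})
rule_confidence <- c(0.6316, 0.5000)
rule_lift       <- c(19.00, 15.04)

# lift = confidence / support(rhs)  =>  support(rhs) = confidence / lift
implied_rhs_support <- rule_confidence / rule_lift
print(round(implied_rhs_support, 4))   # both come out at about 0.0332
```

Both rules imply that roughly 3.3% of transactions contain hamburger meat, so a lift of 19 says that baskets matching the left-hand side of rule 1 contain it about 19 times as often as an average basket.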

Visualize the rules as a grouped matrix:

plot(rules, method = "grouped")

[Figure: grouped matrix plot of the rules]

Keep only the rules with confidence above 80% and show them as a grouped matrix:

subrules <- rules[quality(rules)$confidence > 0.8]
plot(subrules, method = "grouped")

[Figure: grouped matrix plot of the rules with confidence above 0.8]

Graph-based visualizations

For the graph-based visualizations, select the 10 rules with the highest lift:

subrules2 <- head(sort(rules, by = "lift"), 10)
plot(subrules2, method = "graph")

[Figure: graph of the 10 rules with the highest lift]

plot(subrules2, method = "graph", control = list(type = "items"))

[Figure: graph of the same rules with type = "items"]

Parallel coordinates plot

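A parallel coordinates plot visualizes multidimensional data: each dimension is placed at its own position along the x-axis, and all dimensions share the y-axis. The plots below show the 10 rules in parallel coordinates; the width of an arrow represents support and the intensity of its color represents confidence.

With reorder, the items are rearranged as follows:

1. Pick two items at random and swap them if the swap improves the objective function.
2. Repeat step 1 until a predefined number of attempts brings no further improvement.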
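The random-swap procedure can be sketched in a few lines of base R. This is only an illustration of the two steps: the function names, the stopping parameter max_failures, and the stand-in objective (total jump between neighbouring positions, lower is better) are assumptions, not the objective or code that arulesViz uses.

```r
# Hypothetical objective: total jump between neighbouring values (lower is better)
total_jump <- function(ord) sum(abs(diff(ord)))

reorder_by_swaps <- function(ord, objective, max_failures = 200) {
  best <- objective(ord)
  failures <- 0
  # Stop once max_failures attempts in a row have brought no improvement
  while (failures < max_failures) {
    idx <- sample(length(ord), 2)     # step 1: pick two positions at random
    candidate <- ord
    candidate[idx] <- ord[rev(idx)]   # swap the two items
    score <- objective(candidate)
    if (score < best) {               # keep the swap only if it improves
      ord <- candidate
      best <- score
      failures <- 0
    } else {
      failures <- failures + 1
    }
  }
  ord
}

set.seed(1)
start  <- c(3, 1, 4, 2, 5)
result <- reorder_by_swaps(start, total_jump)
print(result)
print(c(before = total_jump(start), after = total_jump(result)))
```

Because a swap is accepted only when it improves the objective, the result is never worse than the starting order, but it can be a local optimum rather than the best possible ordering.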

plot(subrules2, method = "paracoord")

[Figure: parallel coordinates plot of the 10 rules]

plot(subrules2, method = "paracoord", control = list(reorder = TRUE))

[Figure: parallel coordinates plot of the 10 rules after reordering]

Doubledecker

A doubledecker plot is used to analyze a single rule. Here one rule is drawn at random:

oneRule <- sample(rules, 1)
inspect(oneRule)

##   lhs               rhs               support confidence  lift
## 1 {citrus fruit,                                              
##    pip fruit,                                                 
##    yogurt}       => {tropical fruit} 0.001627        0.5 4.765

plot(oneRule, method = "doubledecker", data = Groceries)

[Figure: doubledecker plot of the selected rule]

JiahuiChen

Sunday, July 06, 2014