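This is a practice exercise in market-basket analysis for personalized recommendation: association rules are used to mine valuable relationships between items from a large volume of data.
First install the "arulesViz" package. The Groceries dataset serves as the example: it is supermarket shopping data in which each row is one customer's purchase record, stored in the transactions format.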
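The commands that produce the package messages below are not shown. A minimal sketch, assuming a standard CRAN setup (the exact calls used by the author are an assumption), might look like this:

```r
# install.packages("arulesViz")  # one-time install; arules is pulled in as a dependency
library(arulesViz)   # also attaches arules, which provides apriori() and inspect()
data("Groceries")    # example transactions dataset shipped with arules
```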
## Loading required package: Matrix
##
## Attaching package: 'arules'
##
## The following objects are masked from 'package:base':
##
## %in%, write
##
## Loading required package: grid
##
## Attaching package: 'arulesViz'
##
## The following object is masked from 'package:base':
##
## abbreviate
Convert the data to a "data.frame" to take a look at the first few records:
df.Gro = as(Groceries, "data.frame")
head(df.Gro)
## items
## 1 {citrus fruit,semi-finished bread,margarine,ready soups}
## 2 {tropical fruit,yogurt,coffee}
## 3 {whole milk}
## 4 {pip fruit,yogurt,cream cheese ,meat spreads}
## 5 {other vegetables,whole milk,condensed milk,long life bakery product}
## 6 {whole milk,butter,yogurt,rice,abrasive cleaner}
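Next, we mine association rules with the Apriori algorithm. For the basic concepts and core steps of the algorithm, see http://blog.csdn.net/lizhengnanhua/article/details/9061755
Here the minimum support is set to 0.1% and the minimum confidence to 50%. To illustrate what the two measures mean:
A support of 3% for {milk, bread} means that 3% of all customers bought milk and bread together.
A confidence of 40% for {milk} => {bread} means that 40% of the customers who bought milk also bought bread.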
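To make these definitions concrete, here is a small worked example in base R. The five baskets and the `support()` helper are made up for illustration and are not part of arules or the Groceries data.

```r
# Toy baskets (hypothetical, for illustration only)
baskets <- list(
  c("milk", "bread"),
  c("milk", "bread", "butter"),
  c("milk"),
  c("bread"),
  c("milk", "eggs")
)

# Fraction of baskets that contain every item in 'items'
support <- function(items, baskets) {
  mean(vapply(baskets, function(b) all(items %in% b), logical(1)))
}

supp_both  <- support(c("milk", "bread"), baskets)  # 2 of 5 baskets = 0.4
supp_milk  <- support("milk", baskets)              # 4 of 5 baskets = 0.8
supp_bread <- support("bread", baskets)             # 3 of 5 baskets = 0.6

conf <- supp_both / supp_milk   # confidence of {milk} => {bread} = 0.5
lift <- conf / supp_bread       # lift below 1: milk buyers buy bread less often than average

c(support = supp_both, confidence = conf, lift = lift)
```

In the `apriori()` call that follows, `support` and `confidence` are minimum thresholds on these same quantities.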
rules <- apriori(Groceries, parameter=list(support=0.001, confidence=0.5))
##
## parameter specification:
## confidence minval smax arem aval originalSupport support minlen maxlen
## 0.5 0.1 1 none FALSE TRUE 0.001 1 10
## target ext
## rules FALSE
##
## algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## apriori - find association rules with the apriori algorithm
## version 4.21 (2004.05.09) (c) 1996-2004 Christian Borgelt
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[169 item(s), 9835 transaction(s)] done [0.02s].
## sorting and recoding items ... [157 item(s)] done [0.00s].
## creating transaction tree ... done [0.01s].
## checking subsets of size 1 2 3 4 5 6 done [0.05s].
## writing ... [5668 rule(s)] done [0.01s].
## creating S4 object ... done [0.02s].
rules
## set of 5668 rules
Sort the rules by lift and inspect the top 10. Many other interest measures exist, for example all-confidence, cosine, conviction, Jaccard, leverage, and collective strength.
inspect(head(sort(rules, by ="lift"),10))
## lhs rhs support confidence lift
## 1 {Instant food products,
## soda} => {hamburger meat} 0.001220 0.6316 19.00
## 2 {soda,
## popcorn} => {salty snack} 0.001220 0.6316 16.70
## 3 {flour,
## baking powder} => {sugar} 0.001017 0.5556 16.41
## 4 {ham,
## processed cheese} => {white bread} 0.001932 0.6333 15.05
## 5 {whole milk,
## Instant food products} => {hamburger meat} 0.001525 0.5000 15.04
## 6 {other vegetables,
## curd,
## yogurt,
## whipped/sour cream} => {cream cheese } 0.001017 0.5882 14.83
## 7 {processed cheese,
## domestic eggs} => {white bread} 0.001118 0.5238 12.44
## 8 {tropical fruit,
## other vegetables,
## yogurt,
## white bread} => {butter} 0.001017 0.6667 12.03
## 9 {hamburger meat,
## yogurt,
## whipped/sour cream} => {butter} 0.001017 0.6250 11.28
## 10 {tropical fruit,
## other vegetables,
## whole milk,
## yogurt,
## domestic eggs} => {butter} 0.001017 0.6250 11.28
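Some of the other measures mentioned above can be derived from the three columns in this table. The sketch below recomputes leverage and conviction for the first three rules by hand, using the standard definitions leverage = supp(X and Y) - supp(X) * supp(Y) and conviction = (1 - supp(Y)) / (1 - conf). The numbers are copied from the printed output, so results are only as precise as that rounding allows.

```r
# Values copied from the first three rows of the inspect() output above
supp <- c(0.001220, 0.001220, 0.001017)
conf <- c(0.6316,   0.6316,   0.5556)
lift <- c(19.00,    16.70,    16.41)

# Recover the marginal supports from the definitions
supp_rhs <- conf / lift   # lift = conf / supp(rhs)
supp_lhs <- supp / conf   # conf = supp / supp(lhs)

leverage   <- supp - supp_lhs * supp_rhs   # excess co-occurrence over independence
conviction <- (1 - supp_rhs) / (1 - conf)  # values above 1 indicate a useful rule

data.frame(supp_lhs, supp_rhs, leverage, conviction)
```

For the full rule set, arules provides `interestMeasure()`, which computes these and other measures directly from the rules and the transactions.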
Visualize the rules as a grouped matrix:
plot(rules, method="grouped")
Keep only the rules with confidence above 80% and display them as a grouped matrix as well:
subrules <- rules[quality(rules)$confidence > 0.8]
plot(subrules, method="grouped")
Graph-based visualizations
Select the 10 rules with the highest lift and plot them as a graph:
subrules2 <- head(sort(rules, by="lift"), 10)
plot(subrules2, method="graph")
plot(subrules2, method="graph",control=list(type="items"))
Parallel coordinates plot
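A parallel coordinates plot visualizes multidimensional data: each dimension is displayed separately along the x-axis, while the y-axis is shared. The first plot below shows the 10 rules in parallel coordinates; the width of each arrow represents support and the intensity of its color represents confidence.
With reorder=TRUE, the items are rearranged using a simple heuristic:
1. Randomly choose two items and swap them if the swap improves the objective function.
2. Repeat step 1 until no improvement has been found for a predefined number of tries.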
plot(subrules2, method="paracoord")
plot(subrules2, method="paracoord", control=list(reorder=TRUE))
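The random-swap heuristic described above can be sketched in a few lines of base R. This is not the arulesViz implementation: the objective function `path_cost()` (total vertical distance travelled by all rule paths), the function `reorder_items()`, and the `max_tries` parameter are all assumptions chosen for illustration.

```r
# Hypothetical objective: total vertical distance covered by all rule paths.
# 'paths' is a list of character vectors (lhs items followed by the rhs item).
path_cost <- function(item_order, paths) {
  pos <- setNames(seq_along(item_order), item_order)
  sum(vapply(paths, function(p) as.numeric(sum(abs(diff(pos[p])))), numeric(1)))
}

reorder_items <- function(item_order, paths, max_tries = 200) {
  best  <- path_cost(item_order, paths)
  tries <- 0
  while (tries < max_tries) {
    idx <- sample(length(item_order), 2)   # step 1: pick two items at random
    candidate <- item_order
    candidate[idx] <- candidate[rev(idx)]  # swap their positions
    cost <- path_cost(candidate, paths)
    if (cost < best) {                     # keep the swap only if it improves
      item_order <- candidate
      best  <- cost
      tries <- 0                           # step 2: reset the failure counter
    } else {
      tries <- tries + 1
    }
  }
  item_order
}

set.seed(1)
paths <- list(c("soda", "popcorn", "salty snack"),
              c("flour", "baking powder", "sugar"),
              c("ham", "processed cheese", "white bread"))
start <- c("soda", "flour", "ham", "popcorn", "baking powder",
           "processed cheese", "salty snack", "sugar", "white bread")

result <- reorder_items(start, paths)
c(before = path_cost(start, paths), after = path_cost(result, paths))
```

Because a swap is accepted only when it lowers the cost, the result is never worse than the starting order, but it can stop in a local optimum.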
Doubledecker
A doubledecker plot is used to analyze a single rule. Here one rule is drawn at random from the rule set:
oneRule <- sample(rules, 1)
inspect(oneRule)
## lhs rhs support confidence lift
## 1 {citrus fruit,
## pip fruit,
## yogurt} => {tropical fruit} 0.001627 0.5 4.765
plot(oneRule, method="doubledecker", data = Groceries)