Sequence Mining

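Association rules can identify items that tend to occur together, but they ignore the order in which those items appear. Sequence mining is an extension of association rule mining: it discovers patterns shared across objects whose items occur in a specific order.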

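For example, a finding such as "customers who buy A go on to buy B 80% of the time" can feed a recommender system. Another application is the analysis of Web click logs in information retrieval systems, where system performance can be improved by analyzing the interaction sequences that users reveal while searching or browsing for specific information. This use is especially relevant given the massive volume of query-log data that industrial search engines collect. In biology, frequent sequence mining can be used to extract information hidden in DNA sequences.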
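To make the "buy A, then B" idea concrete, here is a minimal base R sketch on made-up purchase histories. The data and the helper `follows()` are hypothetical and only illustrate how order changes the counts; this is not how cSPADE computes its results internally.

```r
# Hypothetical purchase histories: each element is one customer's
# purchases in time order (toy data, not taken from any real log).
histories <- list(
  c("A", "B", "C"),
  c("A", "C", "B"),
  c("B", "A"),
  c("A", "D", "B"),
  c("C", "D")
)

# TRUE if item `x` occurs somewhere before a later occurrence of item `y`.
follows <- function(s, x, y) {
  i <- which(s == x)
  j <- which(s == y)
  length(i) > 0 && length(j) > 0 && min(i) < max(j)
}

has_A    <- vapply(histories, function(s) "A" %in% s, logical(1))
A_then_B <- vapply(histories, follows, logical(1), x = "A", y = "B")
B_then_A <- vapply(histories, follows, logical(1), x = "B", y = "A")

# Share of all customers whose history contains A followed by B.
support_AB <- mean(A_then_B)
# Share of customers who bought A and later bought B.
confidence_AB <- sum(A_then_B) / sum(has_A)

support_AB
confidence_AB
sum(B_then_A)  # the reverse order is counted separately
```

An unordered association rule would treat "A with B" and "B with A" as the same itemset, whereas the two orderings above are counted separately.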

Data

First, we may need to transform the data.
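As a rough sketch of what such a transformation could look like, the block below turns a long-format event table into an arules `transactions` object carrying `sequenceID` and `eventID` in its transaction info, which is the layout `cspade()` works on. The `events` data frame and its values are hypothetical, and the sorting step reflects my understanding that rows should be ordered by sequence and then by event.

```r
library(arules)

# Hypothetical event log: one row per (sequence, event, item).
events <- data.frame(
  sequenceID = c(1L, 1L, 1L, 2L, 2L),
  eventID    = c(10L, 10L, 20L, 15L, 30L),
  item       = c("A", "B", "C", "A", "C"),
  stringsAsFactors = FALSE
)

# Order rows by sequenceID, then eventID (assumed requirement).
events <- events[order(events$sequenceID, events$eventID), ]

# Build one basket per (sequenceID, eventID) pair, keeping that order.
key     <- paste(events$sequenceID, events$eventID, sep = "_")
key     <- factor(key, levels = unique(key))
baskets <- split(events$item, key)

# Coerce the list of baskets to transactions and attach the identifiers.
trans <- as(baskets, "transactions")
info  <- unique(events[, c("sequenceID", "eventID")])
transactionInfo(trans)$sequenceID <- info$sequenceID
transactionInfo(trans)$eventID    <- info$eventID

inspect(trans)
```

For data already stored in a text file, `read_baskets()` from arulesSequences is an alternative worth checking in the package documentation.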

arulesSequences

This package implements the cSPADE algorithm.

library(arulesSequences)

library(papeR)

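# Load the example sequence data shipped with arulesSequences
data(zaki)
# Mine frequent sequences with a minimum support of 0.3
s0 <- cspade(zaki, parameter = list(support = 0.3),
             control   = list(verbose = TRUE))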

## 
## parameter specification:
## support : 0.3
## maxsize :  10
## maxlen  :  10
## 
## algorithmic control:
## bfstype  : FALSE
## verbose  :  TRUE
## summary  : FALSE
## tidLists : FALSE
## 
## preprocessing ... 1 partition(s), 0 MB [0.046s]
## mining transactions ... 0 MB [0.016s]
## reading sequences ... [0.014s]
## 
## total elapsed time: 0.076s

summary(s0)

## set of 18 sequences with
## 
## most frequent items:
##       A       B       F       D (Other) 
##      11      10      10       8      28 
## 
## most frequent elements:
##     {A}     {D}     {B}     {F}   {B,F} (Other) 
##       8       8       4       4       4       3 
## 
## element (sequence) size distribution:
## sizes
## 1 2 3 
## 8 7 3 
## 
## sequence length distribution:
## lengths
## 1 2 3 4 
## 4 8 5 1 
## 
## summary of quality measures:
##     support      
##  Min.   :0.5000  
##  1st Qu.:0.5000  
##  Median :0.5000  
##  Mean   :0.6528  
##  3rd Qu.:0.7500  
##  Max.   :1.0000  
## 
## includes transaction ID lists: FALSE 
## 
## mining info:
##  data ntransactions nsequences support
##  zaki            10          4     0.3
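The summary reports a minimum support of 0.5 even though the threshold was 0.3. A possible reading, assuming the threshold is applied as a minimum number of sequences, is that with only 4 sequences a pattern must appear in at least 2 of them. The sketch below checks that arithmetic and lists the mined sequences as a data frame; `seqs` and `min_count` are names introduced here for illustration.

```r
library(arulesSequences)

data(zaki)
s0 <- cspade(zaki, parameter = list(support = 0.3))

# Coerce the mined sequences to a data frame, one row per sequence.
seqs <- as(s0, "data.frame")

# Show the sequences with the highest support first.
seqs[order(-seqs$support), ]

# With 4 sequences, a 0.3 threshold rounds up to "at least 2 of 4".
min_count <- ceiling(0.3 * 4)
min_count / 4
```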

Inspecting the rules

# Induce temporal rules from the frequent sequences (minimum confidence 0.05)
r1 <- as(ruleInduction(s0, confidence = 0.05,
                       control = list(verbose = TRUE)),
         "data.frame")


r1

##                    rule support confidence lift
## 1        <{D}> => <{F}>     0.5        1.0  1.0
## 2      <{D}> => <{B,F}>     0.5        1.0  1.0
## 3        <{D}> => <{B}>     0.5        1.0  1.0
## 4        <{B}> => <{A}>     0.5        0.5  0.5
## 5        <{D}> => <{A}>     0.5        1.0  1.0
## 6        <{F}> => <{A}>     0.5        0.5  0.5
## 7    <{D},{F}> => <{A}>     0.5        1.0  1.0
## 8      <{B,F}> => <{A}>     0.5        0.5  0.5
## 9  <{D},{B,F}> => <{A}>     0.5        1.0  1.0
## 10   <{D},{B}> => <{A}>     0.5        1.0  1.0
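Because the rules are now an ordinary data frame, they can be filtered with base R. The sketch below retypes the table printed above so that it runs on its own; the 0.8 confidence cutoff and the name `strong` are arbitrary choices for illustration.

```r
# Rules retyped from the printed output above.
r1 <- data.frame(
  rule = c("<{D}> => <{F}>", "<{D}> => <{B,F}>", "<{D}> => <{B}>",
           "<{B}> => <{A}>", "<{D}> => <{A}>", "<{F}> => <{A}>",
           "<{D},{F}> => <{A}>", "<{B,F}> => <{A}>",
           "<{D},{B,F}> => <{A}>", "<{D},{B}> => <{A}>"),
  support    = rep(0.5, 10),
  confidence = c(1, 1, 1, 0.5, 1, 0.5, 1, 0.5, 1, 1),
  lift       = c(1, 1, 1, 0.5, 1, 0.5, 1, 0.5, 1, 1),
  stringsAsFactors = FALSE
)

# Keep rules whose confidence reaches an (arbitrary) cutoff of 0.8.
strong <- r1[r1$confidence >= 0.8, ]

# Rank the remaining rules by confidence, then lift.
strong[order(-strong$confidence, -strong$lift), ]
```

Note that every rule here has a lift of at most 1. Under the usual reading of lift, that means none of the left-hand sides makes the right-hand side more likely than its baseline frequency, so high confidence alone should be interpreted with care on such a small data set.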

In addition, TraMineR is an R package for visualizing sequence data.
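A minimal sketch of what that could look like is shown below. It assumes state sequences stored one row per subject and one column per time point, which is a different layout from the itemset sequences mined above; the `states` data frame is made up for illustration.

```r
library(TraMineR)

# Hypothetical state sequences: one row per subject, one column per time point.
states <- data.frame(
  t1 = c("A", "A", "B", "A"),
  t2 = c("A", "B", "B", "C"),
  t3 = c("B", "B", "C", "C"),
  stringsAsFactors = FALSE
)

# Build a state sequence object from the table.
sq <- seqdef(states)

# Index plot: one horizontal bar per sequence.
seqiplot(sq)

# State distribution plot: share of each state at each time point.
seqdplot(sq)
```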

MiLin

4/15/2022
