Analysis perspectives

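Questions to answer:

How many cases (process instances) does the log contain? (case/trace)
How many events (tasks) does the log contain? (event)
How many resources does the log contain?
How many activities are there?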

library(bupaR)
library(eventdataR)  # provides the patients and sepsis example logs

## 
## Attaching package: 'bupaR'

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:utils':
## 
##     timestamp

patients

## # Log of 5442 events consisting of:
## 7 traces 
## 500 cases 
## 2721 instances of 7 activities 
## 7 resources 
## Events occurred from 2017-01-02 11:41:53 until 2018-05-05 07:16:02 
##  
## # Variables were mapped as follows:
## Case identifier:     patient 
## Activity identifier:     handling 
## Resource identifier:     employee 
## Activity instance identifier:    handling_id 
## Timestamp:           time 
## Lifecycle transition:        registration_type 
## 
## # A tibble: 5,442 × 7
##    handling     patient employee handling_id regist…¹ time                .order
##    <fct>        <chr>   <fct>    <chr>       <fct>    <dttm>               <int>
##  1 Registration 1       r1       1           start    2017-01-02 11:41:53      1
##  2 Registration 2       r1       2           start    2017-01-02 11:41:53      2
##  3 Registration 3       r1       3           start    2017-01-04 01:34:05      3
##  4 Registration 4       r1       4           start    2017-01-04 01:34:04      4
##  5 Registration 5       r1       5           start    2017-01-04 16:07:47      5
##  6 Registration 6       r1       6           start    2017-01-04 16:07:47      6
##  7 Registration 7       r1       7           start    2017-01-05 04:56:11      7
##  8 Registration 8       r1       8           start    2017-01-05 04:56:11      8
##  9 Registration 9       r1       9           start    2017-01-06 05:58:54      9
## 10 Registration 10      r1       10          start    2017-01-06 05:58:54     10
## # … with 5,432 more rows, and abbreviated variable name ¹registration_type

The log contains 5442 events, 7 traces, 500 cases, 7 activities, and 7 resources. Events span 2017-01-02 11:41:53 to 2018-05-05 07:16:02.
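As a rough illustration of what these counts mean, the sketch below recomputes them for a tiny made-up event log in base R. The data frame, its column names, and its values are hypothetical; rows are assumed to be in chronological order within each case.

```r
# Hypothetical toy event log: one row per event
toy_log <- data.frame(
  case     = c("c1", "c1", "c1", "c2", "c2", "c3", "c3", "c3"),
  activity = c("Register", "Triage", "Discharge",
               "Register", "Discharge",
               "Register", "Triage", "Discharge"),
  resource = c("r1", "r2", "r3", "r1", "r3", "r1", "r2", "r3"),
  stringsAsFactors = FALSE
)

n_events     <- nrow(toy_log)
n_cases      <- length(unique(toy_log$case))
n_activities <- length(unique(toy_log$activity))
n_resources  <- length(unique(toy_log$resource))

# A trace is the ordered activity sequence of a case;
# cases that share the same sequence share one trace
traces   <- tapply(toy_log$activity, toy_log$case, paste, collapse = ",")
n_traces <- length(unique(traces))

print(c(events = n_events, cases = n_cases, traces = n_traces,
        activities = n_activities, resources = n_resources))
```

Cases c1 and c3 follow the same activity sequence, so three cases yield only two traces.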

To understand the framework, this one figure is all you need.

Perspective 1: organization

In a process log, the person who executes the work is usually recorded in the resource attribute. Questions of interest include:

1. Who executes the work? Which people perform which activities?
2. Who specializes in certain work?
3. Is there a risk of brain drain?
4. Who transfers work to whom?

Resource frequency

library(edeaR)
library(bupaR)
# Log level: distribution of resource frequencies across the whole event log
resource_frequency(sepsis,level = "log") %>% head()

## # A tibble: 1 × 8
##     min    q1 median  mean    q3   max st_dev   iqr
##   <int> <dbl>  <dbl> <dbl> <dbl> <int>  <dbl> <dbl>
## 1     1  33.5     58  585.  206.  8111  1684.  173.

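# Case level: how many distinct resources appear in each case and how their frequencies are distributed.
# For example, case A involves 5 distinct resources, and the busiest of them appears 15 times.
# The other statistics summarise the per-resource counts within the case.
# To verify, tabulate the resources of a single case by hand.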
resource_frequency(sepsis,level = "case")%>% head()

## # A tibble: 6 × 11
##   case_id nr_of_resour…¹   min    q1  mean median    q3   max st_dev   iqr total
##   <chr>            <int> <int> <dbl> <dbl>  <dbl> <dbl> <int>  <dbl> <dbl> <int>
## 1 A                    5     1     1  4.4       1  4       15   6.07  3       22
## 2 AA                   3     1     2  2.67      3  3.5      4   1.53  1.5      8
## 3 AAA                  6     1     1  1.83      1  2.5      4   1.33  1.5     11
## 4 AB                   3     1     2  2.67      3  3.5      4   1.53  1.5      8
## 5 ABA                  5     1     1  3.4       1  4       10   3.91  3       17
## 6 AC                   6     1     1  2.17      1  3.25     5   1.83  2.25    13
## # … with abbreviated variable name ¹nr_of_resources
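The hand check mentioned for case A can be sketched in base R. The resource labels r1 to r5 are hypothetical; only the per-resource counts (15, 4, 1, 1, 1) are chosen so that they agree with the row for case A in the output above.

```r
# Hypothetical resource column of one case: 22 events spread over 5 resources
case_a_resources <- c(rep("r1", 15), rep("r2", 4), "r3", "r4", "r5")

# Number of events handled by each resource within the case
counts <- as.integer(table(case_a_resources))

# The same summary statistics as the case-level columns
stats <- c(
  nr_of_resources = length(counts),
  min    = min(counts),
  q1     = quantile(counts, 0.25, names = FALSE),
  mean   = mean(counts),
  median = median(counts),
  q3     = quantile(counts, 0.75, names = FALSE),
  max    = max(counts),
  st_dev = sd(counts),
  iqr    = IQR(counts),
  total  = sum(counts)
)
print(round(stats, 2))
```

Every statistic in the row is therefore a summary of how often each resource appears inside that one case.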

# Activity level: how many distinct resources performed each activity, and the distribution of their frequencies
resource_frequency(sepsis,level = "activity")%>% head()

## # A tibble: 6 × 11
##   activity     nr_of…¹   min    q1   mean median     q3   max st_dev   iqr total
##   <fct>          <int> <int> <dbl>  <dbl>  <dbl>  <dbl> <int>  <dbl> <dbl> <int>
## 1 Admission IC       4     1    7    29.2   31     53.2    54   28.2  46.2   117
## 2 Admission NC      20     1   17    59.1   40.5   68.2   216   62.7  51.2  1182
## 3 CRP                1  3262 3262  3262   3262   3262    3262   NA     0    3262
## 4 ER Registra…       2    65  295   525    525    755     985  651.  460    1050
## 5 ER Sepsis T…       2    65  295.  524.   524.   754.    984  650.  460.   1049
## 6 ER Triage          1  1053 1053  1053   1053   1053    1053   NA     0    1053
## # … with abbreviated variable name ¹nr_of_resources

# Resource level: absolute count per resource and its share of the total (relative = count / total count)
resource_frequency(sepsis,level = "resource")%>% head()

## # A tibble: 6 × 3
##   resource absolute relative
##   <fct>       <int>    <dbl>
## 1 B            8111   0.533 
## 2 A            3462   0.228 
## 3 C            1053   0.0692
## 4 E             782   0.0514
## 5 ?             294   0.0193
## 6 F             216   0.0142

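# Resource-activity level: count and share of each resource-activity pair.
# relative_resource is the pair's count divided by the resource's total count;
# relative_activity is the pair's count divided by the activity's total count.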
resource_frequency(sepsis,level = "resource-activity")%>% head()

## # A tibble: 6 × 5
##   resource activity         absolute relative_resource relative_activity
##   <fct>    <fct>               <int>             <dbl>             <dbl>
## 1 B        Leucocytes           3383             0.417             1    
## 2 B        CRP                  3262             0.402             1    
## 3 B        LacticAcid           1466             0.181             1    
## 4 C        ER Triage            1053             1                 1    
## 5 A        ER Registration       985             0.285             0.938
## 6 A        ER Sepsis Triage      984             0.284             0.938
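The two relative columns can be checked with plain arithmetic. The sketch below only reuses counts printed in the outputs of this section (8111 events for resource B, 1050 instances of ER Registration); the variable names are made up for the example.

```r
# Counts copied from the resource-activity output for resource B
pair_counts_B <- c(Leucocytes = 3383, CRP = 3262, LacticAcid = 1466)
total_B <- 8111   # absolute frequency of resource B at the resource level

# relative_resource: share of B's work spent on each activity
relative_resource_B <- pair_counts_B / total_B
print(round(relative_resource_B, 3))

# relative_activity: resource A performed 985 of the 1050 ER Registration instances
relative_activity_A <- 985 / 1050
print(round(relative_activity_A, 3))
```

The three pair counts for B add up to B's total of 8111, which suggests that B performs only these three activities.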

Together, these levels show how work is distributed over resources from several angles.

Resource involvement

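# Case level: the absolute and relative number of distinct resources that execute activities in each case.
# This shows which cases are handled by few resources and which need many; the latter points to more variation in the process.
# absolute is the number of distinct resources in the case; relative is that number divided by the total number of distinct resources in the log.
# Case A, for example, involves 5 distinct resources, so its relative value is 5/26.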
resource_involvement(sepsis,level = "case")%>% head()

## # A tibble: 6 × 3
##   case_id absolute relative
##   <chr>      <int>    <dbl>
## 1 ES             9    0.346
## 2 GN             9    0.346
## 3 MN             9    0.346
## 4 NGA            9    0.346
## 5 PGA            9    0.346
## 6 XI             9    0.346

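# Resource level: the absolute and relative number of cases each resource is involved in (note: cases, not events),
# indicating which resources are more "necessary" to the process than others.
# absolute is the number of cases the resource takes part in; relative is that number divided by the total number of cases (1050 here).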
resource_involvement(sepsis,level = "resource")%>% head()

## # A tibble: 6 × 3
##   resource absolute relative
##   <fct>       <int>    <dbl>
## 1 C            1050    1    
## 2 B            1013    0.965
## 3 A             985    0.938
## 4 E             782    0.745
## 5 ?             294    0.28 
## 6 F             200    0.190

# The results at the resource-activity level look questionable.

resource specialisation

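This analysis checks whether resources specialise in particular activities. The metric gives an overview of which resources perform certain activities more than others.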

# Log level: how many distinct activities a resource handles. At most, one resource handles 5 different activities.
resource_specialisation(log = sepsis,level = "log")%>% head()

## # A tibble: 1 × 8
##     min    q1 median  mean    q3   max st_dev   iqr
##   <int> <dbl>  <dbl> <dbl> <dbl> <int>  <dbl> <dbl>
## 1     1     1      1  1.62     2     5   1.13     1

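# Activity level: the absolute and relative number of distinct resources that execute each activity in the complete log.
# absolute counts the distinct resources that performed the activity; relative is that count divided by the total number of resources.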
resource_specialisation(log = sepsis,level = "activity")%>% head()

## # A tibble: 6 × 3
##   activity         absolute relative
##   <fct>               <int>    <dbl>
## 1 Admission NC           20   0.769 
## 2 Admission IC            4   0.154 
## 3 ER Registration         2   0.0769
## 4 ER Sepsis Triage        2   0.0769
## 5 IV Antibiotics          2   0.0769
## 6 IV Liquid               2   0.0769

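# Resource level: how many distinct activities each resource takes part in.
# absolute is the number of distinct activities; relative is that number divided by the total number of activities.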

resource_specialisation(log = sepsis,level = "resource")%>% head()

## # A tibble: 6 × 3
##   resource absolute relative
##   <fct>       <int>    <dbl>
## 1 E               5    0.312
## 2 A               4    0.25 
## 3 L               4    0.25 
## 4 B               3    0.188
## 5 J               2    0.125
## 6 K               2    0.125

Handover of work

resource_map() creates a resource map of an event log based on handover of work.

library(processmapR)  # provides resource_map(), resource_matrix(), process_map()
resource_map(log = sepsis)

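Resource precedence matrix: resource_matrix() works on the same principle as the precedence matrix function and gives a more compact representation of the handover of work.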

sepsis %>%
    resource_matrix() %>%
    plot()

Perspective 2: control flow

entry and exit points
length of case/trace coverage
presence of activity
rework

Some visualization methods:

1. process map
2. trace explorer
3. precedence matrix

Start and end activities

# Log level: number of distinct activities that start at least one case, and their share of all activities (6 of 16 here)
start_activities(log = sepsis,level = "log")%>% head()

## # A tibble: 1 × 2
##   absolute relative
##      <int>    <dbl>
## 1        6    0.375

# Case level: the start activity of each case
start_activities(log = sepsis,level = "case")%>% head()

## # A tibble: 6 × 2
##   case_id activity       
##   <chr>   <fct>          
## 1 A       ER Registration
## 2 AA      ER Registration
## 3 AAA     ER Registration
## 4 AB      ER Registration
## 5 ABA     ER Registration
## 6 AC      ER Registration

# Activity level: how often each activity starts a case, and its share of all case starts
start_activities(log = sepsis,level = "activity")%>% head()

## # A tibble: 6 × 4
##   activity         absolute relative cum_sum
##   <fct>               <int>    <dbl>   <dbl>
## 1 ER Registration       995  0.948     0.948
## 2 Leucocytes             18  0.0171    0.965
## 3 IV Liquid              14  0.0133    0.978
## 4 CRP                    10  0.00952   0.988
## 5 ER Sepsis Triage        7  0.00667   0.994
## 6 ER Triage               6  0.00571   1
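The activity-level columns can be reproduced by hand. The following base R sketch uses a hypothetical four-case log (names and values are made up) and assumes the rows are sorted by time within each case.

```r
# Hypothetical toy log, chronologically ordered within each case
toy_log <- data.frame(
  case     = c("c1", "c1", "c2", "c2", "c3", "c3", "c4"),
  activity = c("ER Registration", "ER Triage",
               "ER Registration", "CRP",
               "CRP", "ER Registration",
               "ER Registration"),
  stringsAsFactors = FALSE
)

# The start activity of a case is its first recorded activity
first_activity <- tapply(toy_log$activity, toy_log$case, function(x) x[1])

# absolute: number of cases starting with each activity, most frequent first
absolute <- sort(table(first_activity), decreasing = TRUE)
# relative: share of all cases; cum_sum: running total of the shares
relative <- absolute / length(first_activity)
cum_sum  <- cumsum(relative)

print(data.frame(activity = names(absolute),
                 absolute = as.integer(absolute),
                 relative = as.numeric(relative),
                 cum_sum  = as.numeric(cum_sum)))
```

Because every case has exactly one start activity, the relative values always sum to 1.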

# Resource level: which resources perform the first activity of a case, with count and share
start_activities(log = sepsis,level = "resource")%>% head()

## # A tibble: 4 × 4
##   resource absolute relative cum_sum
##   <fct>       <int>    <dbl>   <dbl>
## 1 A             954  0.909     0.909
## 2 L              62  0.0590    0.968
## 3 B              28  0.0267    0.994
## 4 C               6  0.00571   1

# Resource-activity level: count and share of each resource-activity pair that starts a case
start_activities(log = sepsis,level = "resource-activity")%>% head()

## # A tibble: 6 × 5
##   resource activity         absolute relative cum_sum
##   <fct>    <fct>               <int>    <dbl>   <dbl>
## 1 A        ER Registration       933  0.889     0.889
## 2 L        ER Registration        62  0.0590    0.948
## 3 B        Leucocytes             18  0.0171    0.965
## 4 A        IV Liquid              14  0.0133    0.978
## 5 B        CRP                    10  0.00952   0.988
## 6 A        ER Sepsis Triage        7  0.00667   0.994

# End activities are analysed in the same way
end_activities(log = sepsis,level = "log")%>% head()

## # A tibble: 1 × 2
##   absolute relative
##      <int>    <dbl>
## 1       14    0.875

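Trace coverage

Trace coverage analyses the structure of the log through trace frequencies. The coverage of a trace is how often it occurs, expressed as a count and as a share of all cases.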

# Log level: distribution of the relative trace frequencies
trace_coverage(log = sepsis,level = "log")%>% head()

## # A tibble: 1 × 8
##        min       q1   median    mean       q3    max  st_dev   iqr
##      <dbl>    <dbl>    <dbl>   <dbl>    <dbl>  <dbl>   <dbl> <dbl>
## 1 0.000952 0.000952 0.000952 0.00118 0.000952 0.0333 0.00168     0

# Trace level: frequency of each trace. The first trace occurs 35 times, so its relative frequency is 35 / 1050 cases = 0.0333.
trace_coverage(log = sepsis,level = "trace")%>% head()

## # A tibble: 6 × 4
##   trace                                                  absol…¹ relat…² cum_sum
##   <chr>                                                    <int>   <dbl>   <dbl>
## 1 ER Registration,ER Triage,ER Sepsis Triage                  35 0.0333   0.0333
## 2 ER Registration,ER Triage,ER Sepsis Triage,Leucocytes…      24 0.0229   0.0562
## 3 ER Registration,ER Triage,ER Sepsis Triage,CRP,Leucoc…      22 0.0210   0.0771
## 4 ER Registration,ER Triage,ER Sepsis Triage,CRP,Lactic…      13 0.0124   0.0895
## 5 ER Registration,ER Triage,ER Sepsis Triage,Leucocytes…      11 0.0105   0.1   
## 6 ER Registration,ER Triage,ER Sepsis Triage,Leucocytes…       9 0.00857  0.109 
## # … with abbreviated variable names ¹absolute, ²relative
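A minimal base R sketch of the same bookkeeping, using five hypothetical cases whose traces are already written as comma-separated strings (all names and values are made up):

```r
# Hypothetical trace per case
case_traces <- c(c1 = "A,B,C", c2 = "A,C", c3 = "A,B,C", c4 = "A,B,C", c5 = "B,C")

# absolute: number of cases following each trace, most frequent first
trace_counts <- sort(table(case_traces), decreasing = TRUE)

coverage <- data.frame(
  trace    = names(trace_counts),
  absolute = as.integer(trace_counts),
  relative = as.numeric(trace_counts) / length(case_traces)  # share of all cases
)
coverage$cum_sum <- cumsum(coverage$relative)
print(coverage)
```

Note that the denominator is the number of cases, not the number of distinct traces.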

# Case level: same as above, but shows the trace that each case follows
trace_coverage(log = sepsis,level = "case")%>% head()

## # A tibble: 6 × 4
##   case_id trace                                      absolute relative
##   <chr>   <chr>                                         <int>    <dbl>
## 1 Q       ER Registration,ER Triage,ER Sepsis Triage       35   0.0333
## 2 SKA     ER Registration,ER Triage,ER Sepsis Triage       35   0.0333
## 3 HU      ER Registration,ER Triage,ER Sepsis Triage       35   0.0333
## 4 YU      ER Registration,ER Triage,ER Sepsis Triage       35   0.0333
## 5 CDA     ER Registration,ER Triage,ER Sepsis Triage       35   0.0333
## 6 HMA     ER Registration,ER Triage,ER Sepsis Triage       35   0.0333

trace length

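Trace length gives an overview of the number of activities that occur in each trace. Note that the metric counts every activity instance, not the individual lifecycle events.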

# Log level
trace_length(log = sepsis,level = "log")%>% head()

## # A tibble: 1 × 8
##     min    q1 median  mean    q3   max st_dev   iqr
##   <int> <dbl>  <dbl> <dbl> <dbl> <int>  <dbl> <dbl>
## 1     3     9     13  14.5    16   185   11.5     7

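# Trace level: length of each distinct trace
# The third column is the trace's relative frequency. The fourth column, relative_to_median, is the trace length divided by
# the median trace length; the values shown are consistent with a median of 14 (185 / 14 = 13.2).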
trace_length(log = sepsis,level = "trace")%>% head()

## # A tibble: 6 × 4
##   trace                                                  absol…¹ relat…² relat…³
##   <chr>                                                    <int>   <dbl>   <dbl>
## 1 ER Registration,ER Triage,ER Sepsis Triage,IV Liquid,…     185 9.52e-4   13.2 
## 2 ER Registration,ER Triage,ER Sepsis Triage,IV Liquid,…     170 9.52e-4   12.1 
## 3 ER Registration,CRP,Leucocytes,ER Triage,ER Sepsis Tr…     118 9.52e-4    8.43
## 4 Leucocytes,CRP,ER Registration,ER Triage,ER Sepsis Tr…      88 9.52e-4    6.29
## 5 ER Registration,ER Triage,ER Sepsis Triage,IV Liquid,…      84 9.52e-4    6   
## 6 IV Liquid,ER Registration,ER Triage,ER Sepsis Triage,…      66 9.52e-4    4.71
## # … with abbreviated variable names ¹absolute, ²relative_trace_frequency,
## #   ³relative_to_median

# Case level: trace length per case id
trace_length(log = sepsis,level = "case")%>% head()

## # A tibble: 6 × 2
##   case_id absolute
##   <chr>      <int>
## 1 NGA          185
## 2 KM           170
## 3 OD           118
## 4 GK            88
## 5 YX            84
## 6 ZMA           66

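Repetitions: repeat and redo

This metric provides statistics on the number of repetitions. A repetition is an execution of an activity within a case when that activity has already been executed before, with one or more other activities executed in between. The documentation explains the difference between repeat and redo:

repeat: repetitions are activity executions of the same activity type that are executed not immediately following each other, but by the same resource. In other words, the activity recurs non-consecutively and is executed by the same resource.
redo: repetitions are activity executions of the same activity type that are executed not immediately following each other and by a different resource than the first activity occurrence of this activity type. In other words, the resource differs from the one that first executed the activity.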

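# The metric reports summary statistics on the number of repetitions within a case. Each group of two or more events of the
# same activity, executed by the same resource but not immediately after each other, counts as one repetition of that activity.
# In short, a repetition means the same activity is handled two or more times by the same resource, without the occurrences being consecutive.

# Log level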
number_of_repetitions(log = sepsis,level = "log")%>% head()

## # A tibble: 1 × 8
##     min    q1 median  mean    q3   max st_dev   iqr
##   <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1     0     0      2  1.64     3     5   1.28     3

# Case level. relative is the number of repetitions divided by the trace length; for example, KM has 5 repetitions and a trace length of 170, so 5 / 170 = 0.02941176.
number_of_repetitions(log = sepsis,level = "case")%>% head()

##   case_id absolute   relative
## 1      KM        5 0.02941176
## 2      MN        5 0.15151515
## 3      XI        5 0.12820513
## 4      AD        4 0.13793103
## 5      AE        4 0.22222222
## 6     BFA        4 0.19047619

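# Activity level. relative is the activity's number of repetitions divided by its total number of occurrences; for example, CRP has 692 repetitions out of 3262 occurrences, so 692 / 3262 = 0.2121398.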
number_of_repetitions(log = sepsis,level = "activity")%>% head()

##       activity absolute    relative
## 1          CRP      692 0.212139792
## 2   Leucocytes      674 0.199231451
## 3 Admission NC      177 0.149746193
## 4   LacticAcid      170 0.115961801
## 5 Admission IC        6 0.051282051
## 6    ER Triage        3 0.002849003

# Resource level: number and share of repetitions per resource
number_of_repetitions(log = sepsis,level = "resource")%>% head()

##   first_resource absolute   relative
## 1              B     1536 0.18937246
## 2              G       67 0.45270270
## 3              F       16 0.07407407
## 4              R       13 0.22807018
## 5              I       12 0.09523810
## 6              Q       11 0.17460317

# Resource-activity level
number_of_repetitions(log = sepsis,level = "resource-activity")%>% head()

##   resource     activity absolute relative_activity relative_resource
## 1        B          CRP      692       0.212139792       0.085316237
## 2        B   Leucocytes      674       0.199231451       0.083097029
## 3        B   LacticAcid      170       0.115961801       0.020959191
## 4        F Admission NC        6       0.005076142       0.027777778
## 5        C    ER Triage        3       0.002849003       0.002849003
## 6        T Admission NC        3       0.002538071       0.085714286

Self-loops

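This metric provides statistics on the number of self-loops in a trace. Activity instances of the same activity type that are executed several times immediately after each other by the same resource are in a self-loop ("length-1 loop"). If the same resource executes an activity of the same type three times in a row, this is defined as a self-loop of size 2.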
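The difference between a self-loop and a repetition can be sketched with run-length encoding in base R. This is a simplification under stated assumptions: the activity sequence is hypothetical, and the resource condition (same or different resource) is ignored, so the sketch only shows the consecutive versus non-consecutive distinction and is not how edeaR computes its metrics.

```r
# Activities of one hypothetical case, in chronological order
activities <- c("CRP", "CRP", "CRP", "Leucocytes", "CRP", "LacticAcid")

# Collapse consecutive identical activities into runs
runs <- rle(activities)

# A run of k identical consecutive activities is one self-loop of size k - 1
selfloop_sizes <- runs$lengths[runs$lengths > 1] - 1
n_selfloops    <- length(selfloop_sizes)

# A repetition: an activity starts a new run although it already occurred in an earlier run
n_repetitions <- sum(duplicated(runs$values))

print(c(selfloops = n_selfloops, repetitions = n_repetitions))
print(selfloop_sizes)
```

Here the three consecutive CRP executions form one self-loop of size 2, and the later CRP, separated by Leucocytes, counts as one repetition.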

# Self-loop results are read in the same way as repetition results

# Log level: self-loops overall
number_of_selfloops(log = sepsis,level = "log")%>% head()

## # A tibble: 1 × 8
##     min    q1 median  mean    q3   max st_dev   iqr
##   <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1     0     0      0 0.827     1    33   1.82     1

# Case level: self-loops per case
number_of_selfloops(log = sepsis,level = "case")%>% head()

##   case_id absolute  relative
## 1     NGA       33 0.1783784
## 2      KM       18 0.1058824
## 3      OD       17 0.1440678
## 4      GK       15 0.1704545
## 5      YX       11 0.1309524
## 6      JS        7 0.1794872

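# At the activity level, the absolute and relative number of self-loops per activity can indicate which activities cause the most waste in the process.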

number_of_selfloops(log = sepsis,level = "activity")%>% head()

##          activity absolute    relative
## 1      Leucocytes      333 0.098433343
## 2             CRP      293 0.089822195
## 3    Admission NC      168 0.142131980
## 4      LacticAcid       73 0.049795362
## 5    Admission IC        1 0.008547009
## 6 ER Registration        0 0.000000000

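# At the resource level, the metric shows which resources most often have to repeat their own work within a case, or whose work has to be redone by another resource in the same case.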

number_of_selfloops(log = sepsis,level = "resource")%>% head()

##   first_resource absolute   relative
## 1              B      699 0.08617926
## 2              G       67 0.45270270
## 3              N       11 0.23913043
## 4              Q       11 0.17460317
## 5              F       10 0.04629630
## 6              O       10 0.05376344

# At the resource-activity level, the metric shows which activities matter most for which resources.

number_of_selfloops(log = sepsis,level = "resource-activity")%>% head()

##   first_resource     activity absolute relative_activity relative_resource
## 1              B   Leucocytes      333       0.098433343       0.041055357
## 2              B          CRP      293       0.089822195       0.036123783
## 3              B   LacticAcid       73       0.049795362       0.009000123
## 4              G Admission NC       67       0.056683587       0.452702703
## 5              N Admission NC       11       0.009306261       0.239130435
## 6              Q Admission NC       11       0.009306261       0.174603175

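Activity presence

Computes, for each activity type, the percentage of cases in which it occurs.

# absolute is the number of cases in which the activity occurs; relative is that number divided by the total number of cases. This is not the same as activity frequency.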
activity_presence(log = sepsis)%>% head()

## # A tibble: 6 × 3
##   activity         absolute relative
##   <fct>               <int>    <dbl>
## 1 ER Registration      1050    1    
## 2 ER Triage            1050    1    
## 3 ER Sepsis Triage     1049    0.999
## 4 Leucocytes           1012    0.964
## 5 CRP                  1007    0.959
## 6 LacticAcid            860    0.819

# Activity frequency: how often each activity occurs, as a count and as a share of all activity instances
activity_frequency(log = sepsis,level = "activity")%>% head()

## # A tibble: 6 × 3
##   activity        absolute relative
##   <fct>              <int>    <dbl>
## 1 Leucocytes          3383   0.222 
## 2 CRP                 3262   0.214 
## 3 LacticAcid          1466   0.0964
## 4 Admission NC        1182   0.0777
## 5 ER Triage           1053   0.0692
## 6 ER Registration     1050   0.0690

Visualization

process map

process_map(log = sepsis)

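trace explorer

Visualizes the different activity sequences in the log. With the type argument it can explore either frequent or infrequent traces. The coverage argument specifies how much of the log to explore.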

trace_explorer(log = sepsis,coverage = 0.2)

# coverage: traces are shown, most frequent first, until their cumulative share of cases reaches this value

Precedence matrix: shows which activities directly follow which.

precedence_matrix(sepsis)

## Warning: `precedence_matrix()` was deprecated in processmapR 0.4.0.
## ℹ Please use `process_matrix()` instead.

## # A tibble: 135 × 3
##    antecedent       consequent         n
##    <fct>            <fct>          <int>
##  1 ER Sepsis Triage Admission IC       1
##  2 Release A        CRP                1
##  3 Admission NC     IV Antibiotics     2
##  4 Admission NC     ER Triage          1
##  5 LacticAcid       Release B          4
##  6 IV Liquid        Release A          2
##  7 Leucocytes       Return ER          1
##  8 Release A        Leucocytes         1
##  9 Admission NC     Release D          1
## 10 IV Liquid        Admission IC       3
## # … with 125 more rows

plot(precedence_matrix(sepsis))

Perspective 3: time

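Throughput time

Provides summary statistics on the throughput time of cases. Throughput time is the time from the start of a case to its end.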

# Log level: distribution of throughput time across the whole event log
throughput_time(log = sepsis,level = "log")%>% head()

## # A tibble: 1 × 8
##   min           q1            median        mean        q3    max   st_dev iqr  
##   <drtn>        <drtn>        <drtn>        <drtn>      <drt> <drt>  <dbl> <drt>
## 1 2.033333 mins 926.5875 mins 7694.475 mins 40995.85 m… 2591… 6081… 87174. 2498…

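# Trace level: distribution of throughput time per trace. relative_trace_frequency is the number of cases following the trace divided by the total number of cases.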
throughput_time(log = sepsis,level = "trace")%>% head()

## # A tibble: 6 × 11
##   trace          relat…¹ min   q1    mean  median q3    max   st_dev   iqr total
##   <chr>            <dbl> <drt> <drt> <drt> <drtn> <drt> <drt>  <dbl> <dbl> <drt>
## 1 ER Registrati… 0.0333    2.…   8.…  24.…  17.3…  30.… 105.…   24.5  21.8  848…
## 2 ER Registrati… 0.0229    7.…  17.…  33.…  27.0…  47.…  76.…   21.2  29.5  808…
## 3 ER Registrati… 0.0210    6.…  15.…  27.…  21.3…  31.…  76.…   18.3  16.6  605…
## 4 ER Registrati… 0.0124   19.…  45.… 122.…  78.4… 189.… 382.…  105.  144.  1588…
## 5 ER Registrati… 0.0105    6.…  19.…  30.…  25.8…  32.…  78.…   20.2  13.2  332…
## 6 ER Registrati… 0.00857 130.… 159.… 214.… 170.7… 244.… 373.…   79.0  85.4 1931…
## # … with abbreviated variable name ¹relative_trace_frequency

# Case level: throughput time per case

throughput_time(log = sepsis,level = "case")%>% head()

##   case_id throughput_time
## 1      HG   608146.5 mins
## 2     VAA   607486.8 mins
## 3      ZS   574450.1 mins
## 4      LS   571634.5 mins
## 5     UBA   557236.7 mins
## 6     HDA   525518.6 mins

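Processing time

Provides summary statistics on the processing time of the process. It can only be computed when activity instances have both start and end timestamps; the all-zero results below suggest that sepsis records only one event per activity instance.

# Log level: distribution of processing time per case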

processing_time(log = sepsis,level = "log")%>% head()

## # A tibble: 1 × 8
##   min    q1     median mean   q3     max    st_dev iqr   
##   <drtn> <drtn> <drtn> <drtn> <drtn> <drtn>  <dbl> <drtn>
## 1 0 secs 0 secs 0 secs 0 secs 0 secs 0 secs      0 0 secs

# Trace level

processing_time(log = sepsis,level = "trace")%>% head()

## # A tibble: 6 × 11
##   trace          relat…¹ min   q1    mean  median q3    max   st_dev   iqr total
##   <chr>            <dbl> <drt> <drt> <drt> <drtn> <drt> <drt>  <dbl> <dbl> <drt>
## 1 ER Registrati… 0.0333  0 se… 0 se… 0 se… 0 secs 0 se… 0 se…      0     0 0 se…
## 2 ER Registrati… 0.0229  0 se… 0 se… 0 se… 0 secs 0 se… 0 se…      0     0 0 se…
## 3 ER Registrati… 0.0210  0 se… 0 se… 0 se… 0 secs 0 se… 0 se…      0     0 0 se…
## 4 ER Registrati… 0.0124  0 se… 0 se… 0 se… 0 secs 0 se… 0 se…      0     0 0 se…
## 5 ER Registrati… 0.0105  0 se… 0 se… 0 se… 0 secs 0 se… 0 se…      0     0 0 se…
## 6 ER Registrati… 0.00857 0 se… 0 se… 0 se… 0 secs 0 se… 0 se…      0     0 0 se…
## # … with abbreviated variable name ¹relative_frequency

# Other available levels: activity, resource, resource-activity

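Idle time

Like processing time, idle time can only be computed when activity instances have both start and end timestamps. The results are similar.

Visualization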

dotted_chart(sepsis)

LogbuildR

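Although bupaR can read event logs from XES files, practitioners usually have to start from raw data and make sure it is converted into an event log correctly.

logbuildR provides a graphical interface that guides the user through the steps of building an event log.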

# devtools::install_github("bupaverse/logbuildR")

Calling build_log launches a Shiny app. In testing so far, setting the timestamp has always produced an error.

daqapo

This package checks the quality of an event log. Its authors provide fairly detailed documentation in the vignette below.

https://cran.r-project.org/web/packages/daqapo/vignettes/Introduction-to-DaQAPO.html

processcheckR

The goal of processcheckR is to support rule-based conformance checking. The following declarative rules can currently be checked:

Cardinality rules

absent: activity does not occur more than n - 1 times,
contains: activity occurs n times or more,
contains_between: activity occurs between n and m times,
contains_exactly: activity occurs exactly n times.

Ordering rules

starts: case starts with activity,
ends: case ends with activity,
succession: if activity A happens, B should happen after. If B happens, A should have happened before,
response: if activity A happens, B should happen after,
precedence: if activity B happens, A should have happened before,
responded_existence: if activity A happens, B should also (have) happen(ed) (i.e. before or after A).

Exclusiveness rules

and: two activities always exist together,
xor: two activities are not allowed to exist together.

library(bupaR)
 
library(processcheckR)

## 
## Attaching package: 'processcheckR'

## The following object is masked from 'package:base':
## 
##     xor

sepsis %>%
  # Check if cases starts with "ER Registration".
  check_rule(starts("ER Registration"), label = "r1") %>%
  # Check if activities "CRP" and "LacticAcid" occur together.
  check_rule(and("CRP","LacticAcid"), label = "r2") %>%
  group_by(r1, r2) %>%
  n_cases()

## # A tibble: 4 × 3
##   r1    r2    n_cases
##   <lgl> <lgl>   <int>
## 1 FALSE FALSE      10
## 2 FALSE TRUE       45
## 3 TRUE  FALSE     137
## 4 TRUE  TRUE      858

HeuristicsmineR

Step 1: build the causal net

library(heuristicsmineR)

## 
## Attaching package: 'heuristicsmineR'

## The following object is masked from 'package:processmapR':
## 
##     precedence_matrix

library(eventdataR)
data(patients)

# Dependency graph / matrix
dependency_matrix(patients)

##                        consequent
## antecedent              Blood test Check-out Discuss Results       End
##   Blood test             0.0000000 0.0000000       0.0000000 0.0000000
##   Check-out              0.0000000 0.0000000       0.0000000 0.9979716
##   Discuss Results        0.0000000 0.9979716       0.0000000 0.0000000
##   End                    0.0000000 0.0000000       0.0000000 0.0000000
##   MRI SCAN               0.0000000 0.0000000       0.9957806 0.0000000
##   Registration           0.0000000 0.0000000       0.0000000 0.0000000
##   Start                  0.0000000 0.0000000       0.0000000 0.0000000
##   Triage and Assessment  0.9957983 0.0000000       0.0000000 0.0000000
##   X-Ray                  0.0000000 0.0000000       0.9961538 0.0000000
##                        consequent
## antecedent               MRI SCAN Registration Start Triage and Assessment
##   Blood test            0.9957806     0.000000     0              0.000000
##   Check-out             0.0000000     0.000000     0              0.000000
##   Discuss Results       0.0000000     0.000000     0              0.000000
##   End                   0.0000000     0.000000     0              0.000000
##   MRI SCAN              0.0000000     0.000000     0              0.000000
##   Registration          0.0000000     0.000000     0              0.998004
##   Start                 0.0000000     0.998004     0              0.000000
##   Triage and Assessment 0.0000000     0.000000     0              0.000000
##   X-Ray                 0.0000000     0.000000     0              0.000000
##                        consequent
## antecedent                  X-Ray
##   Blood test            0.0000000
##   Check-out             0.0000000
##   Discuss Results       0.0000000
##   End                   0.0000000
##   MRI SCAN              0.0000000
##   Registration          0.0000000
##   Start                 0.0000000
##   Triage and Assessment 0.9961832
##   X-Ray                 0.0000000
## attr(,"class")
## [1] "dependency_matrix" "matrix"            "array"

# Causal graph / Heuristics net
causal_net(patients)

## Nodes
## # A tibble: 9 × 12
##   act    from_id     n n_dis…¹ bindi…² bindi…³ label color…⁴ shape fontc…⁵ color
##   <chr>    <int> <dbl>   <dbl> <list>  <list>  <chr>   <dbl> <chr> <chr>   <chr>
## 1 Blood…       1   237     237 <list>  <list>  "Blo…     237 rect… black   black
## 2 Check…       2   492     492 <list>  <list>  "Che…     492 rect… white   black
## 3 Discu…       3   495     495 <list>  <list>  "Dis…     495 rect… white   black
## 4 End          4   500     500 <list>  <list>  "End"     500 circ… brown4  brow…
## 5 MRI S…       5   236     236 <list>  <list>  "MRI…     236 rect… black   black
## 6 Regis…       6   500     500 <list>  <list>  "Reg…     500 rect… white   black
## 7 Start        7   500     500 <list>  <list>  "Sta…     500 circ… chartr… char…
## 8 Triag…       8   500     500 <list>  <list>  "Tri…     500 rect… white   black
## 9 X-Ray        9   261     261 <list>  <list>  "X-R…     261 rect… black   black
## # … with 1 more variable: tooltip <chr>, and abbreviated variable names
## #   ¹n_distinct_cases, ²bindings_input, ³bindings_output, ⁴color_level,
## #   ⁵fontcolor
## Edges
## # A tibble: 9 × 8
##   antecedent            consequent         dep from_id to_id     n label penwi…¹
##   <chr>                 <chr>            <dbl>   <int> <int> <dbl> <chr>   <dbl>
## 1 Triage and Assessment Blood test       0.996       8     1   237 237      2.90
## 2 Discuss Results       Check-out        0.998       3     2   495 495      4.96
## 3 MRI SCAN              Discuss Results  0.996       5     3   236 236      2.89
## 4 X-Ray                 Discuss Results  0.996       9     3   261 261      3.09
## 5 Check-out             End              0.998       2     4   492 492      4.94
## 6 Blood test            MRI SCAN         0.996       1     5   237 237      2.90
## 7 Start                 Registration     0.998       7     6   500 500      5   
## 8 Registration          Triage and Asse… 0.998       6     8   500 500      5   
## 9 Triage and Assessment X-Ray            0.996       8     9   261 261      3.09
## # … with abbreviated variable name ¹penwidth

Building the causal net with L_heur_1

# Efficient precedence matrix
m <- precedence_matrix_absolute(L_heur_1)
as.matrix(m)

##           consequent
## antecedent  a  b  c  d  e End Start
##      a      0 11 11 13  5   0     0
##      b      0  0 10  0 11   0     0
##      c      0 10  0  0 11   0     0
##      d      0  0  0  4 13   0     0
##      e      0  0  0  0  0  40     0
##      End    0  0  0  0  0   0     0
##      Start 40  0  0  0  0   0     0

# Example from Process mining book
dependency_matrix(L_heur_1, threshold = .7)

##           consequent
## antecedent         a         b         c         d         e       End Start
##      a     0.0000000 0.9166667 0.9166667 0.9285714 0.8333333 0.0000000     0
##      b     0.0000000 0.0000000 0.0000000 0.0000000 0.9166667 0.0000000     0
##      c     0.0000000 0.0000000 0.0000000 0.0000000 0.9166667 0.0000000     0
##      d     0.0000000 0.0000000 0.0000000 0.8000000 0.9285714 0.0000000     0
##      e     0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.9756098     0
##      End   0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000     0
##      Start 0.9756098 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000     0
## attr(,"class")
## [1] "dependency_matrix" "matrix"            "array"
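The dependency values above agree with the textbook Heuristics Miner dependency measure: for two different activities a and b, dep(a, b) = (|a>b| - |b>a|) / (|a>b| + |b>a| + 1), and for a self-loop, dep(a, a) = |a>a| / (|a>a| + 1), where |a>b| is the number of times b directly follows a. The base R sketch below applies this formula to the absolute precedence counts printed earlier. It is a sketch of the formula, not the package's own implementation, and it assumes that entries below the threshold are simply set to 0.

```r
acts <- c("a", "b", "c", "d", "e", "End", "Start")

# Absolute precedence counts copied from the precedence_matrix_absolute() output
counts <- matrix(c(
   0, 11, 11, 13,  5,  0, 0,
   0,  0, 10,  0, 11,  0, 0,
   0, 10,  0,  0, 11,  0, 0,
   0,  0,  0,  4, 13,  0, 0,
   0,  0,  0,  0,  0, 40, 0,
   0,  0,  0,  0,  0,  0, 0,
  40,  0,  0,  0,  0,  0, 0
), nrow = 7, byrow = TRUE,
   dimnames = list(antecedent = acts, consequent = acts))

# Dependency between two different activities
dep <- (counts - t(counts)) / (counts + t(counts) + 1)
# Dependency of an activity on itself (length-1 loop)
diag(dep) <- diag(counts) / (diag(counts) + 1)

# Assumption: values below the threshold are dropped to 0
threshold <- 0.7
dep[dep < threshold] <- 0

print(round(dep, 7))
```

For instance, a is followed by b 11 times and never the other way round, giving 11 / 12 = 0.9166667, while b and c follow each other equally often (10 times each), giving 0.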

causal_net(L_heur_1, threshold = .7)

## Nodes
## # A tibble: 7 × 12
##   act   from_id     n n_dist…¹ bindi…² bindi…³ label color…⁴ shape fontc…⁵ color
##   <chr>   <int> <dbl>    <dbl> <list>  <list>  <chr>   <dbl> <chr> <chr>   <chr>
## 1 a           1    40       40 <list>  <list>  "a\n…      40 rect… white   black
## 2 b           2    21       21 <list>  <list>  "b\n…      21 rect… black   black
## 3 c           3    21       21 <list>  <list>  "c\n…      21 rect… black   black
## 4 d           4    17       13 <list>  <list>  "d\n…      17 rect… black   black
## 5 e           5    40       40 <list>  <list>  "e\n…      40 rect… white   black
## 6 End         6    40       40 <list>  <list>  "End"      40 circ… brown4  brow…
## 7 Start       7    40       40 <list>  <list>  "Sta…      40 circ… chartr… char…
## # … with 1 more variable: tooltip <chr>, and abbreviated variable names
## #   ¹n_distinct_cases, ²bindings_input, ³bindings_output, ⁴color_level,
## #   ⁵fontcolor
## Edges
## # A tibble: 10 × 8
##    antecedent consequent   dep from_id to_id     n label penwidth
##    <chr>      <chr>      <dbl>   <int> <int> <dbl> <chr>    <dbl>
##  1 Start      a          0.976       7     1    40 40         5  
##  2 a          b          0.917       1     2    21 21         3.1
##  3 a          c          0.917       1     3    21 21         3.1
##  4 a          d          0.929       1     4    13 13         2.3
##  5 d          d          0.8         4     4     4 4          1.4
##  6 a          e          0.833       1     5     5 5          1.5
##  7 b          e          0.917       2     5    21 21         3.1
##  8 c          e          0.917       3     5    21 21         3.1
##  9 d          e          0.929       4     5    13 13         2.3
## 10 e          End        0.976       5     6    40 40         5

Step 2: convert the causal net into a Petri net

# Convert to Petri net
library(petrinetR)
cn <- causal_net(L_heur_1, threshold = .7)
#cn <- causal_net(patients)
pn <- as.petrinet(cn)
render_PN(pn)

ProcessanimateR

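Animation with moving tokens can be a powerful visualization tool for understanding general process behaviour. The processanimateR package implements an animation library for bupaR that renders interactive process animations using the SVG web standard.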

# install.packages("processanimateR")
library(processanimateR)
library(eventdataR)

The basic usage is to call animate_process directly.

animate_process(patients)

The token aesthetics, such as size, shape, and colour, can also be changed.

animate_process(example_log, mapping = token_aes(size = token_scale(12), shape = "rect"))
animate_process(example_log, mapping = token_aes(color = token_scale("red")))

animate_process(patients, mode = "relative", jitter = 10, legend = "color",
  mapping = token_aes(color = token_scale("employee", 
    scale = "ordinal", 
    range = RColorBrewer::brewer.pal(7, "Paired"))))

For more advanced features, see: https://bupaverse.github.io/processanimateR/articles/

Extensions to the bupaR ecosystem: An overview

Liam

2022-12-07