library(dplyr)
library(car)
library(ggplot2)
library(pgirmess)
library(clinfun)
source("../data/load_experiment_results.r")
As we can see below, there are 3472 experiments (= 4 miners x 2 levels of infrequent paths x 7 levels of reoccurring tasks x 62 logs). For each experiment, we measured recall, precision and F1 score as the average over 10-fold cross-validation.
glimpse(results)
Observations: 3,472
Variables: 8
$ infrequentpaths <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE...
$ reoccuringtasks <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
$ tree <int> 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 6...
$ miner <fctr> Alpha +, Heuristics, ILP, Inductive, Alpha +, Heuristics, IL...
$ coverability <dbl> 9.009003, 9.009003, 9.009003, 9.009003, 9.009003, 9.009003, 9...
$ avg_recall <dbl> 1.0000000, 0.0000000, 1.0000000, 0.5502521, 0.9000000, 0.6848...
$ avg_precision <dbl> 0.4992754, 0.0000000, 1.0000000, 1.0000000, 0.4455056, 0.4458...
$ avg_f1 <dbl> 0.6660194, 0.0000000, 1.0000000, 0.7073211, 0.5959900, 0.5392...
The three plots below show the marginal distributions of recall, precision and F1.
results %>%
ggplot(aes(x=avg_recall)) +
geom_histogram(binwidth=0.01) +
ggtitle("Histogram of Recall Values")
The recall values appear to follow a U-shaped distribution, with high peaks at 0 and 1.
results %>%
ggplot(aes(x=avg_precision)) +
geom_histogram(binwidth=0.01) +
ggtitle("Histogram of Precision Values")
The precision values have three peaks (modes), at 0, 0.5 and 1.
results %>%
ggplot(aes(x=avg_f1)) +
geom_histogram(binwidth=0.01) +
ggtitle("Histogram of F1 Values")
F1 is the harmonic mean of recall and precision. Consequently, we see the three-mode pattern of precision reappearing.
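To make the definition concrete, here is a minimal sketch of the F1 computation for a single recall/precision pair. The helper name `f1_score` is hypothetical, and returning 0 when both inputs are 0 is an assumption chosen to match rows in the data where all three metrics are 0. Note that `avg_f1` in `results` is averaged over folds, so it need not equal the harmonic mean of `avg_recall` and `avg_precision` exactly.

```r
# Harmonic mean of recall and precision (hypothetical helper).
# Assumption: F1 is defined as 0 when recall and precision are both 0.
f1_score <- function(recall, precision) {
  denom <- recall + precision
  ifelse(denom == 0, 0, 2 * recall * precision / denom)
}

# The harmonic mean is pulled towards the smaller of the two values,
# so it never exceeds the arithmetic mean.
f1_score(1, 0.5)      # harmonic mean
mean(c(1, 0.5))       # arithmetic mean, for comparison
f1_score(0, 0)        # degenerate case handled explicitly
```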
It is clear that the three metrics are not normally distributed.
Now let’s do some statistical analysis and see where we get. We start by analysing recall with one of the simplest analyses, a one-way ANOVA with miner as the independent variable. The other variables are considered part of the error term.
We want to verify whether there are significant differences between the models generated by the four miners in terms of recall value.
First we check the homogeneity assumption. Levene’s test shows that this assumption is not met, so we should use Welch’s F-ratio.
leveneTest(results$avg_recall, results$miner, center=median)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 3 352.35 < 2.2e-16 ***
3468
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
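Had only the homogeneity assumption failed, Welch's F-ratio could be obtained with base R's `oneway.test`. The following is a minimal sketch on made-up data; the `toy` data frame and its group labels are illustrative assumptions, not the experiment results.

```r
set.seed(42)

# Three groups with clearly unequal variances (illustrative data only).
toy <- data.frame(
  miner = factor(rep(c("A", "B", "C"), each = 50)),
  avg_recall = c(rnorm(50, mean = 0.5, sd = 0.05),
                 rnorm(50, mean = 0.6, sd = 0.15),
                 rnorm(50, mean = 0.8, sd = 0.30))
)

# var.equal = FALSE requests the Welch correction for unequal variances.
welch <- oneway.test(avg_recall ~ miner, data = toy, var.equal = FALSE)
welch
```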
To check the normality assumption, we create an anova model and evaluate the residuals.
model1 <- aov(avg_recall ~ miner, data=results)
plot(model1, which=2)
The plot clearly illustrates that the normality assumption is not met either. So, it is best to abandon the ANOVA analysis and apply its non-parametric counterpart, the Kruskal-Wallis (KW) test. The idea behind the KW test is to rank all the observations and then test whether the average rank differs between the groups.
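The idea can be sketched on toy data by computing the test statistic directly from the ranks and comparing it with `kruskal.test`. The helper `kw_statistic` and the toy vectors are hypothetical and only serve as illustration; the formula is the standard textbook one with the usual correction for ties.

```r
# Kruskal-Wallis H computed from average ranks per group (hypothetical helper).
kw_statistic <- function(y, g) {
  g <- factor(g)
  n <- length(y)
  r <- rank(y)                          # ties receive average ranks
  mean_rank <- tapply(r, g, mean)       # average rank per group
  n_g <- tapply(r, g, length)           # group sizes
  h <- 12 / (n * (n + 1)) * sum(n_g * (mean_rank - (n + 1) / 2)^2)
  ties <- as.numeric(table(y))          # sizes of tied blocks
  h / (1 - sum(ties^3 - ties) / (n^3 - n))
}

set.seed(1)
y <- c(rnorm(20, 0), rnorm(20, 1), rnorm(20, 2))
g <- rep(c("a", "b", "c"), each = 20)

kw_statistic(y, g)
kruskal.test(y, factor(g))$statistic    # should agree with the value above
```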
Therefore, we start by ranking the discovered models based on their recall value and compare the average ranking per miner.
print("Average Rank per Miner")
[1] "Average Rank per Miner"
results$rank_recall <- rank(-results$avg_recall)
by(results$rank_recall, results$miner, mean)
results$miner: Alpha +
[1] 1974.647
-------------------------------------------------------------------
results$miner: Heuristics
[1] 1810.96
-------------------------------------------------------------------
results$miner: ILP
[1] 951.0363
-------------------------------------------------------------------
results$miner: Inductive
[1] 2209.357
Because the ranks are computed on the negated recall values, a lower average rank indicates better models. Based on the average (recall) ranking, these results show that ILP creates the best models, followed by Heuristics, Alpha+ and Inductive (in that order). To get an idea of the distribution of the ranks, we create some violin plots. These plots show the density curve, mirrored around a central vertical axis, together with three horizontal lines which represent the 25th, 50th and 75th percentiles. A violin plot is related to a boxplot, but more suitable when the distribution of the data is multimodal.
results %>%
ggplot(aes(x= miner ,y =rank_recall)) + geom_violin(scale="width", draw_quantiles = c(0.25, 0.5, 0.75)) +
ggtitle("Distribution of ranks (based on recall)")
These plots reveal, for example, that 75% of the models discovered by ILP have a higher recall value (i.e. are ranked higher) than 75% of the models discovered by Inductive miner and 50% of the models discovered by Heuristics and Alpha+.
Note that these distributions do not tell us anything about the absolute differences in recall values. Below, one can find violin plots for the distribution of the actual recall values, which clearly show a different picture. Note that our analyses are based on the ranking of discovered models rather than the actual quality of the models, and that we show these plots based on actual values merely for the purpose of illustration. All statistical claims will be about the ranking of the models.
results %>%
ggplot(aes(x= miner ,y =avg_recall)) + geom_violin(scale="width", draw_quantiles = c(0.25, 0.5, 0.75)) +
ggtitle("Distribution of recall values")
Based on the above analysis, it appears that the four miners generate models which rank on average differently (based on recall values). We will use a Kruskal-Wallis test to see if these differences in average ranking are statistically significant.
kruskal.test(avg_recall ~ miner, results)
Kruskal-Wallis rank sum test
data: avg_recall by miner
Kruskal-Wallis chi-squared = 852.25, df = 3, p-value < 2.2e-16
These results show that there are significant ranking differences between the models created by the four miners in terms of recall. The order suggested by the plot is ILP > Heuristics > Alpha+ > Inductive. We can use a post-hoc test to see which of these differences are significant.
kruskalmc(avg_recall ~ miner, results)
Multiple comparison test after Kruskal-Wallis
p.value: 0.05
Comparisons
obs.dif critical.dif difference
Alpha +-Heuristics 163.6878 126.9474 TRUE
Alpha +-ILP 1023.6112 126.9474 TRUE
Alpha +-Inductive 234.7091 126.9474 TRUE
Heuristics-ILP 859.9234 126.9474 TRUE
Heuristics-Inductive 398.3969 126.9474 TRUE
ILP-Inductive 1258.3203 126.9474 TRUE
It appears that all pairwise differences are statistically significant. Thus, based on recall values, models generated by ILP rank on average better than models generated by the Heuristics Miner, which rank on average better than models generated by the Alpha+ Miner, which in turn rank better on average than models generated by the Inductive Miner.
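The `critical.dif` column in the output above can be reproduced by hand. The sketch below assumes the usual textbook procedure for multiple comparisons after Kruskal-Wallis, in which the normal quantile is taken at a Bonferroni-type level of alpha / (k(k-1)); the numbers in the table are consistent with that assumption. The function name `kw_critical_difference` is hypothetical, and the group size of 868 follows from 3472 experiments divided evenly over 4 miners.

```r
# Critical difference in mean ranks between groups i and j (hypothetical helper).
# n_total: total number of observations, k: number of groups.
kw_critical_difference <- function(n_total, n_i, n_j, k, alpha = 0.05) {
  z <- qnorm(1 - alpha / (k * (k - 1)))   # Bonferroni-type adjusted quantile
  z * sqrt(n_total * (n_total + 1) / 12 * (1 / n_i + 1 / n_j))
}

# All four miners: compare with critical.dif = 126.9474 in the table above.
crit_all <- kw_critical_difference(3472, 868, 868, k = 4)
crit_all

# An observed difference in mean ranks is flagged TRUE when it exceeds this value.
abs(1974.647 - 1810.96) > crit_all
```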
Next, we want to test whether the presence/absence of infrequent behavior in the log has an impact on the average ranking based on recall values. For this we will split the data between experiments with infrequent behavior and experiments without infrequent behavior and verify the ranking for these two subsets. We will start the analysis for experiments where infrequent behavior was absent.
# Experiments on logs without infrequent behavior
results_freq <- results %>%
filter(!infrequentpaths)
# Experiments on logs with infrequent behavior
results_infreq <- results %>%
filter(infrequentpaths)
First, we test whether a standard ANOVA can be applied.
leveneTest(results_freq$avg_recall, results_freq$miner, center=median)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 3 387.29 < 2.2e-16 ***
1732
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model1 <- aov(avg_recall ~ miner, data=results_freq)
plot(model1, which=2)
Levene’s test and the QQ-plot reveal that the assumptions of homogeneity and normality are both violated. Therefore we will focus on the comparison of average rankings.
print("Average Rank per Miner")
[1] "Average Rank per Miner"
results_freq$rank_recall <- rank(-results_freq$avg_recall)
by(results_freq$rank_recall, results_freq$miner, mean)
results_freq$miner: Alpha +
[1] 1008.922
-------------------------------------------------------------------
results_freq$miner: Heuristics
[1] 938.4712
-------------------------------------------------------------------
results_freq$miner: ILP
[1] 481.4194
-------------------------------------------------------------------
results_freq$miner: Inductive
[1] 1045.188
results_freq %>%
ggplot(aes(x= miner ,y =rank_recall)) + geom_violin(scale="width", draw_quantiles = c(0.25, 0.5, 0.75)) +
ggtitle("Distribution of ranks (based on recall)")
These analyses show a similar ranking between the four miners. However, it appears that when there is no infrequent behavior in the log, ILP outperforms the other algorithms even more. 75% of the models generated by ILP are better than 75% of the models generated by Alpha+ and Heuristics, and all the models generated by ILP are better than 75% of the models generated by Inductive Miner.
We will use a Kruskal-Wallis test to see if these differences in average ranking are statistically significant.
kruskal.test(avg_recall ~ miner, results_freq)
Kruskal-Wallis rank sum test
data: avg_recall by miner
Kruskal-Wallis chi-squared = 407.62, df = 3, p-value < 2.2e-16
These results show that there are significant ranking differences between the models created by the four miners in terms of recall. The order suggested by the plot is ILP > Heuristics > Alpha+ > Inductive. We can use a post-hoc test to see which of these differences are significant.
kruskalmc(avg_recall ~ miner, results_freq)
Multiple comparison test after Kruskal-Wallis
p.value: 0.05
Comparisons
obs.dif critical.dif difference
Alpha +-Heuristics 70.45046 89.77831 FALSE
Alpha +-ILP 527.50230 89.77831 TRUE
Alpha +-Inductive 36.26613 89.77831 FALSE
Heuristics-ILP 457.05184 89.77831 TRUE
Heuristics-Inductive 106.71659 89.77831 TRUE
ILP-Inductive 563.76843 89.77831 TRUE
It appears that all pairwise differences are statistically significant, except for the differences between the models generated by Alpha+ on the one hand and the models generated by Inductive and Heuristics on the other hand.
Next, we repeat the analysis for the experiments where infrequent behavior was present. First, we test whether a standard ANOVA can be applied.
leveneTest(results_infreq$avg_recall, results_infreq$miner, center=median)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 3 138.85 < 2.2e-16 ***
1732
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model1 <- aov(avg_recall ~ miner, data=results_infreq)
plot(model1, which=2)
Levene’s test and the QQ-plot reveal that the assumptions of homogeneity and normality are both violated. Therefore we will focus on the comparison of average rankings.
print("Average Rank per Miner")
[1] "Average Rank per Miner"
results_infreq$rank_recall <- rank(-results_infreq$avg_recall)
by(results_infreq$rank_recall, results_infreq$miner, mean)
results_infreq$miner: Alpha +
[1] 964.6763
-------------------------------------------------------------------
results_infreq$miner: Heuristics
[1] 867.9965
-------------------------------------------------------------------
results_infreq$miner: ILP
[1] 465.9389
-------------------------------------------------------------------
results_infreq$miner: Inductive
[1] 1175.388
results_infreq %>%
ggplot(aes(x= miner ,y =rank_recall)) + geom_violin(scale="width", draw_quantiles = c(0.25, 0.5, 0.75)) +
ggtitle("Distribution of ranks (based on recall)")
These analyses again show a similar ranking between the four miners. However, it appears that the differences between ILP on the one hand and Alpha+ and Heuristics on the other hand are less extreme than in the case of logs without infrequent behavior. Yet, it is still clear that ILP outperforms the other algorithms in terms of recall.
We will use a Kruskal-Wallis test to see if these differences in average ranking are statistically significant.
kruskal.test(avg_recall ~ miner, results_infreq)
Kruskal-Wallis rank sum test
data: avg_recall by miner
Kruskal-Wallis chi-squared = 484.1, df = 3, p-value < 2.2e-16
These results show that there are significant ranking differences between the models created by the four miners in terms of recall. The order suggested by the plot is ILP > Heuristics > Alpha+ > Inductive. We can use a post-hoc test to see which of these differences are significant.
kruskalmc(avg_recall ~ miner, results_infreq)
Multiple comparison test after Kruskal-Wallis
p.value: 0.05
Comparisons
obs.dif critical.dif difference
Alpha +-Heuristics 96.67972 89.77831 TRUE
Alpha +-ILP 498.73733 89.77831 TRUE
Alpha +-Inductive 210.71198 89.77831 TRUE
Heuristics-ILP 402.05760 89.77831 TRUE
Heuristics-Inductive 307.39171 89.77831 TRUE
ILP-Inductive 709.44931 89.77831 TRUE
It appears that all pairwise differences are statistically significant.
Next, we will investigate how the performance of each mining algorithm is influenced by the level of reoccurring tasks in the log files.
results_alpha <- results %>%
filter(miner=="Alpha +")%>%
mutate(reoccuringtasks = as.factor(reoccuringtasks))
results_heuristics <- results %>%
filter(miner=="Heuristics")%>%
mutate(reoccuringtasks = as.factor(reoccuringtasks))
results_ilp <- results %>%
filter(miner=="ILP")%>%
mutate(reoccuringtasks = as.factor(reoccuringtasks))
results_inductive <- results %>%
filter(miner=="Inductive")%>%
mutate(reoccuringtasks = as.factor(reoccuringtasks))
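The four subsets above differ only in the miner name, so the same split could also be sketched in one step with base R's `split`, after which a test can be applied to every subset at once. The `toy` data frame below is an illustrative assumption standing in for `results`, with made-up miner labels and random recall values.

```r
set.seed(1)

# Illustrative stand-in for `results` (hypothetical data).
toy <- data.frame(
  miner = rep(c("A", "B"), each = 30),
  reoccuringtasks = rep(c(0, 0.1, 0.2), times = 20),
  avg_recall = runif(60)
)
toy$reoccuringtasks <- as.factor(toy$reoccuringtasks)

# One data frame per miner, stored in a named list.
by_miner <- split(toy, toy$miner)

# Kruskal-Wallis p-value per miner.
kw_p <- sapply(by_miner, function(d) {
  kruskal.test(avg_recall ~ reoccuringtasks, data = d)$p.value
})
kw_p
```

Keeping the subsets in a named list avoids repeating the same code for each miner, at the cost of slightly less explicit variable names.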
We start with the Alpha+ miner. First, we will check the assumptions for a standard ANOVA.
leveneTest(results_alpha$avg_recall, results_alpha$reoccuringtasks, center=median)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 6 5.7921 6.298e-06 ***
861
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model1 <- aov(avg_recall ~ reoccuringtasks, data= results_alpha)
plot(model1, which=2)
Both the assumptions of normality and homogeneity appear to be violated, thus we fall back to comparing the average rankings of the models instead.
print("Average Rank per level Reoccuring Tasks")
[1] "Average Rank per level Reoccuring Tasks"
results_alpha$rank_recall <- rank(-results_alpha$avg_recall)
by(results_alpha$rank_recall, results_alpha$reoccuringtasks, mean)
results_alpha$reoccuringtasks: 0
[1] 354.6653
-------------------------------------------------------------------
results_alpha$reoccuringtasks: 0.05
[1] 386.4556
-------------------------------------------------------------------
results_alpha$reoccuringtasks: 0.1
[1] 414.3145
-------------------------------------------------------------------
results_alpha$reoccuringtasks: 0.15
[1] 436.9718
-------------------------------------------------------------------
results_alpha$reoccuringtasks: 0.2
[1] 449.3226
-------------------------------------------------------------------
results_alpha$reoccuringtasks: 0.25
[1] 464.9355
-------------------------------------------------------------------
results_alpha$reoccuringtasks: 0.3
[1] 534.8347
results_alpha %>%
ggplot(aes(x= reoccuringtasks ,y =rank_recall)) + geom_violin(scale="width", draw_quantiles = c(0.25, 0.5, 0.75)) +
ggtitle("Distribution of ranks (based on recall)")
These data seem to suggest that as the level of reoccurring tasks increases, the models generated by the Alpha+ miner deteriorate. To test this impression statistically, we will rely on the KW test and the Jonckheere test. The latter tests whether there is a significant trend within the data (across increasing levels of reoccuringtasks).
kruskal.test(avg_recall~ reoccuringtasks, results_alpha)
Kruskal-Wallis rank sum test
data: avg_recall by reoccuringtasks
Kruskal-Wallis chi-squared = 43.546, df = 6, p-value = 9.094e-08
jonckheere.test(results_alpha$avg_recall, as.numeric(results_alpha$reoccuringtasks), nperm=1000)
Jonckheere-Terpstra test
data:
JT = 135510, p-value = 0.002
alternative hypothesis: two.sided
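To make the Jonckheere-Terpstra statistic more concrete: in its textbook form, JT counts, over every pair of groups taken in increasing order, how many pairs of observations have the value from the lower group below the value from the higher group, with ties counting one half. The helper `jt_statistic` below is a hypothetical sketch of that definition on toy data, not the implementation used by `clinfun`. Under the null hypothesis, the expected value is half the number of cross-group pairs; with 7 groups of 124 observations that is 21 x 124^2 / 2 = 161448, so a JT value well below that points to values decreasing as the level increases.

```r
# Textbook Jonckheere-Terpstra statistic (hypothetical helper).
# Assumption: the sorted values of g define the increasing order of the groups.
jt_statistic <- function(x, g) {
  groups <- split(x, g)
  k <- length(groups)
  jt <- 0
  for (i in seq_len(k - 1)) {
    for (j in (i + 1):k) {
      # All pairwise differences between group i and the later group j
      diffs <- outer(groups[[i]], groups[[j]], "-")
      jt <- jt + sum(diffs < 0) + 0.5 * sum(diffs == 0)
    }
  }
  jt
}

level <- rep(c(1, 2, 3), each = 2)
increasing <- c(1, 2, 3, 4, 5, 6)    # perfectly increasing with the level
decreasing <- rev(increasing)        # perfectly decreasing with the level

jt_statistic(increasing, level)      # every cross-group pair is counted
jt_statistic(decreasing, level)      # no cross-group pair is counted
```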
These analyses show that there is a statistically significant trend in the relative quality of the generated models as the amount of reoccurring tasks increases, and the average ranks above indicate that this trend is negative. Let’s verify this further by comparing different levels pairwise.
kruskalmc(avg_recall ~reoccuringtasks, data=results_alpha)
Multiple comparison test after Kruskal-Wallis
p.value: 0.05
Comparisons
obs.dif critical.dif difference
0-0.05 31.79032 96.73456 FALSE
0-0.1 59.64919 96.73456 FALSE
0-0.15 82.30645 96.73456 FALSE
0-0.2 94.65726 96.73456 FALSE
0-0.25 110.27016 96.73456 TRUE
0-0.3 180.16935 96.73456 TRUE
0.05-0.1 27.85887 96.73456 FALSE
0.05-0.15 50.51613 96.73456 FALSE
0.05-0.2 62.86694 96.73456 FALSE
0.05-0.25 78.47984 96.73456 FALSE
0.05-0.3 148.37903 96.73456 TRUE
0.1-0.15 22.65726 96.73456 FALSE
0.1-0.2 35.00806 96.73456 FALSE
0.1-0.25 50.62097 96.73456 FALSE
0.1-0.3 120.52016 96.73456 TRUE
0.15-0.2 12.35081 96.73456 FALSE
0.15-0.25 27.96371 96.73456 FALSE
0.15-0.3 97.86290 96.73456 TRUE
0.2-0.25 15.61290 96.73456 FALSE
0.2-0.3 85.51210 96.73456 FALSE
0.25-0.3 69.89919 96.73456 FALSE
While the previous analyses show that there is a statistically significant trend, the pairwise comparisons do not provide a clear picture of what this trend looks like, with many comparisons statistically insignificant. Note, however, that this does not prove that there is no difference; it could also be caused by a lack of power. The multiple comparison test applies a Bonferroni-type correction for the fact that we are performing many tests simultaneously, which often errs on the conservative side.
We continue with the Heuristics miner. First, we will check the assumptions for a standard ANOVA.
leveneTest(results_heuristics$avg_recall, results_heuristics$reoccuringtasks, center=median)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 6 3.2421 0.003715 **
861
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model1 <- aov(avg_recall ~ reoccuringtasks, data= results_heuristics)
plot(model1, which=2)
Both the assumptions of normality and homogeneity appear to be violated, thus we fall back to comparing the average rankings of the models instead.
print("Average Rank per level Reoccuring Tasks")
[1] "Average Rank per level Reoccuring Tasks"
results_heuristics$rank_recall <- rank(-results_heuristics$avg_recall)
by(results_heuristics$rank_recall, results_heuristics$reoccuringtasks, mean)
results_heuristics$reoccuringtasks: 0
[1] 363.1331
-------------------------------------------------------------------
results_heuristics$reoccuringtasks: 0.05
[1] 431.5726
-------------------------------------------------------------------
results_heuristics$reoccuringtasks: 0.1
[1] 436.5242
-------------------------------------------------------------------
results_heuristics$reoccuringtasks: 0.15
[1] 475.2702
-------------------------------------------------------------------
results_heuristics$reoccuringtasks: 0.2
[1] 439.9153
-------------------------------------------------------------------
results_heuristics$reoccuringtasks: 0.25
[1] 438.6169
-------------------------------------------------------------------
results_heuristics$reoccuringtasks: 0.3
[1] 456.4677
results_heuristics %>%
ggplot(aes(x= reoccuringtasks ,y =rank_recall)) + geom_violin(scale="width", draw_quantiles = c(0.25, 0.5, 0.75)) +
ggtitle("Distribution of ranks (based on recall)")
These data seem to suggest that as we move from no reoccurring tasks to some reoccurring tasks, the models generated by Heuristics miner deteriorate somewhat. However, as the level of reoccurring tasks increases further, there appears to be no clear trend in the further deterioration of the relative quality of the models. To test this impression statistically, we will rely on the KW test and the Jonckheere test.
kruskal.test(avg_recall~ reoccuringtasks, results_heuristics)
Kruskal-Wallis rank sum test
data: avg_recall by reoccuringtasks
Kruskal-Wallis chi-squared = 15.509, df = 6, p-value = 0.01665
jonckheere.test(results_heuristics$avg_recall, as.numeric(results_heuristics$reoccuringtasks), nperm=1000)
Jonckheere-Terpstra test
data:
JT = 150830, p-value = 0.008
alternative hypothesis: two.sided
These analyses show that there is indeed a statistically significant negative trend in the relative quality of the generated models as the amount of reoccuring tasks increases. Let’s verify this further by comparing different levels pairwise.
kruskalmc(avg_recall ~reoccuringtasks, data=results_heuristics)
Multiple comparison test after Kruskal-Wallis
p.value: 0.05
Comparisons
obs.dif critical.dif difference
0-0.05 68.439516 96.73456 FALSE
0-0.1 73.391129 96.73456 FALSE
0-0.15 112.137097 96.73456 TRUE
0-0.2 76.782258 96.73456 FALSE
0-0.25 75.483871 96.73456 FALSE
0-0.3 93.334677 96.73456 FALSE
0.05-0.1 4.951613 96.73456 FALSE
0.05-0.15 43.697581 96.73456 FALSE
0.05-0.2 8.342742 96.73456 FALSE
0.05-0.25 7.044355 96.73456 FALSE
0.05-0.3 24.895161 96.73456 FALSE
0.1-0.15 38.745968 96.73456 FALSE
0.1-0.2 3.391129 96.73456 FALSE
0.1-0.25 2.092742 96.73456 FALSE
0.1-0.3 19.943548 96.73456 FALSE
0.15-0.2 35.354839 96.73456 FALSE
0.15-0.25 36.653226 96.73456 FALSE
0.15-0.3 18.802419 96.73456 FALSE
0.2-0.25 1.298387 96.73456 FALSE
0.2-0.3 16.552419 96.73456 FALSE
0.25-0.3 17.850806 96.73456 FALSE
These results seem to confirm our impression that Heuristics Miner is sensitive to the presence of reoccurring tasks, but does not seem to produce models of steadily decreasing relative quality as the level of reoccurring tasks increases.
We continue with the ILP miner. First, we will check the assumptions for a standard ANOVA.
leveneTest(results_ilp$avg_recall, results_ilp$reoccuringtasks, center=median)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 6 0.8767 0.5114
861
model1 <- aov(avg_recall ~ reoccuringtasks, data= results_ilp)
plot(model1, which=2)
Levene’s test gives no evidence against homogeneity of variance here (p = 0.51), but the normality assumption still appears to be violated, thus we fall back to comparing the average rankings of the models instead.
print("Average Rank per level Reoccuring Tasks")
[1] "Average Rank per level Reoccuring Tasks"
results_ilp$rank_recall <- rank(-results_ilp$avg_recall)
by(results_ilp$rank_recall, results_ilp$reoccuringtasks, mean)
results_ilp$reoccuringtasks: 0
[1] 457.2177
-------------------------------------------------------------------
results_ilp$reoccuringtasks: 0.05
[1] 444.5282
-------------------------------------------------------------------
results_ilp$reoccuringtasks: 0.1
[1] 423.9798
-------------------------------------------------------------------
results_ilp$reoccuringtasks: 0.15
[1] 435.1048
-------------------------------------------------------------------
results_ilp$reoccuringtasks: 0.2
[1] 424.1008
-------------------------------------------------------------------
results_ilp$reoccuringtasks: 0.25
[1] 434.8306
-------------------------------------------------------------------
results_ilp$reoccuringtasks: 0.3
[1] 421.7379
results_ilp %>%
ggplot(aes(x= reoccuringtasks ,y =rank_recall)) + geom_violin(scale="width", draw_quantiles = c(0.25, 0.5, 0.75)) +
ggtitle("Distribution of ranks (based on recall)")
These data seem to suggest that ILP is not sensitive to the level of reoccurring tasks. To test this impression statistically, we will rely on the KW test and the Jonckheere test. The latter tests whether there is a significant trend within the data (across increasing levels of reoccuringtasks).
kruskal.test(avg_recall~ reoccuringtasks, results_ilp)
Kruskal-Wallis rank sum test
data: avg_recall by reoccuringtasks
Kruskal-Wallis chi-squared = 3.7951, df = 6, p-value = 0.7044
jonckheere.test(results_ilp$avg_recall, as.numeric(results_ilp$reoccuringtasks), nperm=1000)
Jonckheere-Terpstra test
data:
JT = 165880, p-value = 0.106
alternative hypothesis: two.sided
These analyses are in line with our impression: neither the Kruskal-Wallis test nor the Jonckheere test is statistically significant. We therefore find no evidence that the quality of models generated by ILP miner depends on the level of reoccurring tasks, when measured in terms of recall values.
We conclude with the Inductive miner. First, we will check the assumptions for a standard ANOVA.
leveneTest(results_inductive$avg_recall, results_inductive$reoccuringtasks, center=median)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 6 23.723 < 2.2e-16 ***
861
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model1 <- aov(avg_recall ~ reoccuringtasks, data= results_inductive)
plot(model1, which=2)
Both the assumptions of normality and homogeneity appear to be violated, thus we fall back to comparing the average rankings of the models instead.
print("Average Rank per level Reoccuring Tasks")
[1] "Average Rank per level Reoccuring Tasks"
results_inductive$rank_recall <- rank(-results_inductive$avg_recall)
by(results_inductive$rank_recall, results_inductive$reoccuringtasks, mean)
results_inductive$reoccuringtasks: 0
[1] 245.2298
-------------------------------------------------------------------
results_inductive$reoccuringtasks: 0.05
[1] 299.379
-------------------------------------------------------------------
results_inductive$reoccuringtasks: 0.1
[1] 408.1935
-------------------------------------------------------------------
results_inductive$reoccuringtasks: 0.15
[1] 497.879
-------------------------------------------------------------------
results_inductive$reoccuringtasks: 0.2
[1] 518.125
-------------------------------------------------------------------
results_inductive$reoccuringtasks: 0.25
[1] 496.5806
-------------------------------------------------------------------
results_inductive$reoccuringtasks: 0.3
[1] 576.1129
results_inductive %>%
ggplot(aes(x= reoccuringtasks ,y =rank_recall)) + geom_violin(scale="width", draw_quantiles = c(0.25, 0.5, 0.75)) +
ggtitle("Distribution of ranks (based on recall)")
These data seem to suggest that as the level of reoccurring tasks increases, the models generated by Inductive miner deteriorate, although the effect seems to level off as we reach higher levels of reoccurring tasks. To test this impression statistically, we will rely on the KW test and the Jonckheere test. The latter tests whether there is a significant trend within the data (across increasing levels of reoccuringtasks).
kruskal.test(avg_recall~ reoccuringtasks, results_inductive)
Kruskal-Wallis rank sum test
data: avg_recall by reoccuringtasks
Kruskal-Wallis chi-squared = 178.13, df = 6, p-value < 2.2e-16
jonckheere.test(results_inductive$avg_recall, as.numeric(results_inductive$reoccuringtasks), nperm=1000)
Jonckheere-Terpstra test
data:
JT = 107610, p-value = 0.002
alternative hypothesis: two.sided
These analyses show that there is a statistically significant trend in the relative quality of the generated models as the amount of reoccurring tasks increases, and the average ranks above indicate that this trend is negative. Let’s verify this further by comparing different levels pairwise.
kruskalmc(avg_recall ~reoccuringtasks, data=results_inductive)
Multiple comparison test after Kruskal-Wallis
p.value: 0.05
Comparisons
obs.dif critical.dif difference
0-0.05 54.149194 96.73456 FALSE
0-0.1 162.963710 96.73456 TRUE
0-0.15 252.649194 96.73456 TRUE
0-0.2 272.895161 96.73456 TRUE
0-0.25 251.350806 96.73456 TRUE
0-0.3 330.883065 96.73456 TRUE
0.05-0.1 108.814516 96.73456 TRUE
0.05-0.15 198.500000 96.73456 TRUE
0.05-0.2 218.745968 96.73456 TRUE
0.05-0.25 197.201613 96.73456 TRUE
0.05-0.3 276.733871 96.73456 TRUE
0.1-0.15 89.685484 96.73456 FALSE
0.1-0.2 109.931452 96.73456 TRUE
0.1-0.25 88.387097 96.73456 FALSE
0.1-0.3 167.919355 96.73456 TRUE
0.15-0.2 20.245968 96.73456 FALSE
0.15-0.25 1.298387 96.73456 FALSE
0.15-0.3 78.233871 96.73456 FALSE
0.2-0.25 21.544355 96.73456 FALSE
0.2-0.3 57.987903 96.73456 FALSE
0.25-0.3 79.532258 96.73456 FALSE
These results seem to confirm our impression. Inductive Miner is sensitive to the level of reoccurring tasks, resulting in models of lower quality (measured in terms of recall values). However, from a level of around 15% reoccurring tasks in the log, this effect seems to have reached a plateau and stays stable.
Let’s continue with precision. First we will verify if there are significant differences between the models generated by the four miners in terms of precision value.
First we check the homogeneity and normality assumptions.
leveneTest(results$avg_precision, results$miner, center=median)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 3 103 < 2.2e-16 ***
3468
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model1 <- aov(avg_precision ~ miner, data=results)
plot(model1, which=2)
The analyses illustrate that both assumptions are violated, so we proceed by focusing on the relative differences and the ranks of the generated models.
Therefore, we start by ranking the discovered models based on their precision value and compare the average ranking per miner.
print("Average Rank per Miner")
[1] "Average Rank per Miner"
results$rank_precision <- rank(-results$avg_precision)
by(results$rank_precision, results$miner, mean)
results$miner: Alpha +
[1] 2184.514
-------------------------------------------------------------------
results$miner: Heuristics
[1] 2422.416
-------------------------------------------------------------------
results$miner: ILP
[1] 1002.904
-------------------------------------------------------------------
results$miner: Inductive
[1] 1336.165
Based on the average (precision) ranking, these results again show that ILP creates the best models. However, the order of the other three miners is now different. Based on the precision metric, Inductive miner produces relatively better models than Alpha+, which in turn seems to outperform Heuristics miner. To get an idea of the distribution of the ranks, we create some violin plots.
results %>%
ggplot(aes(x= miner ,y =rank_precision)) + geom_violin(scale="width", draw_quantiles = c(0.25, 0.5, 0.75)) +
ggtitle("Distribution of ranks (based on precision)")
These plots reveal a clear distinction between the models generated by ILP and Inductive Miner on the one hand and Alpha+ and Heuristics Miner on the other hand.
Note, that these distributions focus on the relative differences between the quality of the models, not the absolute differences. For purpose of illustration, one can find a violin plot below which shows the distribution of the actual precision values (instead of the ranks).
results %>%
ggplot(aes(x= miner ,y =avg_precision)) + geom_violin(scale="width", draw_quantiles = c(0.25, 0.5, 0.75)) +
ggtitle("Distribution of precision values")
We will now continue focusing on the relative differences instead. Based on the above analysis, it appears that the four miners generate models which rank on average differently (based on precision values). We will use a Kruskal-Wallis test to see if these differences in average ranking are statistically significant.
kruskal.test(avg_precision ~ miner, results)
Kruskal-Wallis rank sum test
data: avg_precision by miner
Kruskal-Wallis chi-squared = 1191.9, df = 3, p-value < 2.2e-16
These results show that there are significant ranking differences between the models created by the four miners in terms of precision. The order suggested by the plot is ILP > Inductive > Alpha+ > Heuristics. We can use a post-hoc test to see which of these differences are significant.
kruskalmc(avg_precision ~ miner, results)
Multiple comparison test after Kruskal-Wallis
p.value: 0.05
Comparisons
obs.dif critical.dif difference
Alpha +-Heuristics 237.9021 126.9474 TRUE
Alpha +-ILP 1181.6106 126.9474 TRUE
Alpha +-Inductive 848.3491 126.9474 TRUE
Heuristics-ILP 1419.5127 126.9474 TRUE
Heuristics-Inductive 1086.2512 126.9474 TRUE
ILP-Inductive 333.2615 126.9474 TRUE
It appears that all pairwise differences are statistically significant. Thus, based on precision values, models generated by ILP rank on average better than models generated by the Inductive Miner, which rank on average better than models generated by the Alpha+ Miner, which in turn rank better on average than models generated by the Heuristics Miner.
Next, we want to test whether the presence/absence of infrequent behavior in the log has an impact on the average ranking based on precision values. For this we will split the data between experiments with infrequent behavior and experiments without infrequent behavior and verify the ranking for these two subsets. We will start the analysis for experiments where infrequent behavior was absent.
First we will test to see if we can apply a standard ANOVA analysis
leveneTest(results_freq$avg_precision, results_freq$miner, center=median)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 3 57.78 < 2.2e-16 ***
1732
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model1 <- aov(avg_precision ~ miner, data=results_freq)
plot(model1, which=2)
Both Levene’s test and the QQ-plot indicate that the assumptions of homogeneity of variance and normality are violated. Therefore we will focus on the comparison of average rankings.
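To make the homogeneity check less of a black box: Levene's test with center=median amounts to a one-way ANOVA on the absolute deviations of each observation from its group median. The sketch below illustrates this on synthetic data in base R; the variable names and the data are made up for the example and are not part of the experiment results.

```r
set.seed(42)
# Synthetic stand-in for the results: four groups with unequal spread.
miner <- factor(rep(c("A", "B", "C", "D"), each = 50))
score <- rnorm(200, mean = 0.5, sd = rep(c(0.05, 0.10, 0.20, 0.30), each = 50))

# Absolute deviation of each observation from its own group median.
abs_dev <- abs(score - ave(score, miner, FUN = median))

# One-way ANOVA on those deviations; its F test is the Levene statistic.
levene_table <- anova(lm(abs_dev ~ miner))
print(levene_table)
```

A small p-value in the first row indicates that the spread differs between groups, which is the situation reported for the four miners above.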
print("Average Rank per Miner")
[1] "Average Rank per Miner"
results_freq$rank_precision <- rank(-results_freq$avg_precision)
by(results_freq$rank_precision, results_freq$miner, mean)
results_freq$miner: Alpha +
[1] 1091.543
-------------------------------------------------------------------
results_freq$miner: Heuristics
[1] 1246.615
-------------------------------------------------------------------
results_freq$miner: ILP
[1] 508.985
-------------------------------------------------------------------
results_freq$miner: Inductive
[1] 626.8571
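A quick sanity check on these average ranks, assuming the four miners have equal numbers of experiments: the per-miner means must then average to the overall mean rank (N + 1) / 2. The sketch also shows how rank(-x) treats ties, which matters here because many precision values are exactly 0 or 1.

```r
# Average ranks per miner as reported above (no infrequent behavior, precision).
mean_rank <- c(`Alpha +` = 1091.543, Heuristics = 1246.615,
               ILP = 508.985, Inductive = 626.8571)
N <- 1736   # 1732 residual df + 4 groups in the Levene output above

# With equal group sizes the group means average to the overall mean rank.
print(mean(mean_rank))
print((N + 1) / 2)

# rank(-x) gives the best (largest) value the lowest rank; ties share their average rank.
tie_example <- rank(-c(1.0, 0.5, 1.0, 0.0))
print(tie_example)
```

A lower average rank therefore means better models, and a miner whose average rank is below (N + 1) / 2 performs better than the pooled average.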
results_freq %>%
ggplot(aes(x= miner ,y =rank_precision)) + geom_violin(scale="width", draw_quantiles = c(0.25, 0.5, 0.75)) +
ggtitle("Distribution of ranks (based on precision)")
These analyses show a similar ranking of the four miners.
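The Kruskal-Wallis test used throughout this notebook is built directly on such per-group average ranks. As a minimal sketch on synthetic data (the data and variable names are invented for the example), the statistic can be computed by hand and compared with the result of kruskal.test:

```r
set.seed(7)
# Synthetic scores, rounded so that ties occur (as they do in the real data).
group <- factor(rep(c("A", "B", "C", "D"), each = 30))
score <- round(rbeta(120, 2, 2) + rep(c(0, 0.05, 0.2, 0.3), each = 30), 1)

N <- length(score)
r <- rank(score)                       # mid-ranks over the pooled sample
n_g <- tapply(r, group, length)        # group sizes
mean_rank <- tapply(r, group, mean)    # average rank per group

# Weighted squared distance of each group's mean rank from the overall mean rank.
H_raw <- 12 / (N * (N + 1)) * sum(n_g * (mean_rank - (N + 1) / 2)^2)

# Correction for ties: tie_sizes holds the number of observations per distinct value.
tie_sizes <- table(score)
H <- H_raw / (1 - sum(tie_sizes^3 - tie_sizes) / (N^3 - N))

kt <- kruskal.test(score ~ group)
print(c(by_hand = H, kruskal_test = unname(kt$statistic)))
```

The statistic grows when the group mean ranks move away from the overall mean rank, so widely separated average ranks such as those above lead to a large chi-squared value.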
We will use a Kruskal-Wallis test to see if these differences in average ranking are statistically significant.
kruskal.test(avg_precision ~ miner, results_freq)
Kruskal-Wallis rank sum test
data: avg_precision by miner
Kruskal-Wallis chi-squared = 661.04, df = 3, p-value < 2.2e-16
These results show that there are significant ranking differences between the models created by the four miners in terms of precision. The order suggested by the plot is ILP > Inductive > Alpha+ > Heuristics. We can use a post-hoc test to see which of these differences are significant.
kruskalmc(avg_precision ~ miner, results_freq)
Multiple comparison test after Kruskal-Wallis
p.value: 0.05
Comparisons
obs.dif critical.dif difference
Alpha +-Heuristics 155.0726 89.77831 TRUE
Alpha +-ILP 582.5576 89.77831 TRUE
Alpha +-Inductive 464.6855 89.77831 TRUE
Heuristics-ILP 737.6302 89.77831 TRUE
Heuristics-Inductive 619.7581 89.77831 TRUE
ILP-Inductive 117.8721 89.77831 TRUE
It appears that all pairwise comparisons are statistically significant.
We now repeat the analysis for the experiments where infrequent behavior was present. First we will test to see if we can apply a standard ANOVA analysis.
leveneTest(results_infreq$avg_precision, results_infreq$miner, center=median)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 3 56.602 < 2.2e-16 ***
1732
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model1 <- aov(avg_precision ~ miner, data=results_infreq)
plot(model1, which=2)
Both Levene’s test and the QQ-plot indicate that the assumptions of homogeneity of variance and normality are violated. Therefore we will focus on the comparison of average rankings.
print("Average Rank per Miner")
[1] "Average Rank per Miner"
results_infreq$rank_precision <- rank(-results_infreq$avg_precision)
by(results_infreq$rank_precision, results_infreq$miner, mean)
results_infreq$miner: Alpha +
[1] 1091.851
-------------------------------------------------------------------
results_infreq$miner: Heuristics
[1] 1177.096
-------------------------------------------------------------------
results_infreq$miner: ILP
[1] 495.6532
-------------------------------------------------------------------
results_infreq$miner: Inductive
[1] 709.3998
results_infreq %>%
ggplot(aes(x= miner ,y =rank_precision)) + geom_violin(scale="width", draw_quantiles = c(0.25, 0.5, 0.75)) +
ggtitle("Distribution of ranks (based on precision)")
These analyses again show a similar ranking of the four miners.
We will use a Kruskal-Wallis test to see if these differences in average ranking are statistically significant.
kruskal.test(avg_precision ~ miner, results_infreq)
Kruskal-Wallis rank sum test
data: avg_precision by miner
Kruskal-Wallis chi-squared = 539.13, df = 3, p-value < 2.2e-16
These results show that there are significant ranking differences between the models created by the four miners in terms of precision. The order suggested by the plot is ILP > Inductive > Alpha+ > Heuristics. We can use a post-hoc test to see which of these differences are significant.
kruskalmc(avg_precision ~ miner, results_infreq)
Multiple comparison test after Kruskal-Wallis
p.value: 0.05
Comparisons
obs.dif critical.dif difference
Alpha +-Heuristics 85.24424 89.77831 FALSE
Alpha +-ILP 596.19816 89.77831 TRUE
Alpha +-Inductive 382.45161 89.77831 TRUE
Heuristics-ILP 681.44240 89.77831 TRUE
Heuristics-Inductive 467.69585 89.77831 TRUE
ILP-Inductive 213.74654 89.77831 TRUE
It appears that all pairwise comparisons are statistically significant, except for the difference between Alpha+ and Heuristics.
Next, we will investigate how the performance of each of the mining algorithms is influenced by the level of reoccuring tasks in the log files.
First, we will check the assumptions for a standard ANOVA analysis
leveneTest(results_alpha$avg_precision, results_alpha$reoccuringtasks, center=median)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 6 0.3516 0.9091
861
model1 <- aov(avg_precision ~ reoccuringtasks, data= results_alpha)
plot(model1, which=2)
While the assumption of homogeneity seems to hold, the assumption of normality appears to be violated. To remain consistent, we fall back to comparing the average rankings of the models instead.
print("Average Rank per level Reoccuring Tasks")
[1] "Average Rank per level Reoccuring Tasks"
results_alpha$rank_precision <- rank(-results_alpha$avg_precision)
by(results_alpha$rank_precision, results_alpha$reoccuringtasks, mean)
results_alpha$reoccuringtasks: 0
[1] 319.0242
-------------------------------------------------------------------
results_alpha$reoccuringtasks: 0.05
[1] 346.6169
-------------------------------------------------------------------
results_alpha$reoccuringtasks: 0.1
[1] 394.3508
-------------------------------------------------------------------
results_alpha$reoccuringtasks: 0.15
[1] 435.4879
-------------------------------------------------------------------
results_alpha$reoccuringtasks: 0.2
[1] 483.2177
-------------------------------------------------------------------
results_alpha$reoccuringtasks: 0.25
[1] 506.5887
-------------------------------------------------------------------
results_alpha$reoccuringtasks: 0.3
[1] 556.2137
results_alpha %>%
ggplot(aes(x= reoccuringtasks ,y =rank_precision)) + geom_violin(scale="width", draw_quantiles = c(0.25, 0.5, 0.75)) +
ggtitle("Distribution of ranks (based on precision)")
These data seem to suggest that as the level of reoccuring tasks increases, the models generated by the Alpha+ miner deteriorate. However, this effect only becomes clearly visible from a level of 15% reoccuring tasks. To test this impression statistically, we will rely on the KW-test and the Jonckheere Test.
kruskal.test(avg_precision~ reoccuringtasks, results_alpha)
Kruskal-Wallis rank sum test
data: avg_precision by reoccuringtasks
Kruskal-Wallis chi-squared = 90.762, df = 6, p-value < 2.2e-16
jonckheere.test(results_alpha$avg_precision, as.numeric(results_alpha$reoccuringtasks), nperm=1000)
Jonckheere-Terpstra test
data:
JT = 121400, p-value = 0.002
alternative hypothesis: two.sided
These analyses show that there is a statistically significant negative trend in the relative quality of the generated models as the amount of reoccuring tasks increases. Let’s verify this further by comparing different levels pairwise.
kruskalmc(avg_precision ~reoccuringtasks, data=results_alpha)
Multiple comparison test after Kruskal-Wallis
p.value: 0.05
Comparisons
obs.dif critical.dif difference
0-0.05 27.59274 96.73456 FALSE
0-0.1 75.32661 96.73456 FALSE
0-0.15 116.46371 96.73456 TRUE
0-0.2 164.19355 96.73456 TRUE
0-0.25 187.56452 96.73456 TRUE
0-0.3 237.18952 96.73456 TRUE
0.05-0.1 47.73387 96.73456 FALSE
0.05-0.15 88.87097 96.73456 FALSE
0.05-0.2 136.60081 96.73456 TRUE
0.05-0.25 159.97177 96.73456 TRUE
0.05-0.3 209.59677 96.73456 TRUE
0.1-0.15 41.13710 96.73456 FALSE
0.1-0.2 88.86694 96.73456 FALSE
0.1-0.25 112.23790 96.73456 TRUE
0.1-0.3 161.86290 96.73456 TRUE
0.15-0.2 47.72984 96.73456 FALSE
0.15-0.25 71.10081 96.73456 FALSE
0.15-0.3 120.72581 96.73456 TRUE
0.2-0.25 23.37097 96.73456 FALSE
0.2-0.3 72.99597 96.73456 FALSE
0.25-0.3 49.62500 96.73456 FALSE
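These results illustrate that the quality of the models decreases significantly whenever the level of reoccuring tasks increases by 15 percentage points or more; none of the differences between levels that are 5 or 10 percentage points apart is significant.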
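The pattern in the pairwise comparisons for Alpha+ is easier to see when the comparisons are grouped by how far apart the two levels are. The sketch below recomputes the observed differences from the average ranks listed above and compares them with the critical difference reported by kruskalmc; the variable names are chosen for this illustration.

```r
# Average precision ranks of Alpha+ per level of reoccuring tasks (copied from above).
mean_rank <- c(319.0242, 346.6169, 394.3508, 435.4879, 483.2177, 506.5887, 556.2137)
critical_dif <- 96.73456                 # critical difference reported by kruskalmc

pairs <- t(combn(length(mean_rank), 2))  # all 21 pairs of level indices
gap <- (pairs[, 2] - pairs[, 1]) * 5     # distance between levels in percentage points
obs_dif <- abs(mean_rank[pairs[, 2]] - mean_rank[pairs[, 1]])
significant <- obs_dif > critical_dif

# Share of significant comparisons for each gap size.
share_significant <- tapply(significant, gap, mean)
print(share_significant)
```

A share of 0 means that no comparison at that distance is significant and a share of 1 means that all of them are, so the table shows at which distance between levels the differences become detectable.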
First, we will check the assumptions for a standard ANOVA analysis
leveneTest(results_heuristics$avg_precision, results_heuristics$reoccuringtasks, center=median)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 6 1.8471 0.08723 .
861
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model1 <- aov(avg_precision ~ reoccuringtasks, data= results_heuristics)
plot(model1, which=2)
Again, Levene’s test gives no conclusive evidence against the homogeneity assumption (p = 0.087), but the QQ-plot shows that the residuals are not normally distributed. Therefore, we fall back to comparing the average rankings of the models instead.
print("Average Rank per level Reoccuring Tasks")
[1] "Average Rank per level Reoccuring Tasks"
results_heuristics$rank_precision <- rank(-results_heuristics$avg_precision)
by(results_heuristics$rank_precision, results_heuristics$reoccuringtasks, mean)
results_heuristics$reoccuringtasks: 0
[1] 398.6694
-------------------------------------------------------------------
results_heuristics$reoccuringtasks: 0.05
[1] 416.375
-------------------------------------------------------------------
results_heuristics$reoccuringtasks: 0.1
[1] 449.6492
-------------------------------------------------------------------
results_heuristics$reoccuringtasks: 0.15
[1] 473.2984
-------------------------------------------------------------------
results_heuristics$reoccuringtasks: 0.2
[1] 434.3911
-------------------------------------------------------------------
results_heuristics$reoccuringtasks: 0.25
[1] 443.6331
-------------------------------------------------------------------
results_heuristics$reoccuringtasks: 0.3
[1] 425.4839
results_heuristics %>%
ggplot(aes(x= reoccuringtasks ,y =rank_precision)) + geom_violin(scale="width", draw_quantiles = c(0.25, 0.5, 0.75)) +
ggtitle("Distribution of ranks (based on precision)")
These data seem to suggest that the precision score of models discovered by the Heuristics Miner is relatively stable against increasing levels of reoccuring tasks. To test this impression statistically, we will rely on the KW-test and the Jonckheere Test.
kruskal.test(avg_precision~ reoccuringtasks, results_heuristics)
Kruskal-Wallis rank sum test
data: avg_precision by reoccuringtasks
Kruskal-Wallis chi-squared = 6.9349, df = 6, p-value = 0.3269
jonckheere.test(results_heuristics$avg_precision, as.numeric(results_heuristics$reoccuringtasks), nperm=1000)
Jonckheere-Terpstra test
data:
JT = 157040, p-value = 0.262
alternative hypothesis: two.sided
These analyses show that there is indeed no statistically significant trend in the relative quality of the generated models as the amount of reoccuring tasks increases.
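To help read the JT values, the following base R sketch computes a Jonckheere-Terpstra statistic by hand and derives a permutation p-value. It assumes that the statistic counts, over all pairs of an earlier and a later group, how often the later value is the larger one, and that there are 124 experiments per level (868 / 7). Whether clinfun uses exactly this convention and this permutation scheme is an assumption; the helper names jt_statistic and jt_expected are hypothetical.

```r
# Jonckheere-Terpstra statistic: over all pairs of groups (earlier, later), count
# how often the later group's value exceeds the earlier one (ties count one half).
jt_statistic <- function(x, g) {
  g <- as.integer(factor(g))              # ordered group index 1..k
  k <- max(g)
  total <- 0
  for (i in seq_len(k - 1)) {
    for (j in (i + 1):k) {
      d <- outer(x[g == i], x[g == j], "-")
      total <- total + sum(d < 0) + 0.5 * sum(d == 0)
    }
  }
  total
}

# Expected value without any trend: half of all between-group pairs.
jt_expected <- function(sizes) (sum(sizes)^2 - sum(sizes^2)) / 4
expected_notebook <- jt_expected(rep(124, 7))   # 7 levels, assumed 124 runs each
print(expected_notebook)

# Synthetic example in which quality drops as the level rises.
set.seed(1)
level <- rep(c(0, 0.1, 0.2, 0.3), each = 20)
score <- 1 - 2 * level + rnorm(80, sd = 0.1)

observed <- jt_statistic(score, level)
permuted <- replicate(999, jt_statistic(sample(score), level))

# Two-sided permutation p-value; the observed statistic counts as one permutation.
p_low  <- (1 + sum(permuted <= observed)) / 1000
p_high <- (1 + sum(permuted >= observed)) / 1000
p_two_sided <- min(1, 2 * min(p_low, p_high))
print(c(JT = observed, expected = jt_expected(rep(20, 4)), p = p_two_sided))
```

Under these assumptions the no-trend expectation for the notebook's design is 161448. The value of 157040 for Heuristics Miner is close to it, whereas 121400 for Alpha+ is far below, which corresponds to a decreasing trend. With 1000 permutations the smallest two-sided p-value this scheme can produce is 0.002, which would explain why that value appears for the strongly trending miner.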
First, we will check the assumptions for a standard ANOVA analysis
leveneTest(results_ilp$avg_precision, results_ilp$reoccuringtasks, center=median)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 6 29.985 < 2.2e-16 ***
861
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model1 <- aov(avg_precision ~ reoccuringtasks, data= results_ilp)
plot(model1, which=2)
Both the assumptions of normality and homogeneity appear to be violated, thus we fall back to comparing the average rankings of the models instead.
print("Average Rank per level Reoccuring Tasks")
[1] "Average Rank per level Reoccuring Tasks"
results_ilp$rank_precision <- rank(-results_ilp$avg_precision)
by(results_ilp$rank_precision, results_ilp$reoccuringtasks, mean)
results_ilp$reoccuringtasks: 0
[1] 154.9677
-------------------------------------------------------------------
results_ilp$reoccuringtasks: 0.05
[1] 296.2298
-------------------------------------------------------------------
results_ilp$reoccuringtasks: 0.1
[1] 413.1613
-------------------------------------------------------------------
results_ilp$reoccuringtasks: 0.15
[1] 441.6976
-------------------------------------------------------------------
results_ilp$reoccuringtasks: 0.2
[1] 521.246
-------------------------------------------------------------------
results_ilp$reoccuringtasks: 0.25
[1] 599.3185
-------------------------------------------------------------------
results_ilp$reoccuringtasks: 0.3
[1] 614.879
results_ilp %>%
ggplot(aes(x= reoccuringtasks ,y =rank_precision)) + geom_violin(scale="width", draw_quantiles = c(0.25, 0.5, 0.75)) +
ggtitle("Distribution of ranks (based on Precision)")
These data seem to suggest that the precision score of the models generated by ILP is very sensitive to the level of reoccuring tasks. To test this impression statistically, we will rely on the KW-test and the Jonckheere Test.
kruskal.test(avg_precision~ reoccuringtasks, results_ilp)
Kruskal-Wallis rank sum test
data: avg_precision by reoccuringtasks
Kruskal-Wallis chi-squared = 330.72, df = 6, p-value < 2.2e-16
jonckheere.test(results_ilp$avg_precision, as.numeric(results_ilp$reoccuringtasks), nperm=1000)
Jonckheere-Terpstra test
data:
JT = 83914, p-value = 0.002
alternative hypothesis: two.sided
These analyses show that there is a statistically significant negative trend in the relative quality of the generated models as the amount of reoccuring tasks increases. Let’s verify this further by comparing different levels pairwise.
kruskalmc(avg_precision ~reoccuringtasks, data=results_ilp)
Multiple comparison test after Kruskal-Wallis
p.value: 0.05
Comparisons
obs.dif critical.dif difference
0-0.05 141.26210 96.73456 TRUE
0-0.1 258.19355 96.73456 TRUE
0-0.15 286.72984 96.73456 TRUE
0-0.2 366.27823 96.73456 TRUE
0-0.25 444.35081 96.73456 TRUE
0-0.3 459.91129 96.73456 TRUE
0.05-0.1 116.93145 96.73456 TRUE
0.05-0.15 145.46774 96.73456 TRUE
0.05-0.2 225.01613 96.73456 TRUE
0.05-0.25 303.08871 96.73456 TRUE
0.05-0.3 318.64919 96.73456 TRUE
0.1-0.15 28.53629 96.73456 FALSE
0.1-0.2 108.08468 96.73456 TRUE
0.1-0.25 186.15726 96.73456 TRUE
0.1-0.3 201.71774 96.73456 TRUE
0.15-0.2 79.54839 96.73456 FALSE
0.15-0.25 157.62097 96.73456 TRUE
0.15-0.3 173.18145 96.73456 TRUE
0.2-0.25 78.07258 96.73456 FALSE
0.2-0.3 93.63306 96.73456 FALSE
0.25-0.3 15.56048 96.73456 FALSE
These results show a clear significant trend, which levels off from around 20% to 25% reoccuring tasks: none of the differences among the 0.2, 0.25 and 0.3 levels is significant.
First, we will check the assumptions for a standard ANOVA analysis
leveneTest(results_inductive$avg_precision, results_inductive$reoccuringtasks, center=median)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 6 28.328 < 2.2e-16 ***
861
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model1 <- aov(avg_precision ~ reoccuringtasks, data= results_inductive)
plot(model1, which=2)
Both the assumptions of normality and homogeneity appear to be violated, thus we fall back to comparing the average rankings of the models instead.
print("Average Rank per level Reoccuring Tasks")
[1] "Average Rank per level Reoccuring Tasks"
results_inductive$rank_precision <- rank(-results_inductive$avg_precision)
by(results_inductive$rank_precision, results_inductive$reoccuringtasks, mean)
results_inductive$reoccuringtasks: 0
[1] 159.5323
-------------------------------------------------------------------
results_inductive$reoccuringtasks: 0.05
[1] 302.2823
-------------------------------------------------------------------
results_inductive$reoccuringtasks: 0.1
[1] 406.5565
-------------------------------------------------------------------
results_inductive$reoccuringtasks: 0.15
[1] 511.8911
-------------------------------------------------------------------
results_inductive$reoccuringtasks: 0.2
[1] 514.3024
-------------------------------------------------------------------
results_inductive$reoccuringtasks: 0.25
[1] 557.5403
-------------------------------------------------------------------
results_inductive$reoccuringtasks: 0.3
[1] 589.3952
results_inductive %>%
ggplot(aes(x= reoccuringtasks ,y =rank_precision)) + geom_violin(scale="width", draw_quantiles = c(0.25, 0.5, 0.75)) +
ggtitle("Distribution of ranks (based on precision)")
These data seem to suggest that as the level of reoccuring tasks increases, the models generated by Inductive miner deteriorate. It is clear from the violin plot that the presence of reoccuring tasks vs the absence of reoccuring tasks makes a big difference for the precision values of models generated by Inductive Miner. To test this impression statistically, we will rely on the KW-test and the Jonckheere Test.
kruskal.test(avg_precision~ reoccuringtasks, results_inductive)
Kruskal-Wallis rank sum test
data: avg_precision by reoccuringtasks
Kruskal-Wallis chi-squared = 293.56, df = 6, p-value < 2.2e-16
jonckheere.test(results_inductive$avg_precision, as.numeric(results_inductive$reoccuringtasks), nperm=1000)
Jonckheere-Terpstra test
data:
JT = 91378, p-value = 0.002
alternative hypothesis: two.sided
These analyses show that there is a statistically significant negative trend in the relative quality of the generated models as the amount of reoccuring tasks increases. Let’s verify this further by comparing different levels pairwise.
kruskalmc(avg_precision ~reoccuringtasks, data=results_inductive)
Multiple comparison test after Kruskal-Wallis
p.value: 0.05
Comparisons
obs.dif critical.dif difference
0-0.05 142.75000 96.73456 TRUE
0-0.1 247.02419 96.73456 TRUE
0-0.15 352.35887 96.73456 TRUE
0-0.2 354.77016 96.73456 TRUE
0-0.25 398.00806 96.73456 TRUE
0-0.3 429.86290 96.73456 TRUE
0.05-0.1 104.27419 96.73456 TRUE
0.05-0.15 209.60887 96.73456 TRUE
0.05-0.2 212.02016 96.73456 TRUE
0.05-0.25 255.25806 96.73456 TRUE
0.05-0.3 287.11290 96.73456 TRUE
0.1-0.15 105.33468 96.73456 TRUE
0.1-0.2 107.74597 96.73456 TRUE
0.1-0.25 150.98387 96.73456 TRUE
0.1-0.3 182.83871 96.73456 TRUE
0.15-0.2 2.41129 96.73456 FALSE
0.15-0.25 45.64919 96.73456 FALSE
0.15-0.3 77.50403 96.73456 FALSE
0.2-0.25 43.23790 96.73456 FALSE
0.2-0.3 75.09274 96.73456 FALSE
0.25-0.3 31.85484 96.73456 FALSE
These results seem to confirm our impression. Inductive Miner is sensitive to the level of reoccuring tasks, which results in models of lower quality (measured in terms of precision values). However, from a level of around 15% reoccuring tasks in the log, this effect seems to reach a plateau: none of the differences among the levels from 0.15 upwards is significant.
Let’s continue with the F1-score. First we will verify if there are significant differences between the models generated by the four different miners in terms of F1 value.
First we check the homogeneity and normality assumptions.
leveneTest(results$avg_f1, results$miner, center=median)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 3 175.25 < 2.2e-16 ***
3468
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model1 <- aov(avg_f1 ~ miner, data=results)
plot(model1, which=2)
The analyses illustrate that both assumptions are violated, so we proceed by focusing on the relative differences and the ranks of the generated models.
Therefore, we start by ranking the discovered models based on their F1 value and compare the average ranking per miner.
print("Average Rank per Miner")
[1] "Average Rank per Miner"
results$rank_f1 <- rank(-results$avg_f1)
by(results$rank_f1, results$miner, mean)
results$miner: Alpha +
[1] 2321.445
-------------------------------------------------------------------
results$miner: Heuristics
[1] 2307.161
-------------------------------------------------------------------
results$miner: ILP
[1] 678.5081
-------------------------------------------------------------------
results$miner: Inductive
[1] 1638.887
Based on the average (F1) ranking, these results show again that ILP creates the best models. The order of the other three miners is now such that Inductive Miner produces relatively better models than Heuristics Miner, which seems to slightly outperform Alpha+ Miner. To get an idea of the distribution of the ranks, we create some violin plots.
results %>%
ggplot(aes(x= miner ,y =rank_f1)) + geom_violin(scale="width", draw_quantiles = c(0.25, 0.5, 0.75)) +
ggtitle("Distribution of ranks (based on f1)")
These plots reveal a mixed picture between the results on recall and precision, which was to be expected. ILP outperforms the other miners, followed by Inductive and then by Alpha+ and Heuristics which seem more or less equally well performing.
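Why F1 gives a mixed picture can be illustrated with a small sketch, assuming the usual definition of F1 as the harmonic mean of precision and recall. The helper f1_score is introduced here for illustration; the first call uses the precision and recall of the first experiment listed in the glimpse(results) output.

```r
# F1 as the harmonic mean of precision and recall (0 when both are 0).
f1_score <- function(precision, recall) {
  ifelse(precision + recall == 0, 0,
         2 * precision * recall / (precision + recall))
}

first_row <- f1_score(0.4992754, 1.0)   # first experiment in glimpse(results)
print(first_row)

# A model that is strong on one metric but weak on the other scores low on F1.
print(f1_score(0.9, 0.1))
print(f1_score(0.5, 0.5))
```

The first value is close to, but not necessarily identical to, the avg_f1 reported for that experiment, since the reported score is an average over the 10 folds. Because the harmonic mean is pulled towards the lower of the two metrics, a miner needs to do well on both recall and precision to rank high on F1.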
Note that these distributions show the relative differences in model quality, not the absolute differences. For illustration, the violin plot below shows the distribution of the actual F1 values (instead of the ranks).
results %>%
ggplot(aes(x= miner ,y =avg_f1)) + geom_violin(scale="width", draw_quantiles = c(0.25, 0.5, 0.75)) +
ggtitle("Distribution of f1 values")
We will now continue to focus on the relative differences instead. Based on the above analysis, it appears that the four miners generate models which rank on average differently (based on F1 values). We will use a Kruskal-Wallis test to see if these differences in average ranking are statistically significant.
kruskal.test(avg_f1 ~ miner, results)
Kruskal-Wallis rank sum test
data: avg_f1 by miner
Kruskal-Wallis chi-squared = 1555.8, df = 3, p-value < 2.2e-16
These results show that there is a significant difference between the four miners. The order suggested by the results is ILP > Inductive > Heuristics > Alpha+ . We can use a post-hoc test to see which of these differences are significant.
kruskalmc(avg_f1 ~ miner, results)
Multiple comparison test after Kruskal-Wallis
p.value: 0.05
Comparisons
obs.dif critical.dif difference
Alpha +-Heuristics 14.28399 126.9474 FALSE
Alpha +-ILP 1642.93664 126.9474 TRUE
Alpha +-Inductive 682.55818 126.9474 TRUE
Heuristics-ILP 1628.65265 126.9474 TRUE
Heuristics-Inductive 668.27419 126.9474 TRUE
ILP-Inductive 960.37846 126.9474 TRUE
It appears that, except for the difference between Alpha+ and Heuristics, all pairwise comparisons are statistically significant. Thus, based on F1 values, models generated by ILP rank on average better than models generated by the Inductive Miner, which rank on average better than models generated by the Alpha+ or Heuristics Miner.
Next, we want to test whether the presence/absence of infrequent behavior in the log has an impact on the average ranking based on f1 values. For this we will split the data between experiments with infrequent behavior and experiments without infrequent behavior and verify the ranking for these two subsets. We will start the analysis for experiments where infrequent behavior was absent.
First we will test to see if we can apply a standard ANOVA analysis
leveneTest(results_freq$avg_f1, results_freq$miner, center=median)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 3 86.517 < 2.2e-16 ***
1732
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model1 <- aov(avg_f1 ~ miner, data=results_freq)
plot(model1, which=2)
Both Levene’s test and the QQ-plot indicate that the assumptions of homogeneity of variance and normality are violated. Therefore we will focus on the comparison of average rankings.
print("Average Rank per Miner")
[1] "Average Rank per Miner"
results_freq$rank_f1 <- rank(-results_freq$avg_f1)
by(results_freq$rank_f1, results_freq$miner, mean)
results_freq$miner: Alpha +
[1] 1166.771
-------------------------------------------------------------------
results_freq$miner: Heuristics
[1] 1177.698
-------------------------------------------------------------------
results_freq$miner: ILP
[1] 341.1406
-------------------------------------------------------------------
results_freq$miner: Inductive
[1] 788.3906
results_freq %>%
ggplot(aes(x= miner ,y =rank_f1)) + geom_violin(scale="width", draw_quantiles = c(0.25, 0.5, 0.75)) +
ggtitle("Distribution of ranks (based on f1)")
These analyses show a similar ranking of the four miners.
We will use a Kruskal-Wallis test to see if these differences in average ranking are statistically significant.
kruskal.test(avg_f1 ~ miner, results_freq)
Kruskal-Wallis rank sum test
data: avg_f1 by miner
Kruskal-Wallis chi-squared = 812.34, df = 3, p-value < 2.2e-16
These results show that there are significant ranking differences between the models created by the four miners in terms of F1. The order suggested by the results is ILP > Inductive > Alpha+ > Heuristics. We can use a post-hoc test to see which of these differences are significant.
kruskalmc(avg_f1 ~ miner, results_freq)
Multiple comparison test after Kruskal-Wallis
p.value: 0.05
Comparisons
obs.dif critical.dif difference
Alpha +-Heuristics 10.92742 89.77831 FALSE
Alpha +-ILP 825.63018 89.77831 TRUE
Alpha +-Inductive 378.38018 89.77831 TRUE
Heuristics-ILP 836.55760 89.77831 TRUE
Heuristics-Inductive 389.30760 89.77831 TRUE
ILP-Inductive 447.25000 89.77831 TRUE
It appears that all pairwise comparisons are statistically significant, except again for the one between Alpha+ and Heuristics.
We now repeat the analysis for the experiments where infrequent behavior was present. First we will test to see if we can apply a standard ANOVA analysis.
leveneTest(results_infreq$avg_f1, results_infreq$miner, center=median)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 3 95.61 < 2.2e-16 ***
1732
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model1 <- aov(avg_f1 ~ miner, data=results_infreq)
plot(model1, which=2)
Both Levene’s test and the QQ-plot indicate that the assumptions of homogeneity of variance and normality are violated. Therefore we will focus on the comparison of average rankings.
print("Average Rank per Miner")
[1] "Average Rank per Miner"
results_infreq$rank_f1 <- rank(-results_infreq$avg_f1)
by(results_infreq$rank_f1, results_infreq$miner, mean)
results_infreq$miner: Alpha +
[1] 1150.52
-------------------------------------------------------------------
results_infreq$miner: Heuristics
[1] 1135.862
-------------------------------------------------------------------
results_infreq$miner: ILP
[1] 334.2385
-------------------------------------------------------------------
results_infreq$miner: Inductive
[1] 853.3802
results_infreq %>%
ggplot(aes(x= miner ,y =rank_f1)) + geom_violin(scale="width", draw_quantiles = c(0.25, 0.5, 0.75)) +
ggtitle("Distribution of ranks (based on F1)")
These analyses again show a similar ranking of the four miners.
We will use a Kruskal-Wallis test to see if these differences in average ranking are statistically significant.
kruskal.test(avg_f1 ~ miner, results_infreq)
Kruskal-Wallis rank sum test
data: avg_f1 by miner
Kruskal-Wallis chi-squared = 756.01, df = 3, p-value < 2.2e-16
These results show that there are significant ranking differences between the models created by the four miners in terms of F1. The order suggested by the plot is ILP > Inductive > Heuristics > Alpha+. We can use a post-hoc test to see which of these differences are significant.
kruskalmc(avg_f1 ~ miner, results_infreq)
Multiple comparison test after Kruskal-Wallis
p.value: 0.05
Comparisons
obs.dif critical.dif difference
Alpha +-Heuristics 14.65783 89.77831 FALSE
Alpha +-ILP 816.28111 89.77831 TRUE
Alpha +-Inductive 297.13940 89.77831 TRUE
Heuristics-ILP 801.62327 89.77831 TRUE
Heuristics-Inductive 282.48157 89.77831 TRUE
ILP-Inductive 519.14171 89.77831 TRUE
It appears that all pairwise comparisons are statistically significant, except for the difference between Alpha+ and Heuristics.
Next, we will investigate how the performance of each of the mining algorithms is influenced by the level of reoccuring tasks in the log files.
First, we will check the assumptions for a standard ANOVA analysis
leveneTest(results_alpha$avg_f1, results_alpha$reoccuringtasks, center=median)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 6 3.4901 0.002044 **
861
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model1 <- aov(avg_f1 ~ reoccuringtasks, data= results_alpha)
plot(model1, which=2)
Both the assumptions of homogeneity and normality are violated. Therefore, we fall back to comparing the average rankings of the models instead.
print("Average Rank per level Reoccuring Tasks")
[1] "Average Rank per level Reoccuring Tasks"
results_alpha$rank_f1 <- rank(-results_alpha$avg_f1)
by(results_alpha$rank_f1, results_alpha$reoccuringtasks, mean)
results_alpha$reoccuringtasks: 0
[1] 319.2621
-------------------------------------------------------------------
results_alpha$reoccuringtasks: 0.05
[1] 358.5645
-------------------------------------------------------------------
results_alpha$reoccuringtasks: 0.1
[1] 399.8911
-------------------------------------------------------------------
results_alpha$reoccuringtasks: 0.15
[1] 452.9113
-------------------------------------------------------------------
results_alpha$reoccuringtasks: 0.2
[1] 469.1492
-------------------------------------------------------------------
results_alpha$reoccuringtasks: 0.25
[1] 493.7944
-------------------------------------------------------------------
results_alpha$reoccuringtasks: 0.3
[1] 547.9274
results_alpha %>%
ggplot(aes(x= reoccuringtasks ,y =rank_f1)) + geom_violin(scale="width", draw_quantiles = c(0.25, 0.5, 0.75)) +
ggtitle("Distribution of ranks (based on F1)")
These data seem to suggest that as the level of reoccuring tasks increases, the models generated by the Alpha+ miner deteriorate. However, this effect only becomes clearly visible from a level of 15% reoccuring tasks. To test this impression statistically, we will rely on the KW-test and the Jonckheere Test.
kruskal.test(avg_f1~ reoccuringtasks, results_alpha)
Kruskal-Wallis rank sum test
data: avg_f1 by reoccuringtasks
Kruskal-Wallis chi-squared = 76.605, df = 6, p-value = 1.793e-14
jonckheere.test(results_alpha$avg_f1, as.numeric(results_alpha$reoccuringtasks), nperm=1000)
Jonckheere-Terpstra test
data:
JT = 124790, p-value = 0.002
alternative hypothesis: two.sided
These analyses show that there is a statistically significant negative trend in the relative quality of the generated models as the amount of reoccuring tasks increases. Let’s verify this further by comparing different levels pairwise.
kruskalmc(avg_f1 ~reoccuringtasks, data=results_alpha)
Multiple comparison test after Kruskal-Wallis
p.value: 0.05
Comparisons
obs.dif critical.dif difference
0-0.05 39.30242 96.73456 FALSE
0-0.1 80.62903 96.73456 FALSE
0-0.15 133.64919 96.73456 TRUE
0-0.2 149.88710 96.73456 TRUE
0-0.25 174.53226 96.73456 TRUE
0-0.3 228.66532 96.73456 TRUE
0.05-0.1 41.32661 96.73456 FALSE
0.05-0.15 94.34677 96.73456 FALSE
0.05-0.2 110.58468 96.73456 TRUE
0.05-0.25 135.22984 96.73456 TRUE
0.05-0.3 189.36290 96.73456 TRUE
0.1-0.15 53.02016 96.73456 FALSE
0.1-0.2 69.25806 96.73456 FALSE
0.1-0.25 93.90323 96.73456 FALSE
0.1-0.3 148.03629 96.73456 TRUE
0.15-0.2 16.23790 96.73456 FALSE
0.15-0.25 40.88306 96.73456 FALSE
0.15-0.3 95.01613 96.73456 FALSE
0.2-0.25 24.64516 96.73456 FALSE
0.2-0.3 78.77823 96.73456 FALSE
0.25-0.3 54.13306 96.73456 FALSE
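These results show that the quality of the models decreases significantly whenever the level of reoccurring tasks increases by 20 percentage points (0-0.2, 0.05-0.25, 0.1-0.3). An increase of 15 percentage points is significant when starting from 0 or 0.05, while the comparisons 0.1-0.25 and 0.15-0.3 fall just short of the critical difference (93.9 and 95.0 against 96.7).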
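To make the `critical.dif` column less of a black box, here is a minimal sketch of how such a threshold can be derived, assuming `kruskalmc` follows the Siegel and Castellan multiple-comparison procedure and that each of the 7 levels holds 124 of the 868 observations (both are assumptions, not taken from the output itself).

```r
# Assumed design: 868 observations, 7 levels, 124 observations per level.
n_total <- 868
n_group <- 124
k <- 7
alpha <- 0.05

# Normal quantile with the significance level split over all k * (k - 1)
# one-sided pairwise comparisons.
z_crit <- qnorm(1 - alpha / (k * (k - 1)))

# Critical difference between the mean ranks of two groups.
critical_dif <- z_crit *
  sqrt(n_total * (n_total + 1) / 12 * (1 / n_group + 1 / n_group))

round(critical_dif, 2)
```

Under these assumptions the result should land close to the 96.73456 printed in the table. Because the design is balanced, the same threshold applies to every pair of levels.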
First, we check the assumptions of a standard ANOVA for the Heuristics Miner.
leveneTest(results_heuristics$avg_f1, results_heuristics$reoccuringtasks, center=median)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 6 2.5386 0.01924 *
861
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model1 <- aov(avg_f1 ~ reoccuringtasks, data= results_heuristics)
plot(model1, which=2)
Both the normality and the homogeneity-of-variance assumptions are violated. We therefore continue with the KW-test and compare the average ranks between the different levels of reoccurring tasks.
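The two assumption checks can be sketched in base R on synthetic data. The snippet below is illustrative only: the data frame `toy` and its columns are made up, the hand-rolled Levene test (an ANOVA on absolute deviations from the group medians) is, to my understanding, what `leveneTest(..., center=median)` computes, and the Shapiro-Wilk test is offered as a numeric complement to the Q-Q plot rather than as something the original analysis used.

```r
set.seed(42)

# Synthetic F1-like scores whose spread and shape differ per level.
toy <- data.frame(
  level = factor(rep(c("0", "0.15", "0.3"), each = 50)),
  f1 = c(rbeta(50, 5, 1), rbeta(50, 1, 1), rbeta(50, 0.5, 0.5))
)

# Homogeneity of variance: ANOVA on the absolute deviations from each
# group's median (the median-centred Levene test).
group_median <- ave(toy$f1, toy$level, FUN = median)
toy$abs_dev <- abs(toy$f1 - group_median)
levene_by_hand <- anova(lm(abs_dev ~ level, data = toy))

# Normality: Shapiro-Wilk test on the residuals of the ANOVA model.
model_resid <- residuals(aov(f1 ~ level, data = toy))
normality <- shapiro.test(model_resid)

levene_by_hand
normality
```

Small p-values in either test point to a violated assumption, which is the situation in which a rank-based test such as Kruskal-Wallis is the safer choice.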
print("Average Rank per level Reoccuring Tasks")
[1] "Average Rank per level Reoccuring Tasks"
results_heuristics$rank_f1 <- rank(-results_heuristics$avg_f1)
by(results_heuristics$rank_f1, results_heuristics$reoccuringtasks, mean)
results_heuristics$reoccuringtasks: 0
[1] 385.2903
-------------------------------------------------------------------
results_heuristics$reoccuringtasks: 0.05
[1] 415.8306
-------------------------------------------------------------------
results_heuristics$reoccuringtasks: 0.1
[1] 443.1411
-------------------------------------------------------------------
results_heuristics$reoccuringtasks: 0.15
[1] 484.4516
-------------------------------------------------------------------
results_heuristics$reoccuringtasks: 0.2
[1] 434.3911
-------------------------------------------------------------------
results_heuristics$reoccuringtasks: 0.25
[1] 436.2379
-------------------------------------------------------------------
results_heuristics$reoccuringtasks: 0.3
[1] 442.1573
results_heuristics %>%
ggplot(aes(x= reoccuringtasks ,y =rank_f1)) + geom_violin(scale="width", draw_quantiles = c(0.25, 0.5, 0.75)) +
ggtitle("Distribution of ranks (based on f1)")
These data suggest that, while the F1 score might be negatively influenced by the presence of reoccurring tasks, the F1 score of models discovered by the Heuristics Miner is relatively stable under increasing levels of reoccurring tasks. To test this impression statistically, we rely on the KW-test and the Jonckheere test.
kruskal.test(avg_f1~ reoccuringtasks, results_heuristics)
Kruskal-Wallis rank sum test
data: avg_f1 by reoccuringtasks
Kruskal-Wallis chi-squared = 10.666, df = 6, p-value = 0.09925
jonckheere.test(results_heuristics$avg_f1, as.numeric(results_heuristics$reoccuringtasks), nperm=1000)
Jonckheere-Terpstra test
data:
JT = 154160, p-value = 0.08
alternative hypothesis: two.sided
These analyses show that the data do not support the claim of a statistically significant trend in the relative quality of the generated models as the amount of reoccurring tasks increases.
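As a rough cross-check of the permutation p-value, the large-sample normal approximation of the Jonckheere-Terpstra statistic can be worked out by hand. This sketch assumes 7 balanced groups of 124 observations and uses the variance formula without tie correction; since the F1 scores contain many ties, the result is only indicative.

```r
# Assumed balanced design: 7 levels with 124 observations each.
n_i <- rep(124, 7)
n <- sum(n_i)

# Mean and variance of JT under the null hypothesis (no tie correction).
jt_mean <- (n^2 - sum(n_i^2)) / 4
jt_var <- (n^2 * (2 * n + 3) - sum(n_i^2 * (2 * n_i + 3))) / 72

# JT reported above for the Heuristics Miner.
jt_observed <- 154160

z <- (jt_observed - jt_mean) / sqrt(jt_var)
p_two_sided <- 2 * pnorm(-abs(z))

c(z = z, p = p_two_sided)
```

Under these assumptions the approximate two-sided p-value comes out around 0.08, in line with the permutation result, so the non-significant finding does not appear to be an artefact of using only 1000 permutations.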
First, we check the assumptions of a standard ANOVA for the ILP miner.
leveneTest(results_ilp$avg_f1, results_ilp$reoccuringtasks, center=median)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 6 26.868 < 2.2e-16 ***
861
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model1 <- aov(avg_f1 ~ reoccuringtasks, data= results_ilp)
plot(model1, which=2)
Both the assumptions of normality and homogeneity appear to be violated, thus we fall back to comparing the average rankings of the models instead.
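To illustrate how the average ranks used here are built, the toy example below ranks four made-up F1 scores. Negating the scores gives rank 1 to the best model, tied scores share the average of their ranks (the default of `rank`), and a higher mean rank therefore means lower quality. The vectors `toy_f1` and `toy_level` are hypothetical.

```r
# Four made-up F1 scores; two of them tie for the best value.
toy_f1 <- c(1.0, 0.0, 1.0, 0.5)

# Rank on the negated score so that the highest F1 receives the lowest rank.
toy_rank <- rank(-toy_f1)

# Made-up level of reoccurring tasks for each score.
toy_level <- factor(c(0, 0.3, 0, 0.3))

# Mean rank per level, the base-R equivalent of by(..., mean).
mean_rank <- tapply(toy_rank, toy_level, mean)

toy_rank
mean_rank
```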
print("Average Rank per level Reoccuring Tasks")
[1] "Average Rank per level Reoccuring Tasks"
results_ilp$rank_f1 <- rank(-results_ilp$avg_f1)
by(results_ilp$rank_f1, results_ilp$reoccuringtasks, mean)
results_ilp$reoccuringtasks: 0
[1] 160.1935
-------------------------------------------------------------------
results_ilp$reoccuringtasks: 0.05
[1] 294.1089
-------------------------------------------------------------------
results_ilp$reoccuringtasks: 0.1
[1] 411.7702
-------------------------------------------------------------------
results_ilp$reoccuringtasks: 0.15
[1] 444.0685
-------------------------------------------------------------------
results_ilp$reoccuringtasks: 0.2
[1] 519.3508
-------------------------------------------------------------------
results_ilp$reoccuringtasks: 0.25
[1] 598.1492
-------------------------------------------------------------------
results_ilp$reoccuringtasks: 0.3
[1] 613.8589
results_ilp %>%
ggplot(aes(x= reoccuringtasks ,y =rank_f1)) + geom_violin(scale="width", draw_quantiles = c(0.25, 0.5, 0.75)) +
ggtitle("Distribution of ranks (based on f1)")
These data suggest that the F1 score of the models generated by ILP is very sensitive to the level of reoccurring tasks. To test this impression statistically, we rely on the KW-test and the Jonckheere test.
kruskal.test(avg_f1~ reoccuringtasks, results_ilp)
Kruskal-Wallis rank sum test
data: avg_f1 by reoccuringtasks
Kruskal-Wallis chi-squared = 320.91, df = 6, p-value < 2.2e-16
jonckheere.test(results_ilp$avg_f1, as.numeric(results_ilp$reoccuringtasks), nperm=1000)
Jonckheere-Terpstra test
data:
JT = 84772, p-value = 0.002
alternative hypothesis: two.sided
These analyses show that there is a statistically significant negative trend in the relative quality of the generated models as the amount of reoccurring tasks increases. Let’s verify this further by comparing the levels pairwise.
kruskalmc(avg_f1 ~reoccuringtasks, data=results_ilp)
Multiple comparison test after Kruskal-Wallis
p.value: 0.05
Comparisons
obs.dif critical.dif difference
0-0.05 133.91532 96.73456 TRUE
0-0.1 251.57661 96.73456 TRUE
0-0.15 283.87500 96.73456 TRUE
0-0.2 359.15726 96.73456 TRUE
0-0.25 437.95565 96.73456 TRUE
0-0.3 453.66532 96.73456 TRUE
0.05-0.1 117.66129 96.73456 TRUE
0.05-0.15 149.95968 96.73456 TRUE
0.05-0.2 225.24194 96.73456 TRUE
0.05-0.25 304.04032 96.73456 TRUE
0.05-0.3 319.75000 96.73456 TRUE
0.1-0.15 32.29839 96.73456 FALSE
0.1-0.2 107.58065 96.73456 TRUE
0.1-0.25 186.37903 96.73456 TRUE
0.1-0.3 202.08871 96.73456 TRUE
0.15-0.2 75.28226 96.73456 FALSE
0.15-0.25 154.08065 96.73456 TRUE
0.15-0.3 169.79032 96.73456 TRUE
0.2-0.25 78.79839 96.73456 FALSE
0.2-0.3 94.50806 96.73456 FALSE
0.25-0.3 15.70968 96.73456 FALSE
These results show a clear, significant trend that levels off at around 25% reoccurring tasks: none of the comparisons among the levels 0.2, 0.25 and 0.3 is significant.
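The `obs.dif` column is the absolute difference between the mean ranks printed earlier, which makes it easy to inspect how much each 5-point step contributes. The sketch below re-enters the ILP mean ranks and the critical difference by hand from the output above; the variable names are hypothetical.

```r
# Mean ranks per level of reoccurring tasks, copied from the output above.
mean_rank_ilp <- c(
  `0` = 160.1935, `0.05` = 294.1089, `0.1` = 411.7702, `0.15` = 444.0685,
  `0.2` = 519.3508, `0.25` = 598.1492, `0.3` = 613.8589
)
critical_dif <- 96.73456

# Difference in mean rank between each pair of adjacent levels.
step_dif <- diff(mean_rank_ilp)
names(step_dif) <- paste(
  head(names(mean_rank_ilp), -1), names(mean_rank_ilp)[-1], sep = "-"
)

# Which single steps exceed the critical difference on their own?
significant_step <- step_dif > critical_dif

round(step_dif, 2)
significant_step
```

Only the first two steps (0-0.05 and 0.05-0.1) are significant on their own. Beyond 0.1 the deterioration per step shrinks, and significance only shows up across wider gaps.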
First, we check the assumptions of a standard ANOVA for the Inductive Miner.
leveneTest(results_inductive$avg_f1, results_inductive$reoccuringtasks, center=median)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 6 23.671 < 2.2e-16 ***
861
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
model1 <- aov(avg_f1 ~ reoccuringtasks, data= results_inductive)
plot(model1, which=2)
Both the assumptions of normality and homogeneity appear to be violated, thus we fall back to comparing the average rankings of the models instead.
print("Average Rank per level Reoccuring Tasks")
[1] "Average Rank per level Reoccuring Tasks"
results_inductive$rank_f1 <- rank(-results_inductive$avg_f1)
by(results_inductive$rank_f1, results_inductive$reoccuringtasks, mean)
results_inductive$reoccuringtasks: 0
[1] 208.9435
-------------------------------------------------------------------
results_inductive$reoccuringtasks: 0.05
[1] 287.1774
-------------------------------------------------------------------
results_inductive$reoccuringtasks: 0.1
[1] 408.2137
-------------------------------------------------------------------
results_inductive$reoccuringtasks: 0.15
[1] 511.9355
-------------------------------------------------------------------
results_inductive$reoccuringtasks: 0.2
[1] 516.0323
-------------------------------------------------------------------
results_inductive$reoccuringtasks: 0.25
[1] 521.1774
-------------------------------------------------------------------
results_inductive$reoccuringtasks: 0.3
[1] 588.0202
results_inductive %>%
ggplot(aes(x= reoccuringtasks ,y =rank_f1)) + geom_violin(scale="width", draw_quantiles = c(0.25, 0.5, 0.75)) +
ggtitle("Distribution of ranks (based on f1)")
These data suggest that the models generated by the Inductive miner deteriorate as the level of reoccurring tasks increases. To test this impression statistically, we rely on the KW-test and the Jonckheere test.
kruskal.test(avg_f1~ reoccuringtasks, results_inductive)
Kruskal-Wallis rank sum test
data: avg_f1 by reoccuringtasks
Kruskal-Wallis chi-squared = 231.79, df = 6, p-value < 2.2e-16
jonckheere.test(results_inductive$avg_f1, as.numeric(results_inductive$reoccuringtasks), nperm=1000)
Jonckheere-Terpstra test
data:
JT = 99196, p-value = 0.002
alternative hypothesis: two.sided
These analyses show that there is a statistically significant negative trend in the relative quality of the generated models as the amount of reoccurring tasks increases. Let’s verify this further by comparing the levels pairwise.
kruskalmc(avg_f1 ~reoccuringtasks, data=results_inductive)
Multiple comparison test after Kruskal-Wallis
p.value: 0.05
Comparisons
obs.dif critical.dif difference
0-0.05 78.233871 96.73456 FALSE
0-0.1 199.270161 96.73456 TRUE
0-0.15 302.991935 96.73456 TRUE
0-0.2 307.088710 96.73456 TRUE
0-0.25 312.233871 96.73456 TRUE
0-0.3 379.076613 96.73456 TRUE
0.05-0.1 121.036290 96.73456 TRUE
0.05-0.15 224.758065 96.73456 TRUE
0.05-0.2 228.854839 96.73456 TRUE
0.05-0.25 234.000000 96.73456 TRUE
0.05-0.3 300.842742 96.73456 TRUE
0.1-0.15 103.721774 96.73456 TRUE
0.1-0.2 107.818548 96.73456 TRUE
0.1-0.25 112.963710 96.73456 TRUE
0.1-0.3 179.806452 96.73456 TRUE
0.15-0.2 4.096774 96.73456 FALSE
0.15-0.25 9.241935 96.73456 FALSE
0.15-0.3 76.084677 96.73456 FALSE
0.2-0.25 5.145161 96.73456 FALSE
0.2-0.3 71.987903 96.73456 FALSE
0.25-0.3 66.842742 96.73456 FALSE
These results confirm our impression. The Inductive Miner is sensitive to the level of reoccurring tasks, which results in models of lower quality (measured here by F1). However, from around 15% reoccurring tasks in the log onward, the effect reaches a plateau: none of the comparisons among the levels 0.15 to 0.3 is significant.