This notebook presents the results of the pyannote.audio 2.0 pipeline, an open-source Python toolkit for speaker diarization. Pyannote.audio is built on the PyTorch machine learning framework and ships with a pre-trained speaker diarization pipeline, which is the one used here.
The pre-trained pyannote.audio speaker diarization pipeline was run on a ~70-minute audio file containing a public debate between 7 Dutch politicians, moderated by a presenter. A human-annotated version of the speaker turns and labels during the debate is also available as a reference.
The notebook reports the number (figure 1) and type (figure 2) of speech segments found by the model in relation to the reference, and the speakers' confusion (figure 3), along with the accuracy, precision, recall, and F1 metrics.
# Read the RTTM file produced by the diarization pipeline
ded21 <- read.table(paste0(path, "model_annotation_ded21.rttm"),
                    stringsAsFactors = TRUE, header = FALSE, sep = " ", dec = ".")
# Name the RTTM fields (any extra trailing column keeps an NA name)
rttm_cols <- c("type", "file", "chnl", "tbeg", "tdur",
               "ortho", "stype", "name", "conf")
names(ded21) <- rttm_cols
ded21 <- data.frame(ded21)
head(ded21)
## type file chnl tbeg tdur ortho stype name conf NA.
## 1 SPEAKER <NA> 1 2.00 0.22 <NA> <NA> SPEAKER_04 <NA> <NA>
## 2 SPEAKER <NA> 1 2.50 14.56 <NA> <NA> SPEAKER_04 <NA> <NA>
## 3 SPEAKER <NA> 1 17.22 18.39 <NA> <NA> SPEAKER_04 <NA> <NA>
## 4 SPEAKER <NA> 1 33.94 1.03 <NA> <NA> SPEAKER_03 <NA> <NA>
## 5 SPEAKER <NA> 1 35.61 0.52 <NA> <NA> SPEAKER_03 <NA> <NA>
## 6 SPEAKER <NA> 1 40.91 30.29 <NA> <NA> SPEAKER_00 <NA> <NA>
# Map the anonymous diarization labels to the politicians' names
mapping <- data.frame(
  SPEAKER_00 = 'Wilders',
  SPEAKER_01 = 'Kaag',
  SPEAKER_02 = 'Marijnissen',
  SPEAKER_03 = 'Rutte',
  SPEAKER_04 = 'Hoekstra',
  SPEAKER_05 = 'Klaver'
)
mapping
## SPEAKER_00 SPEAKER_01 SPEAKER_02 SPEAKER_03 SPEAKER_04 SPEAKER_05
## 1 Wilders Kaag Marijnissen Rutte Hoekstra Klaver
## type file chnl tbeg tdur ortho stype name conf NA
## 1 SPEAKER <NA> 1 2 34 <NA> <NA> Wilders <NA> <NA>
## 2 SPEAKER <NA> 1 41 32 <NA> <NA> Marijnissen <NA> <NA>
## 3 SPEAKER <NA> 1 77 30 <NA> <NA> Hoekstra <NA> <NA>
## 4 SPEAKER <NA> 1 109 33 <NA> <NA> Kaag <NA> <NA>
## 5 SPEAKER <NA> 1 143 32 <NA> <NA> Rutte <NA> <NA>
## 6 SPEAKER <NA> 1 177 28 <NA> <NA> Klaver <NA> <NA>
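The mapping is presumably used to rename the model's anonymous SPEAKER_xx labels so they can be compared with the human-annotated reference. A minimal sketch of one way to apply it (an assumption, since the notebook's own relabelling chunk is not shown):

# Sketch (assumed approach): replace SPEAKER_xx labels with politicians' names
label_map <- vapply(mapping, as.character, character(1))          # SPEAKER_00 = "Wilders", ...
ded21$name <- unname(label_map[as.character(ded21$name)])         # labels without a mapping become NA
head(ded21)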
Figure 1: Number of speech segments found by the model vs. the reference.
Figure 2: Type of speech segments found by the model vs. the reference.
How to read this graph: Ideally the matrix is square (i.e., it has the same set of predicted and true labels), so a perfect match between true and predicted speaker labels shows up as a red-coloured diagonal along the matrix. In our case the model failed to assign Burger's label, so the matrix is not perfectly square.
From the plot: Overall, Marijnissen is the best-recognised speaker. Wilders is the worst-recognised speaker, being most often confused with Klaver or Hoekstra. Longer speaking time is not always associated with better predictions from the model.
Figure 3: Speakers' confusion matrix (reference vs. predicted labels).
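The confusion matrix behind this figure can be built from an alignment of reference and predicted labels over the evaluated time frames. A minimal sketch, assuming two hypothetical vectors true_lab and pred_lab that hold the reference and model labels for each frame (these names are assumptions, not variables from the notebook):

# Sketch: confusion matrix from hypothetical aligned label vectors
conf_mat <- table(True = true_lab, Predicted = pred_lab)
conf_mat
# Row-normalise to get per-speaker confusion rates
round(prop.table(conf_mat, margin = 1), 2)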
To evaluate the model we need four components, derived from the matches and mismatches between the reference and the model's speech segments (a sketch of how they could be computed follows the definitions below).
When the reference and the model are in agreement:
True Positives - correctly predicted positive values: the speaker's label was present in the reference and the model correctly predicted it.
True Negatives - correctly predicted negative values: the speaker's label was not present in the reference and the model correctly did NOT predict it.
When the model contradicts the reference:
False Positives - the speaker's label was not present in the reference, but the model predicted its presence anyway.
False Negatives - the speaker's label was present in the reference, but the model failed to predict it.
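A minimal sketch of how these four counts could be obtained per speaker, reusing the hypothetical true_lab and pred_lab vectors introduced above:

# Sketch: per-speaker confusion counts from hypothetical aligned labels
confusion_counts <- function(true_lab, pred_lab, speaker) {
  c(TP = sum(true_lab == speaker & pred_lab == speaker),
    TN = sum(true_lab != speaker & pred_lab != speaker),
    FP = sum(true_lab != speaker & pred_lab == speaker),
    FN = sum(true_lab == speaker & pred_lab != speaker))
}
confusion_counts(true_lab, pred_lab, "Wilders")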
Accuracy is thus defined as the ratio of correctly predicted observations to the total number of observations. Q: "How often does the model correctly predict the speaker's label?"
\[ Accuracy = \frac{True\_negative + True\_positive}{True\_negative + True\_positive+ False\_negative + False\_positive} \]
## [1] "Wilders - Accuracy = 0.73"
## [1] "Marijnissen - Accuracy = 0.93"
## [1] "Hoekstra - Accuracy = 0.71"
## [1] "Kaag - Accuracy = 0.91"
## [1] "Rutte - Accuracy = 0.87"
## [1] "Klaver - Accuracy = 0.78"
## [1] "Burger - Accuracy = 0.96"
For our purposes, precision is the most important metric. Precision is the ratio of correctly predicted positive observations to the total predicted positive observations. Q: "When the model predicted a given speaker's label, how often was that speaker actually speaking?"
\[ Precision = \frac{True\_positive}{True\_positive+ False\_positive} \]
## [1] "Wilders - Precision = 0.07"
## [1] "Marijnissen - Precision = 0.92"
## [1] "Hoekstra - Precision = 0.24"
## [1] "Kaag - Precision = 0.77"
## [1] "Rutte - Precision = 0.75"
## [1] "Klaver - Precision = 0.35"
## [1] "Burger - Precision = NaN"
Recall is the ratio of correctly predicted positive observations to all observations in the actual class. Q: "Of all the time a given speaker was actually speaking, how much did the model correctly label as that speaker?"
\[ Recall = \frac{True\_positive}{True\_positive+ False\_negative} \]
## [1] "Wilders - Recall = 0.05"
## [1] "Marijnissen - Recall = 0.66"
## [1] "Hoekstra - Recall = 0.46"
## [1] "Kaag - Recall = 0.54"
## [1] "Rutte - Recall = 0.48"
## [1] "Klaver - Recall = 0.62"
## [1] "Burger - Recall = 0"
F1 score is the harmonic mean of Precision and Recall, so it takes both false positives and false negatives into account. In our case F1 is much more informative than accuracy because false positives and false negatives are unevenly distributed; accuracy works best when the two have a similar distribution.
\[ F1\_score = \frac{2 \times (Recall \times Precision)}{Recall + Precision} \]
| speakers | F1 score |
|---|---|
| Wilders | 0.06 |
| Marijnissen | 0.77 |
| Hoekstra | 0.32 |
| Kaag | 0.63 |
| Rutte | 0.59 |
| Klaver | 0.45 |
| Burger | NaN |
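Finally, a sketch of the F1 computation from the hypothetical precision() and recall() helpers above; Burger's F1 is NaN because his precision is undefined:

# Sketch: F1 score per speaker from the hypothetical helpers
f1_score <- function(counts) {
  p <- precision(counts)
  r <- recall(counts)
  2 * (p * r) / (p + r)
}
round(f1_score(confusion_counts(true_lab, pred_lab, "Marijnissen")), 2)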