This notebook presents the results of the Pyannote.audio 2.0 pipeline, an open-source toolkit written in Python for speaker diarization. Pyannote.audio is built on the PyTorch machine learning framework and comes with a pre-trained speaker diarization pipeline, which we have used here.

The pre-trained pyannote.audio speaker diarization pipeline was run on a ~70 min audio file: the recording of a public debate involving 7 Dutch politicians, moderated by a presenter. A human-annotated version of the speakers’ turns and labels during the debate is also available as a reference.

The notebook reports the number (figure 1) and type (figure 2) of speech segments found by the model in relation to the reference, and the speakers’ confusion matrix (figure 3), along with accuracy, F1, precision and recall metrics.

Read pyannote model’s output

# read the diarization output produced by pyannote (RTTM format, space-separated)
ded21 <- read.table(paste0(path, "model_annotation_ded21.rttm"),
                    stringsAsFactors = TRUE, header = FALSE, sep = " ", dec = ".")

# standard RTTM field names; the trailing tenth field (signal lookahead time)
# is left unnamed here and therefore shows up as "NA." in the printout below
colnames <- c("type", "file", "chnl", "tbeg", "tdur",
              "ortho", "stype", "name", "conf")
names(ded21) <- colnames
ded21 <- data.frame(ded21)   # re-wrap so the unnamed column gets a syntactic name
head(ded21)
##      type file chnl  tbeg  tdur ortho stype       name conf  NA.
## 1 SPEAKER <NA>    1  2.00  0.22  <NA>  <NA> SPEAKER_04 <NA> <NA>
## 2 SPEAKER <NA>    1  2.50 14.56  <NA>  <NA> SPEAKER_04 <NA> <NA>
## 3 SPEAKER <NA>    1 17.22 18.39  <NA>  <NA> SPEAKER_04 <NA> <NA>
## 4 SPEAKER <NA>    1 33.94  1.03  <NA>  <NA> SPEAKER_03 <NA> <NA>
## 5 SPEAKER <NA>    1 35.61  0.52  <NA>  <NA> SPEAKER_03 <NA> <NA>
## 6 SPEAKER <NA>    1 40.91 30.29  <NA>  <NA> SPEAKER_00 <NA> <NA>
Assign speaker mappings based on pyannote’s optimal mapping algorithm
# map pyannote's anonymous cluster labels to the politicians' names
mapping <- data.frame(
  SPEAKER_00 = 'Wilders',
  SPEAKER_01 = 'Kaag',
  SPEAKER_02 = 'Marijnissen',
  SPEAKER_03 = 'Rutte',
  SPEAKER_04 = 'Hoekstra',
  SPEAKER_05 = 'Klaver'
)
mapping
##   SPEAKER_00 SPEAKER_01  SPEAKER_02 SPEAKER_03 SPEAKER_04 SPEAKER_05
## 1    Wilders       Kaag Marijnissen      Rutte   Hoekstra     Klaver
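The chunk that applies this mapping is not shown above. A minimal sketch of the relabelling step, assuming cluster labels without a mapping are simply kept as they are:

# hypothetical relabelling step: turn the mapping into a named vector
# (SPEAKER_00 -> "Wilders", ...) and look up every model label
map_vec <- sapply(mapping, as.character)
mapped  <- map_vec[as.character(ded21$name)]
ded21$name <- ifelse(is.na(mapped), as.character(ded21$name), mapped)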

Load reference

##      type file chnl tbeg tdur ortho stype        name conf   NA
## 1 SPEAKER <NA>    1    2   34  <NA>  <NA>     Wilders <NA> <NA>
## 2 SPEAKER <NA>    1   41   32  <NA>  <NA> Marijnissen <NA> <NA>
## 3 SPEAKER <NA>    1   77   30  <NA>  <NA>    Hoekstra <NA> <NA>
## 4 SPEAKER <NA>    1  109   33  <NA>  <NA>        Kaag <NA> <NA>
## 5 SPEAKER <NA>    1  143   32  <NA>  <NA>       Rutte <NA> <NA>
## 6 SPEAKER <NA>    1  177   28  <NA>  <NA>      Klaver <NA> <NA>
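The chunk that loads the reference is not shown above; a minimal sketch, assuming the hand-annotated RTTM sits in the same directory under a hypothetical file name:

# hypothetical file name; the actual reference file may be stored differently
ref <- read.table(paste0(path, "reference_annotation_ded21.rttm"),
                  stringsAsFactors = TRUE, header = FALSE, sep = " ", dec = ".")
names(ref) <- colnames           # same RTTM field names as for the model output
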
Number of speech segments reported in the reference and predicted by the model per speaker
Figure 1

Visualize the number and type of speech segments predicted by pyannote in comparison to the reference.

The audio lasts ~70 min in total; each quadrant of the plot covers ~16 min of the recording.
Figure 2

The confusion matrix shows the association between actual vs. predicted speaker labels as a function of speech duration (in seconds).

How to read this graph: ideally, the matrix is square (i.e., it has the same set of predicted and true labels), and a perfect match between true and predicted speaker labels shows up as a red-coloured diagonal across the matrix. In our case the model has failed to assign Burger’s label, so the matrix is not perfectly square.

From the plot: overall, Marijnissen is the best recognised speaker. Wilders is the worst recognised, as he is most often confused with Klaver or Hoekstra. Longer speech is not always associated with better predictions from the model.

Figure 3
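The code behind figure 3 is not shown. One way to build such a duration-based confusion matrix is to discretise the timeline into short frames and tabulate reference against predicted labels; a minimal sketch, assuming the reference is in ref and ded21$name has been mapped to politician names as above:

frame <- 0.1                                               # frame length in seconds
t_end <- max(ref$tbeg + ref$tdur, ded21$tbeg + ded21$tdur)
grid  <- seq(0, t_end, by = frame)

# label every frame with the speaker active at that moment ("none" if silent);
# for overlapping segments the last segment listed wins (a simplification)
label_frames <- function(rttm, grid) {
  lab <- rep("none", length(grid))
  for (i in seq_len(nrow(rttm))) {
    active <- grid >= rttm$tbeg[i] & grid < rttm$tbeg[i] + rttm$tdur[i]
    lab[active] <- as.character(rttm$name[i])
  }
  lab
}

ref_lab <- label_frames(ref, grid)
hyp_lab <- label_frames(ded21, grid)

# seconds of speech for every reference/predicted label combination
conf_mat <- table(reference = ref_lab, predicted = hyp_lab) * frame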

Compute pyannote’s accuracy/F1 score, precision and recall

To evaluate the model we need four components, which are combinations of matching and mismatching speech segments between the reference and the model.

When the reference and the model are in agreement: true positives (the model assigns a speaker’s label while that speaker is actually speaking) and true negatives (the model does not assign the label while the speaker is not speaking).

When the model contradicts the reference: false positives (the model assigns the label while the speaker is not speaking) and false negatives (the model fails to assign the label while the speaker is speaking). These four counts can be derived per speaker from the frame-level labels sketched above, as shown below.
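A minimal sketch, treating each speaker as a one-vs-rest binary problem at the frame level (again assuming the hypothetical ref_lab and hyp_lab vectors from the previous chunk):

speakers <- setdiff(unique(ref_lab), "none")

counts <- t(sapply(speakers, function(s) {
  c(tp = sum(ref_lab == s & hyp_lab == s),   # label assigned while s is speaking
    fp = sum(ref_lab != s & hyp_lab == s),   # label assigned while s is not speaking
    fn = sum(ref_lab == s & hyp_lab != s),   # label missed while s is speaking
    tn = sum(ref_lab != s & hyp_lab != s))   # label correctly not assigned
}))
counts   # one row of frame counts per speaker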

Accuracy

Accuracy is thus defined as the ratio of correctly predicted observations over the total number of observations. Q: “How often does the model correctly predict the speaker’s label?”

\[ Accuracy = \frac{True\_negative + True\_positive}{True\_negative + True\_positive+ False\_negative + False\_positive} \]

## [1] "Wilders - Accuracy = 0.73"
## [1] "Marijnissen - Accuracy = 0.93"
## [1] "Hoekstra - Accuracy = 0.71"
## [1] "Kaag - Accuracy = 0.91"
## [1] "Rutte - Accuracy = 0.87"
## [1] "Klaver - Accuracy = 0.78"
## [1] "Burger - Accuracy = 0.96"

Precision

For our purposes, this is the most important metric. Precision is the ratio of correctly predicted positive observations over the total number of predicted positive observations. Q: “When the model assigns a speaker’s label, how often is that speaker actually speaking?”

\[ Precision = \frac{True\_positive}{True\_positive+ False\_positive} \]

## [1] "Wilders - Precision = 0.07"
## [1] "Marijnissen - Precision = 0.92"
## [1] "Hoekstra - Precision = 0.24"
## [1] "Kaag - Precision = 0.77"
## [1] "Rutte - Precision = 0.75"
## [1] "Klaver - Precision = 0.35"
## [1] "Burger - Precision = NaN"

Recall (Sensitivity)

Recall is the ratio of correctly predicted positive observations over all observations in the actual class. Q: “Of all the speech actually produced by a given speaker, how much did the model correctly attribute to them?”

\[ Recall = \frac{True\_positive}{True\_positive+ False\_negative} \]

## [1] "Wilders - Recall = 0.05"
## [1] "Marijnissen - Recall = 0.66"
## [1] "Hoekstra - Recall = 0.46"
## [1] "Kaag - Recall = 0.54"
## [1] "Rutte - Recall = 0.48"
## [1] "Klaver - Recall = 0.62"
## [1] "Burger - Recall = 0"

F1 score

The F1 score is the harmonic mean of precision and recall, so it takes both false positives and false negatives into account. In our case F1 is much more useful than accuracy, because false positives and false negatives are unevenly distributed; accuracy works best when their distributions are similar.

\[ F1\_score = \frac{2 * Precision * Recall}{Precision + Recall} \]

speakers      F1 score
Wilders       0.06
Marijnissen   0.77
Hoekstra      0.32
Kaag          0.63
Rutte         0.59
Klaver        0.45
Burger        NaN
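
A sketch of the same computation, using the hypothetical precision and recall vectors from the snippets above:

f1 <- 2 * (precision * recall) / (precision + recall)
data.frame(speakers = names(f1), F1_score = round(f1, 2))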