In this evaluation, I take 100 utterances from the original data, which I call the baseline. Each baseline utterance is repeated, and for each repeat I write a different rephrasing. I then classify both the original utterance and its rephrasing, and compare the two classifications.

For example:

utterance_baseline                    utterance_new
how can i add something to my cart?   how do i put something to my cart?
how can i add something to my cart?   how can i change the items in my cart?
how can i add something to my cart?   how can i add items to my cart?

This is a work in progress.

Maybe the rephrasing could be automated with GPT-3?

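library(dplyr)

# Join each baseline utterance to its rephrasings and their classifications.
# inner_join() without `by` joins on the columns the two tables share.
baseline %>%
    inner_join(mapping) %>%
    inner_join(evaluation) %>%
    # TRUE when the rephrasing is classified into the same node as the baseline
    mutate(match = node_baseline == node_new) %>%
    select(utterance_baseline, utterance_new, label_baseline, label_new, match, node_baseline, node_new)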
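
To make the comparison concrete, here is a minimal, self-contained sketch of the same join on toy data, followed by one possible summary: the share of rephrasings that end up in the same node as their baseline utterance. The key columns `id_baseline` and `id_new`, the labels, and the node values are all hypothetical and invented for illustration. They are not the real schema or real classification results.

```r
library(dplyr)

# Hypothetical toy tables. Only the utterance/label/node column names come
# from the pipeline above; the id columns and all values are assumptions.
baseline <- data.frame(
  id_baseline = 1,
  utterance_baseline = "how can i add something to my cart?",
  label_baseline = "add to cart",
  node_baseline = "node_1",
  stringsAsFactors = FALSE
)

# One row per (baseline utterance, rephrasing) pair
mapping <- data.frame(
  id_baseline = c(1, 1, 1),
  id_new = c(1, 2, 3)
)

# Classification of each rephrasing (made-up outcomes)
evaluation <- data.frame(
  id_new = c(1, 2, 3),
  utterance_new = c(
    "how do i put something to my cart?",
    "how can i change the items in my cart?",
    "how can i add items to my cart?"
  ),
  label_new = c("add to cart", "edit cart", "add to cart"),
  node_new = c("node_1", "node_2", "node_1"),
  stringsAsFactors = FALSE
)

result <- baseline %>%
  inner_join(mapping, by = "id_baseline") %>%
  inner_join(evaluation, by = "id_new") %>%
  mutate(match = node_baseline == node_new)

# Share of rephrasings classified into the same node as their baseline
match_rate <- mean(result$match)
print(match_rate)
```

With these invented values, two of the three rephrasings match, so `match_rate` is 2/3. On real data, the rows where `match` is `FALSE` are the ones worth reading by hand.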