Some of My Covid Work

The Situation

The Bluegrass Corpus

(Multilingual Lab, Eastern KY University)

Forty pairs of speakers (TED talks)
In each pair the two members have similar physical appearance, but
- one speaks with a “native” accent
- the other with a “foreign” accent
For each speaker there are 12 video clips in which he/she speaks a sentence
Ten study participants have rated each sentence for difficulty:
- -100 = completely easy
- +100 = could not be more difficult

Difficulty-Difference

For any two sentences, you can derive from the participant-ratings a numerical measure M of much they differ in their difficulty.

smaller number => more similar in difficulty
larger number => less similar

The Goal

We want users of the Corpus to be able to examine video/audio clips of pairs of speakers who:

look similar
are each speaking a sentence of about equal difficulty
but who differ considerably in their amount of accent (as perceived by Bluegrass “natives”)

The Task

From each pair, pick eight sentences and match them (sentence from native to sentence from foreign) so that the sum of the eight M’s is as small as possible.

The Issue

With 40 speaker pairs and 12 sentences from each speaker, a brute-force approach would require the construction of about 395 billion matchings.

Not enough time!

The Hungarian Algorithm

Weighted Bipartite Graph

graph

“matching three workers with three jobs”

Modelling

Each vertex represents a sentence
Vertices comprised of two disjoint sets:
- one set represents sentences from the native speaker
- the other represents sentences from the foreign speaker
Edges go only between vertices in different sets
The number on each edge is the difficulty-difference measure