The Situation

The Bluegrass Corpus

(Multilingual Lab, Eastern KY University)

  • Forty pairs of speakers (TED talks)
  • In each pair the two members have similar physical appearance, but
    • one speaks with a “native” accent
    • the other with a “foreign” accent
  • For each speaker there are 12 video clips in which he/she speaks a sentence
  • Ten study participants have rated each sentence for difficulty:
    • -100 = completely easy
    • +100 = could not be more difficult

Difficulty-Difference

For any two sentences, you can derive from the participant-ratings a numerical measure M of much they differ in their difficulty.

  • smaller number => more similar in difficulty
  • larger number => less similar

The Goal

We want users of the Corpus to be able to examine video/audio clips of pairs of speakers who:

  • look similar
  • are each speaking a sentence of about equal difficulty
  • but who differ considerably in their amount of accent (as perceived by Bluegrass “natives”)

The Task

From each pair, pick eight sentences and match them (sentence from native to sentence from foreign) so that the sum of the eight M’s is as small as possible.

The Issue

With 40 speaker pairs and 12 sentences from each speaker, a brute-force approach would require the construction of about 395 billion matchings.

Not enough time!

The Hungarian Algorithm

Weighted Bipartite Graph

graph

“matching three workers with three jobs”

Modelling

  • Each vertex represents a sentence
  • Vertices comprised of two disjoint sets:
    • one set represents sentences from the native speaker
    • the other represents sentences from the foreign speaker
  • Edges go only between vertices in different sets
  • The number on each edge is the difficulty-difference measure

Focus on Matching 8 with 8

A small trick reduces the problem (for each speaker pair) to finding the optimal matching in a complete bipartite 8-8 graph.

Still a Lot of Matchings to Compare!

There are

\[8! = 8 \times 7\times 6 \times 5 \times 4 \times 3 \times 2 = 40,320\]

possible matchings. That’s a lot …

Hungarian Algorithm

… but the Hungarian Algorithm (Kuhn and Munkres, 1956) searches quickly and systematically for an optimal one.

Graph Again

graph

Example 1

Let’s look at a case with 5 workers and 5 jobs:

First Example

Example 1

Explore for Yourself

Letting Others Do the Heavy-Lifting

I did all of my work

  • data import,
  • data transformation, and
  • the matching

in the R programming language. (It helped that Kurt Hornik has a fine implementation of the Hungarian algorithm in his R-package clue.)

(1% mathematical modeling, 99% secretarial work)

Here is my report.