Sentence-Matching in the Bluegrass Corpus: Methodological Considerations

Describes the procedure for generating optimal matchings of sentences within pairs of speakers in the Bluegrass Corpus.

Homer White (https://github.com/homerhanumat), Georgetown College (KY) (https://www.georgetowncollege.edu)
2021-05-23

Introduction

The Bluegrass Corpus includes a set of 40 pairs of TED-talk speakers who are similar in many respects but who differ in whether their accent is deemed “foreign” or “native.” Each speaker has twelve video clips in which he or she is speaking a sentence. All 24 sentences associated with a particular pair of speakers were rated for difficulty by ten participants. For each speaker-pair we chose eight sentences from each speaker and matched them with each other, pairing a sentence from the “native” speaker with a sentence from the “foreign” speaker, in such a way that the mean rated difficulties of the sentences in each matched pair are as similar as possible. In keeping with the ideals of reproducible research, this technical report aims to describe the matching procedure precisely, so that other researchers can assess the quality of the procedure and can use it to reproduce the matching.

The procedure is implemented in the R programming language (R Core Team 2019). There are two ways for R-users to reproduce the matching:

  1. Those who are familiar with git may download or clone the author’s project repository https://github.com/homerhanumat/assignment on Github. In the root directory of the project, find the R Markdown source file for the article and knit it.
  2. Alternatively a user may copy the displayed code into an R-script and run it.

Preliminaries

In order to use the matching algorithm, make sure you have installed the following packages from CRAN:

install.packages(
  c(
    "readxl",       ## for data import
    "clue",         ## implement the Hungarian Algorithm
    "DescTools",    ## Yuen's trimmed t-test
    "tidyverse"     ## packages for data wrangling and graphics
    )
  )

You will want to attach the tidyverse packages:
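
library(tidyverse)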

Data Import and Munging

First we download the data. This can be done manually from the Open Science Framework https://osf.io/fd4uj/download, or in R-code from the author’s Github repository:

## make a directory to store data:
if (!dir.exists("data")) {
  dir.create("data")
}
## get the excel file:
download.file(
  url = "https://github.com/homerhanumat/assignment/raw/master/data/Experiment3.xlsx",
  destfile = "data/Experiment3.xlsx"
)

(Note that the data resides in a folder named data.)

Then we read the data into our R session:

sentences1 <- readxl::read_excel("data/Experiment3.xlsx")

Some data-munging:

sentences <-
  sentences1 %>% 
  ## pair-id came in as character vector, so change to integers:
  mutate(Pair = as.integer(Pair)) %>% 
  ## ditch most of the columns:
  select(Participant, Pair, Sentence, Accent, Condition, Rating) %>% 
  ## rename most of the variables
  rename(participant_id = Participant,
         pair_id = Pair,
         sentence_id = Sentence,
         speaker_accent = Accent,
         difficulty = Condition,
         rating = Rating) %>% 
  ## change rating scale from [-1, 1] to [-100, 100]
  mutate(rating = rating * 100) %>% 
  ## reshape data so that each row represents a single sentence:
  arrange(sentence_id) %>% 
  mutate(participant = rep(1:10, times = 960)) %>% 
  select(-participant_id) %>% 
  spread(key = participant, value = rating, sep = "_rating_")

## give better names to the columns containing participant ratings:
names(sentences)[5:14] <- paste("participant", 1:10, "rating", sep = "_")

## rearrange for easier viewing
sentences <-
  sentences %>% 
  arrange(pair_id)

## Duplicate sentences, discovered by Bailey McGuffey:
## Eliminate the repeated sentences 
## from the dataset:

bad_ids <- c("14AE3", "33BE5", "45BD11", "33BE4", "33BD9")
sentences <-
  sentences %>% 
  filter(!(sentence_id %in% bad_ids))

Here is the transformed data:

Sentence-Matching

The Idea of the Routine

For almost every speaker-pair, we had access to twelve sentences from each speaker in the pair. (Five sentences had to be removed from the original data, so that for three of the speaker-pairs one of the speakers had fewer than twelve sentences.)

For each speaker-pair, we wish to find a set of eight sentences from the Foreign speaker and to match them one-to-one with the members of a set of eight sentences from the Native speaker, in such a way that the resulting matched sentences are as similar as possible in terms of their difficulty-ratings as given by the ten participants who heard the sentences of both speakers.

As a criterion for similarity of difficulty of two given sentences, we use the absolute value of the t-statistic in a paired t-test involving the ten ratings for each sentence. As a measure of overall similarity in a proposed matching, we use the sum of the eight absolute values. The smaller this sum, the more “similar” we deem the matching to be.
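
As a quick illustration of the per-pair criterion, the sketch below computes the absolute t-statistic for two made-up vectors of ten ratings. (The numbers are hypothetical, and base R's t.test stands in here for simplicity; the actual routine below uses DescTools::YuenTTest.)

## hypothetical ratings of one Foreign sentence and one Native sentence
## by the same ten participants (not data from the study):
foreign_ratings <- c(-20, -35, 10, -50, -15, -40, 5, -25, -30, -10)
native_ratings  <- c(-25, -30, 15, -45, -20, -35, 0, -20, -40, -5)
abs(t.test(foreign_ratings, native_ratings, paired = TRUE)$statistic)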

To get a better idea of what we are attempting, consider the following matrix, which pertains to the pair of speakers with ID-number 1:

1BD10 1BD11 1BD12 1BD7 1BD8 1BD9 1BE1 1BE2 1BE3 1BE4 1BE5 1BE6
1AD10 0.328 1.744 1.325 1.128 1.872 0.499 0.944 0.257 1.107 1.374 0.870 1.042
1AD11 0.609 1.034 0.187 0.051 0.112 0.370 0.047 0.808 0.169 0.073 0.022 0.075
1AD12 1.118 0.439 0.291 0.072 0.241 0.310 0.032 0.811 0.481 0.116 0.062 0.090
1AD7 1.204 0.264 0.601 0.457 0.518 1.168 0.897 1.285 0.277 0.445 0.437 0.713
1AD8 1.595 0.026 0.947 0.449 0.589 0.686 0.477 1.585 0.183 0.323 0.339 0.588
1AD9 2.041 0.645 0.987 0.924 1.088 1.213 1.101 1.650 0.815 0.787 1.057 1.333
1AE1 1.952 0.163 0.809 0.117 0.153 0.520 0.335 1.125 0.055 0.136 0.147 0.487
1AE2 1.141 0.208 0.662 0.368 0.472 1.011 0.683 1.299 0.246 0.530 0.399 0.546
1AE3 1.311 0.464 0.901 0.534 0.659 1.294 0.846 1.786 0.387 1.223 0.582 0.804
1AE4 0.077 1.176 0.838 0.732 1.359 0.294 0.799 0.093 1.313 0.951 0.697 0.808
1AE5 0.563 2.325 1.352 2.016 1.976 1.007 1.139 1.057 1.237 1.665 1.052 1.441
1AE6 0.409 0.641 0.156 0.320 0.800 0.016 0.527 0.414 0.687 0.491 0.379 0.480

The row-names of the above matrix are sentence-IDs for the Foreign speaker; the column names are IDs of the sentences spoken by the Native speaker. Each cell is the absolute value of the t-statistic from the paired t-test applied to a sentence-pair. Thus, for example, in a paired t-test for sentences 1AD10 and 1BD10 the absolute value of the t-statistic was about 0.328.

(As is the case for most of the speakers, the matrix is 12-by-12 since we had twelve sentences from each speaker. However, three of the speaker-pairs had missing sentences from one of the speakers; for these pairs the matrix was correspondingly smaller.)

In any event, our task is to find a set of eight rows and a set of eight columns, and a matching between them, so as to make the sum of the corresponding eight cells as small as possible.

For the matrix above, the best choice turns out to be as follows:

1BD10 1BD11 1BD12 1BD7 1BD8 1BD9 1BE1 1BE2 1BE3 1BE4 1BE5 1BE6
1AD10 0.257
1AD11 0.022
1AD12 0.032
1AD7
1AD8 0.026
1AD9
1AE1 0.117
1AE2 0.246
1AE3
1AE4 0.077
1AE5
1AE6 0.016

In other words, the best matching is:

foreign native
1AD10 1BE2
1AD11 1BE5
1AD12 1BE1
1AD8 1BD11
1AE1 1BD7
1AE2 1BE3
1AE4 1BD10
1AE6 1BD9

But there are about 9.9 billion matchings to compare! Trying them all one-by-one would take much too long on most personal computers.
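
A quick sanity check on that count:

## choose 8 of the 12 Foreign sentences, 8 of the 12 Native sentences,
## and one of 8! ways to pair them up:
choose(12, 8)^2 * factorial(8)   ## 9,879,408,000 -- about 9.9 billion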

It turns out that the problem is an instance of the well-known Assignment Problem in applied combinatorics (Papadimitriou and Steiglitz 1982), and several efficient solutions are available. We choose to use the Hungarian Algorithm, as implemented in the R-package clue, authored by Kurt Hornik.
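
To illustrate the interface on a made-up cost matrix (not data from the study): clue::solve_LSAP() takes a matrix with no more rows than columns and returns, for each row, the column it should be matched with so that the total cost is as small as possible.

library(clue)
## toy cost matrix: 3 rows to be matched with 3 of the 4 columns
cost <- rbind(c(4, 2, 8, 5),
              c(2, 3, 7, 9),
              c(3, 1, 6, 2))
solution <- solve_LSAP(cost, maximum = FALSE)
as.vector(solution)                               ## columns for rows 1-3: 2 1 4
sum(cost[cbind(seq_along(solution), solution)])   ## minimum total cost: 6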

In order to apply the Hungarian Algorithm we must begin with a matrix in which every row is to be matched with some column. Since we want only eight sentence-pairs, we cannot work directly with the original 12-by-12 matrix: we need an 8-by-12 matrix instead. Therefore we create, for each possible set of eight sentences from the twelve sentences of the Foreign speaker, a matrix in which the rows are the members of the set. One such matrix is as follows:

1BD10 1BD11 1BD12 1BD7 1BD8 1BD9 1BE1 1BE2 1BE3 1BE4 1BE5 1BE6
1AD10 0.328 1.744 1.325 1.128 1.872 0.499 0.944 0.257 1.107 1.374 0.870 1.042
1AD11 0.609 1.034 0.187 0.051 0.112 0.370 0.047 0.808 0.169 0.073 0.022 0.075
1AD12 1.118 0.439 0.291 0.072 0.241 0.310 0.032 0.811 0.481 0.116 0.062 0.090
1AD7 1.204 0.264 0.601 0.457 0.518 1.168 0.897 1.285 0.277 0.445 0.437 0.713
1AD8 1.595 0.026 0.947 0.449 0.589 0.686 0.477 1.585 0.183 0.323 0.339 0.588
1AD9 2.041 0.645 0.987 0.924 1.088 1.213 1.101 1.650 0.815 0.787 1.057 1.333
1AE1 1.952 0.163 0.809 0.117 0.153 0.520 0.335 1.125 0.055 0.136 0.147 0.487
1AE2 1.141 0.208 0.662 0.368 0.472 1.011 0.683 1.299 0.246 0.530 0.399 0.546

(This matrix corresponds to choosing the first eight sentences from the Foreign speaker.) We apply the Hungarian Algorithm to the matrix, arriving at the following result:

foreign native statistic
1AD10 1BE2 0.2565733
1AD11 1BE5 0.0223997
1AD12 1BE1 0.0321802
1AD7 1BE3 0.2772978
1AD8 1BD11 0.0260588
1AD9 1BE4 0.7869108
1AE1 1BD8 0.1529699
1AE2 1BD7 0.3682881

Of course we must apply the Hungarian algorithm for every possible set of eight sentences from the Foreign speaker. That’s a total of 495 sets, but the Hungarian algorithm is so efficient that the whole process runs very quickly on a personal computer, as we shall soon see. We retain the set whose best matching to the Native speaker has the highest possible similarity, i.e., the smallest sum of absolute values of t-statistics, arriving at the solution we showed earlier.

We also need to repeat the process for all forty pairs of speakers.

The Code

## Utility function to measure the difference in difficulty ratings.
## Uses the Yuen t-test, which allows for working with trimmed means.
## Returns the test-statistic and the P-value.
profile_diff <- function(x, y, trim) {
  test <- DescTools::YuenTTest(x, y, trim = trim, paired = TRUE)
  list(
    statistic = abs(test$statistic),
    p_value = test$p.value
  )
}

## Utility function: for a given speaker-pair, extract the matrix of
## participant ratings for each speaker (rows are participants,
## columns are sentences).
ratings_from_pair <- function(pair) {
  foreign <- pair %>% 
    filter(speaker_accent == "Foreign") %>% 
    select(ends_with("rating")) %>% 
    t() %>% 
    as.matrix()
  native <- pair %>% 
    filter(speaker_accent == "Native") %>% 
    select(ends_with("rating")) %>% 
    t() %>% 
    as.matrix()
  list(foreign = foreign, native = native)
}

## Utility function to make matrices of statistics and P-values
## for all pairs of sentences:  one from Speaker A (Foreign), the other
## from speaker B (Native).
make_diff_matrix <- function(pair, trim) {
  ratings <- ratings_from_pair(pair)
  a <- ratings$foreign
  b <- ratings$native
  mat_statistics <- matrix(0, nrow = ncol(a), ncol = ncol(b))
  mat_pvals <- matrix(0, nrow = ncol(a), ncol = ncol(b))
  for (i in 1:ncol(a)) {
    for (j in 1:ncol(b)) {
      result <- profile_diff(x = a[, i], y = b[, j], trim = trim)
      mat_statistics[i, j] <- result$statistic
      mat_pvals[i, j] <- result$p_value
    }
  }
  rownames(mat_statistics) <- pair %>% 
    filter(speaker_accent == "Foreign") %>%
    pull(sentence_id)
  colnames(mat_statistics) <- pair %>% 
    filter(speaker_accent == "Native") %>%
    pull(sentence_id)
  rownames(mat_pvals) <- pair %>% 
    filter(speaker_accent == "Foreign") %>%
    pull(sentence_id)
  colnames(mat_pvals) <- pair %>% 
    filter(speaker_accent == "Native") %>%
    pull(sentence_id)
  list(
    mat_statistics = mat_statistics,
    mat_pvals = mat_pvals
  )
}
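
## For illustration only (not part of the matching routine): the 12-by-12
## matrix displayed in the previous section for speaker-pair 1 can be
## reproduced along these lines, assuming the munged `sentences` data
## frame from above:
##   pair_1 <- sentences %>% filter(pair_id == 1)
##   round(make_diff_matrix(pair_1, trim = 0)$mat_statistics, 3)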

## Utility function.  Given a pair of speakers, find the best
## matching of sentences.  Uses brute force to go through
## all possible pairs of subsets of cardinality "select" from speaker A,
## with each A-set using the Hungarian Algorithm
## to find the best matching set of sentences from Speaker B.
## Keeps track of the best matching.
## Returns a data frame of results.
get_best <- function(pair, select, trim) {
  
  ## get sentence ids for each speaker:
  a_sentence_ids <- pair %>% 
    filter(speaker_accent == "Foreign") %>% 
    pull(sentence_id)
  b_sentence_ids <- pair %>% 
    filter(speaker_accent == "Native") %>% 
    pull(sentence_id)
  
  ## compute matrices of difference-statistics and p-values,
  ## for each pair of sentences
  diff_mats <- make_diff_matrix(pair, trim = trim)
  
  ## make all possible subsets of cardinality select from
  ## the sentences for the A-speaker (Foreign)
  size <- nrow(diff_mats[[1]])
  a_subsets_numeric <- utils::combn(x = size, m = select)
  
  ## prepare to loop through all A-speaker subsets
  n <- ncol(a_subsets_numeric)
  a_best <- ""
  b_best <- ""
  statistic <- 0
  pval <- 0
  diff_best <- Inf
  
  ## begin looping
  for (i in 1:n) {
    
    ## extract sentence ids from the numbers:
    subset_a <- a_sentence_ids[a_subsets_numeric[, i]]
    
    ## extract the relevant portion of the difference matrices
    dms <- diff_mats$mat_statistics[subset_a, ]
    
    ## Now comes the Hungarian Algorithm ...
    solution <- clue::solve_LSAP(x = dms, maximum = FALSE)
    
    ## extract B-speaker sentence-ids from the solution
    subset_b <- b_sentence_ids[solution]
    
    ## get statistics and p-values for each pair in the
    ## best matching
    sentence_pair_statistics <- numeric(select)
    sentence_pair_pvals <- numeric(select)
    for (j in 1:select) {
      a_location <- subset_a[j]
      b_location <- subset_b[j]
      sentence_pair_statistics[j] <- diff_mats$mat_statistics[a_location, b_location]
      sentence_pair_pvals[j] <- diff_mats$mat_pvals[a_location, b_location]
    }
    
    ## check to see whether this best match is better than the best match
    ## from a previously-analysed A-subset:
    sum_statistics <- sum(sentence_pair_statistics)
    if (sum_statistics < diff_best) {
      diff_best <- sum_statistics
      a_best <- subset_a
      b_best <- subset_b
      statistic <- sentence_pair_statistics
      pval <- sentence_pair_pvals
    }
  }
  
  ## return data frame of results for the best possible matching:
  data.frame(
    foreign = a_best,
    native = b_best,
    statistic = statistic,
    pval = pval)
}

## match_sentences ----

## This is the function you'll actually use.
## data is the original data frame
## select = desired number of sentence-pairs
##
## trim is there in case researchers desire to omit
## very high or low ratings.  (Setting trim = 0.1 would knock out
## the highest and the lowest of the ten differences in ratings.)
##
## Setting trace to TRUE results in a progress report to the console
## (not needed for the current small study)
##
## Result is a list of data frames, one for each speaker-pair,
## saying which sentence goes to which, and reporting the P-values
## of the Yuen T-Test.  This allows the user to flag pairs where the
## ratings are "too different".
match_sentences <- function(data, select, trim = 0, trace = FALSE) {
  pair_ids <- sort(unique(data$pair_id))
  pairs <- length(unique(data$pair_id))
  lst <- vector(mode = "list", length = pairs)
  for (i in 1:pairs) {
    if (trace) {
      cat("Working on speaker pair with id", pair_ids[i], "...\n")
    }
    pair <- data %>% 
      filter(pair_id == pair_ids[i])
    results <-  get_best(pair, select, trim)
    lst[[i]] <- results
  }
  names(lst) <- pair_ids
  lst
}

Implementation

Now we run the algorithm, using system.time() to record how long it takes:

system.time(
  results <- match_sentences(
    data = sentences,
    select = 8,
    trim = 0,  ## the default, actually,
    trace = FALSE
  )
)
   user  system elapsed 
  7.034   0.075   7.164 

Thanks to the Hungarian Algorithm the routine finishes in a satisfyingly small amount of time.

results is a list, each element of which is a data frame showing how to do the matching of sentences for a given pair of speakers. We can view the matching for the speaker-pair with id 17 as follows:

results[["17"]]
foreign native statistic pval
17AD10 17BD8 0.1290179 0.9001807
17AD12 17BD12 0.0017601 0.9986340
17AD9 17BE4 0.0989698 0.9233316
17AE1 17BE1 0.2760973 0.7887085
17AE2 17BD10 0.1696885 0.8690081
17AE3 17BD11 0.1310586 0.8986118
17AE4 17BD9 0.0401374 0.9688599
17AE6 17BE3 0.1955255 0.8493247

Note that we have retained the P-values for each paired t-test. We can use them as a check on the similarity of sentences. If we find sentence pairs for which the P-values are very small then we may choose not to include them in the Corpus.
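
For example, one might scan results for any matched pair whose P-value falls below some threshold (the 0.05 cutoff below is purely illustrative, not part of the study design):

## flag any sentence-pairs with small P-values:
flagged <- lapply(results, function(df) df[df$pval < 0.05, ])
flagged[sapply(flagged, nrow) > 0]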

Post-Hoc Analysis

Verifying Sentence-Similarity

Let’s gather all of the P-values:

pvals <- numeric()
for (i in 1:length(results)) {
  pvals <- c(pvals, results[[i]]$pval)
}

The smallest P-value is:

min(pvals)
[1] 0.4440327

Most of the P-values are quite high, as we see from the following density plot:

ggplot(data = NULL, aes(x = pvals)) +
  geom_density(fill = "burlywood") +
  geom_rug() +
  labs(x = "P-values")

Figure 1: t-test P-values for all 320 sentence pairs

It appears that the matching routine has succeeded in identifying sentence-pairs that are similar—at least with respect to mean difficulty!

Investigating Difficulty of Sentences

We worked with 955 sentences, each of which was rated by ten study-participants. Here is a density plot of all 9550 individual difficulty-ratings:


Figure 2: Density plot of all 9550 difficulty-ratings recorded in the study.
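
A sketch of one way to produce such a plot, using the rating columns created during the munging step above (variable names here are illustrative):

## gather the ten rating columns into one long column of ratings:
all_ratings <- sentences %>% 
  gather(key = "participant", value = "rating", ends_with("rating"))
ggplot(all_ratings, aes(x = rating)) +
  geom_density(fill = "burlywood") +
  labs(x = "difficulty rating")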

Next we see a violin plot of mean difficulty-ratings. Each dot is a single sentence; its vertical height is the mean of the ratings for the ten participants who rated it. We have separated the sentences into the 315 that were not selected as members of an optimal matching and the 640 sentences that were selected.
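
A sketch of one way to recover each sentence's selection status from the results object (the variable names here are illustrative, not from the original analysis):

## collect the ids of all sentences that appear in some optimal matching:
selected_ids <- unlist(
  lapply(results, function(df) c(as.character(df$foreign), as.character(df$native)))
)
sentence_status <- sentences %>% 
  mutate(selected = sentence_id %in% selected_ids)
table(sentence_status$selected)   ## expect 315 FALSE, 640 TRUE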


Figure 3: Violin plot of mean difficulty ratings, by status of sentence (selected for Corpus vs. not selected).

The sentences that were not selected vary a bit more in mean difficulty than sentences that were selected. This makes sense, for when a sentence is of unusually high or low difficulty there is less likelihood of finding—among the sentences of the other speaker in the speaker-pair—a sentence similar to it in difficulty.

The following box plots compare the mean difficulty-ratings of sentences spoken by foreign and native speakers, including those sentences that were not selected as members of an optimal pairing:


Figure 4: Boxplots of mean ratings for sentences by foreign and native speakers.

There appears to be essentially no difference in the centers of the distributions; see also the following table:

Table 1: Mean and median of mean ratings for sentences, by type of speaker.
speaker_accent mean median n
Foreign -14.36572 -14.6020 479
Native -15.00132 -15.0975 476
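
A sketch of how such a table can be computed from the munged data (illustrative; it assumes the rating-column names created above):

sentences %>% 
  mutate(mean_rating = rowMeans(select(., ends_with("rating")))) %>% 
  group_by(speaker_accent) %>% 
  summarize(mean = mean(mean_rating),
            median = median(mean_rating),
            n = n())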

References

Papadimitriou, C., and K. Steiglitz. 1982. Combinatorial Optimization: Algorithms and Complexity. Englewood Cliffs, NJ: Prentice Hall.

R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org.