Describes the procedure for generating optimal matchings of sentences within pairs of speakers in the Bluegrass Corpus.
The Bluegrass Corpus includes a set of 40 pairs of TED-talk speakers who are similar in many respects but who differ in whether or not their accent is deemed “foreign” or “native.” Each speaker has twelve video clips in which he or she is speaking a sentence. All 24 sentences associated with a particular pair of speakers were rated for difficulty by ten participants. For each speaker-pair we chose eight sentences from each speaker and matched them with each other—a sentence from the “native” speaker to a sentence from the “foreign” speakers—in such a way that the mean rated difficulties of the sentences in each pair are as similar as possible. In keeping with the ideals of reproducible research, this technical report aims to describe the matching procedure precisely, so that other researchers can assess the quality of the procedure and can use the procedure to reproduce the matching.
The procedure is implemented in the R programming language (R Core Team 2019). There are two ways for R users to reproduce the matching: users of git may download or clone the author's project repository https://github.com/homerhanumat/assignment on GitHub, find the R Markdown source file for the article in the root directory of the project, and knit it; alternatively, one may work through the code presented in this report. Either way, in order to use the matching algorithm, make sure you have installed the following packages from CRAN:
install.packages(
  c(
    "readxl",    ## for data import
    "clue",      ## implement the Hungarian Algorithm
    "DescTools", ## Yuen's trimmed t-test
    "tidyverse"  ## packages for data wrangling and graphics
  )
)
You will want to attach the tidyverse packages:
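For example:

library(tidyverse)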
First we download the data. This can be done manually from the Open Science Framework https://osf.io/fd4uj/download, or with R code from the author's GitHub repository:
## make a directory to store data:
if (!dir.exists("data")) {
  dir.create("data")
}

## get the excel file:
download.file(
  url = "https://github.com/homerhanumat/assignment/raw/master/data/Experiment3.xlsx",
  destfile = "data/Experiment3.xlsx"
)
(Note that the data resides in a folder named data.)
Then we read the data into our R session:
sentences1 <- readxl::read_excel("data/Experiment3.xlsx")
Some data-munging:
sentences <-
  sentences1 %>%
  ## pair-id came in as a character vector, so change to integer:
  mutate(Pair = as.integer(Pair)) %>%
  ## ditch most of the columns:
  select(Participant, Pair, Sentence, Accent, Condition, Rating) %>%
  ## rename most of the variables:
  rename(participant_id = Participant,
         pair_id = Pair,
         sentence_id = Sentence,
         speaker_accent = Accent,
         difficulty = Condition,
         rating = Rating) %>%
  ## change rating scale from [-1, 1] to [-100, 100]:
  mutate(rating = rating * 100) %>%
  ## reshape data so that each row represents a single sentence:
  arrange(sentence_id) %>%
  mutate(participant = rep(1:10, times = 960)) %>%
  select(-participant_id) %>%
  spread(key = participant, value = rating, sep = "_rating_")
## give better names to the columns containing participant ratings:
names(sentences)[5:14] <- paste("participant", 1:10, "rating", sep = "_")
## rearrange for easier viewing:
sentences <-
  sentences %>%
  arrange(pair_id)
## Duplicate sentences, discovered by Bailey McGuffey:
## eliminate the repeated sentences from the dataset.
bad_ids <- c("14AE3", "33BE5", "45BD11", "33BE4", "33BD9")
sentences <-
  sentences %>%
  filter(!(sentence_id %in% bad_ids))
Here is the transformed data:
For almost every speaker-pair, we had access to twelve sentences from each speaker in the pair. (Five duplicate sentences had to be removed from the original data, so that for three of the speaker-pairs one of the speakers had fewer than twelve sentences.)
For each speaker-pair, we wish to find a set of eight sentences from the Foreign speaker and to match them one-to-one with the members of a set of eight sentences from the Native speaker, in such a way that the resulting matched sentences are as similar as possible in terms of their difficulty-ratings, as given by the ten participants who heard the sentences of both speakers.
As a criterion for similarity of difficulty of two given sentences, we use the absolute value of the t-statistic in a paired t-test involving the ten ratings for each sentence. As a measure of overall similarity in a proposed matching, we use the sum of the eight absolute values. The smaller this sum, the more “similar” we deem the matching to be.
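To make the criterion concrete, here is a minimal sketch in base R. The names x, y, and pair_dissimilarity are illustrative only; the code developed later in this report uses DescTools::YuenTTest, which should reduce to the ordinary paired t-test when no trimming is requested.

## x and y each hold the ten ratings for one candidate sentence
## from each speaker in the pair:
pair_dissimilarity <- function(x, y) {
  abs(t.test(x, y, paired = TRUE)$statistic)
}
## The criterion for a proposed matching of eight sentence-pairs is the
## sum of its eight pair_dissimilarity() values; smaller means more similar.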
To get a better idea of what we are attempting, consider the following matrix, which pertains to the pair of speakers with ID-number 1:
| | 1BD10 | 1BD11 | 1BD12 | 1BD7 | 1BD8 | 1BD9 | 1BE1 | 1BE2 | 1BE3 | 1BE4 | 1BE5 | 1BE6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1AD10 | 0.328 | 1.744 | 1.325 | 1.128 | 1.872 | 0.499 | 0.944 | 0.257 | 1.107 | 1.374 | 0.870 | 1.042 |
| 1AD11 | 0.609 | 1.034 | 0.187 | 0.051 | 0.112 | 0.370 | 0.047 | 0.808 | 0.169 | 0.073 | 0.022 | 0.075 |
| 1AD12 | 1.118 | 0.439 | 0.291 | 0.072 | 0.241 | 0.310 | 0.032 | 0.811 | 0.481 | 0.116 | 0.062 | 0.090 |
| 1AD7 | 1.204 | 0.264 | 0.601 | 0.457 | 0.518 | 1.168 | 0.897 | 1.285 | 0.277 | 0.445 | 0.437 | 0.713 |
| 1AD8 | 1.595 | 0.026 | 0.947 | 0.449 | 0.589 | 0.686 | 0.477 | 1.585 | 0.183 | 0.323 | 0.339 | 0.588 |
| 1AD9 | 2.041 | 0.645 | 0.987 | 0.924 | 1.088 | 1.213 | 1.101 | 1.650 | 0.815 | 0.787 | 1.057 | 1.333 |
| 1AE1 | 1.952 | 0.163 | 0.809 | 0.117 | 0.153 | 0.520 | 0.335 | 1.125 | 0.055 | 0.136 | 0.147 | 0.487 |
| 1AE2 | 1.141 | 0.208 | 0.662 | 0.368 | 0.472 | 1.011 | 0.683 | 1.299 | 0.246 | 0.530 | 0.399 | 0.546 |
| 1AE3 | 1.311 | 0.464 | 0.901 | 0.534 | 0.659 | 1.294 | 0.846 | 1.786 | 0.387 | 1.223 | 0.582 | 0.804 |
| 1AE4 | 0.077 | 1.176 | 0.838 | 0.732 | 1.359 | 0.294 | 0.799 | 0.093 | 1.313 | 0.951 | 0.697 | 0.808 |
| 1AE5 | 0.563 | 2.325 | 1.352 | 2.016 | 1.976 | 1.007 | 1.139 | 1.057 | 1.237 | 1.665 | 1.052 | 1.441 |
| 1AE6 | 0.409 | 0.641 | 0.156 | 0.320 | 0.800 | 0.016 | 0.527 | 0.414 | 0.687 | 0.491 | 0.379 | 0.480 |
The row-names of the above matrix are sentence-IDs for the Foreign speaker; the column names are IDs of the sentences spoken by the Native speaker. Each cell is the absolute value of the t-statistic from a paired t-test applied to the corresponding sentence-pair. Thus, for example, in a paired t-test for sentences 1AD10 and 1BD10 the absolute value of the t-statistic was about 0.328.
(As is the case for most of the speaker-pairs, the matrix is 12-by-12, since we had twelve sentences from each speaker. However, three of the speaker-pairs had sentences missing from one of the speakers; for these pairs the matrix was correspondingly smaller.)
In any event, our task is to find a set of eight rows, a set of eight columns, and a matching between them, so as to make the sum of the corresponding eight cells as small as possible.
For the matrix above, the best choice turns out to be as follows:
| | 1BD10 | 1BD11 | 1BD12 | 1BD7 | 1BD8 | 1BD9 | 1BE1 | 1BE2 | 1BE3 | 1BE4 | 1BE5 | 1BE6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1AD10 | | | | | | | | 0.257 | | | | |
| 1AD11 | | | | | | | | | | | 0.022 | |
| 1AD12 | | | | | | | 0.032 | | | | | |
| 1AD7 | | | | | | | | | | | | |
| 1AD8 | | 0.026 | | | | | | | | | | |
| 1AD9 | | | | | | | | | | | | |
| 1AE1 | | | | 0.117 | | | | | | | | |
| 1AE2 | | | | | | | | | 0.246 | | | |
| 1AE3 | | | | | | | | | | | | |
| 1AE4 | 0.077 | | | | | | | | | | | |
| 1AE5 | | | | | | | | | | | | |
| 1AE6 | | | | | | 0.016 | | | | | | |
In other words, the best matching is:
| foreign | native |
|---|---|
| 1AD10 | 1BE2 |
| 1AD11 | 1BE5 |
| 1AD12 | 1BE1 |
| 1AD8 | 1BD11 |
| 1AE1 | 1BD7 |
| 1AE2 | 1BE3 |
| 1AE4 | 1BD10 |
| 1AE6 | 1BD9 |
But there are about 9.9 billion possible matchings to compare! Trying them all one by one takes much too long on most personal computers.
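The count comes from choosing eight of the twelve Foreign sentences and then assigning each of them to a distinct Native sentence (an ordered selection of eight from twelve):

choose(12, 8) * prod(12:5)
## [1] 9879408000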
It turns out that this problem is an instance of the well-known Assignment Problem in applied combinatorics, and several efficient solutions are available. We choose to use the Hungarian Algorithm, as implemented in the R package clue, authored by Kurt Hornik.
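For readers unfamiliar with clue, here is a tiny illustration of the solver on a made-up cost matrix; the numbers have no connection to the corpus.

library(clue)
## a made-up 2-by-3 cost matrix; rows get assigned to distinct columns:
costs <- matrix(
  c(0.3, 0.1,
    0.2, 0.4,
    0.5, 0.6),
  nrow = 2
)
## for each row, solve_LSAP() reports the column it is assigned to,
## minimizing the total cost of the assignment:
solve_LSAP(costs, maximum = FALSE)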
In order to apply the Hungarian Algorithm we must begin with a matrix in which every row-item is to be matched with a distinct column-item. Since we want only eight matched pairs, we cannot work directly with the original 12-by-12 matrix: we need an 8-by-12 matrix instead. Therefore we create, for each possible set of eight sentences from the twelve sentences of the Foreign speaker, a matrix in which the rows are the members of the set. One such matrix is as follows:
| | 1BD10 | 1BD11 | 1BD12 | 1BD7 | 1BD8 | 1BD9 | 1BE1 | 1BE2 | 1BE3 | 1BE4 | 1BE5 | 1BE6 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1AD10 | 0.328 | 1.744 | 1.325 | 1.128 | 1.872 | 0.499 | 0.944 | 0.257 | 1.107 | 1.374 | 0.870 | 1.042 |
| 1AD11 | 0.609 | 1.034 | 0.187 | 0.051 | 0.112 | 0.370 | 0.047 | 0.808 | 0.169 | 0.073 | 0.022 | 0.075 |
| 1AD12 | 1.118 | 0.439 | 0.291 | 0.072 | 0.241 | 0.310 | 0.032 | 0.811 | 0.481 | 0.116 | 0.062 | 0.090 |
| 1AD7 | 1.204 | 0.264 | 0.601 | 0.457 | 0.518 | 1.168 | 0.897 | 1.285 | 0.277 | 0.445 | 0.437 | 0.713 |
| 1AD8 | 1.595 | 0.026 | 0.947 | 0.449 | 0.589 | 0.686 | 0.477 | 1.585 | 0.183 | 0.323 | 0.339 | 0.588 |
| 1AD9 | 2.041 | 0.645 | 0.987 | 0.924 | 1.088 | 1.213 | 1.101 | 1.650 | 0.815 | 0.787 | 1.057 | 1.333 |
| 1AE1 | 1.952 | 0.163 | 0.809 | 0.117 | 0.153 | 0.520 | 0.335 | 1.125 | 0.055 | 0.136 | 0.147 | 0.487 |
| 1AE2 | 1.141 | 0.208 | 0.662 | 0.368 | 0.472 | 1.011 | 0.683 | 1.299 | 0.246 | 0.530 | 0.399 | 0.546 |
(This matrix corresponds to choosing the first eight sentences from the Foreign speaker.) We apply the Hungarian Algorithm to the matrix, arriving at the following result:
| foreign | native | statistic |
|---|---|---|
| 1AD10 | 1BE2 | 0.2565733 |
| 1AD11 | 1BE5 | 0.0223997 |
| 1AD12 | 1BE1 | 0.0321802 |
| 1AD7 | 1BE3 | 0.2772978 |
| 1AD8 | 1BD11 | 0.0260588 |
| 1AD9 | 1BE4 | 0.7869108 |
| 1AE1 | 1BD8 | 0.1529699 |
| 1AE2 | 1BD7 | 0.3682881 |
Of course we must apply the Hungarian algorithm for every possible set of eight sentences from the Foreign speaker. That’s a total of 495 sets, but the Hungarian algorithm is so efficient that the whole process runs very quickly on a personal computer, as we shall soon see. We retain the set whose best matching to the Native speaker has the highest possible similarity, i.e., the smallest sum of absolute values of t-statistics, arriving at the solution we showed earlier.
We also need to repeat the process for all forty pairs of speakers.
## Utility function to measure the difference in difficulty ratings.
## Use the Yuen t-test that allows for working with trimmed means.
## Returns the test-statistic and the P-value.
profile_diff <- function(x, y, trim) {
  test <- DescTools::YuenTTest(x, y, trim = trim, paired = TRUE)
  list(
    statistic = abs(test$statistic),
    p_value = test$p.value
  )
}
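A hypothetical usage, with made-up ratings (ten per sentence, on the -100 to 100 scale) rather than ratings from the corpus:

x <- c(-30, -10, 0, 15, -25, 5, -40, 10, -5, 20)
y <- c(-20, -15, 10, 5, -30, 0, -35, 25, -10, 15)
profile_diff(x, y, trim = 0)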
## Utility function to extract, for a given speaker-pair, the matrix of
## ratings for each speaker (participants in rows, sentences in columns).
ratings_from_pair <- function(pair) {
  foreign <- pair %>%
    filter(speaker_accent == "Foreign") %>%
    select(ends_with("rating")) %>%
    t() %>%
    as.matrix()
  native <- pair %>%
    filter(speaker_accent == "Native") %>%
    select(ends_with("rating")) %>%
    t() %>%
    as.matrix()
  list(foreign = foreign, native = native)
}
## Utility function to make matrices of statistics and P-values
## for all pairs of sentences: one from Speaker A (Foreign), the other
## from Speaker B (Native).
make_diff_matrix <- function(pair, trim) {
  ratings <- ratings_from_pair(pair)
  a <- ratings$foreign
  b <- ratings$native
  mat_statistics <- matrix(0, nrow = ncol(a), ncol = ncol(b))
  mat_pvals <- matrix(0, nrow = ncol(a), ncol = ncol(b))
  for (i in 1:ncol(a)) {
    for (j in 1:ncol(b)) {
      result <- profile_diff(x = a[, i], y = b[, j], trim = trim)
      mat_statistics[i, j] <- result$statistic
      mat_pvals[i, j] <- result$p_value
    }
  }
  rownames(mat_statistics) <- pair %>%
    filter(speaker_accent == "Foreign") %>%
    pull(sentence_id)
  colnames(mat_statistics) <- pair %>%
    filter(speaker_accent == "Native") %>%
    pull(sentence_id)
  rownames(mat_pvals) <- pair %>%
    filter(speaker_accent == "Foreign") %>%
    pull(sentence_id)
  colnames(mat_pvals) <- pair %>%
    filter(speaker_accent == "Native") %>%
    pull(sentence_id)
  list(
    mat_statistics = mat_statistics,
    mat_pvals = mat_pvals
  )
}
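For example, the matrix of statistics displayed earlier for speaker-pair 1 should be reproducible, up to rounding, from the sentences data frame built above:

pair_1 <- sentences %>%
  filter(pair_id == 1)
round(make_diff_matrix(pair_1, trim = 0)$mat_statistics, 3)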
## Utility function. Given a pair of speakers, find the best
## matching of sentences. Uses brute force to go through
## all possible subsets of cardinality "select" from Speaker A,
## applying the Hungarian Algorithm to each A-subset
## to find the best matching set of sentences from Speaker B.
## Keeps track of the best matching.
## Returns a data frame of results.
get_best <- function(pair, select, trim) {
  ## get sentence ids for each speaker:
  a_sentence_ids <- pair %>%
    filter(speaker_accent == "Foreign") %>%
    pull(sentence_id)
  b_sentence_ids <- pair %>%
    filter(speaker_accent == "Native") %>%
    pull(sentence_id)
  ## compute matrices of difference-statistics and P-values,
  ## one cell for each pair of sentences:
  diff_mats <- make_diff_matrix(pair, trim = trim)
  ## make all possible subsets of cardinality select from
  ## the sentences of the A-speaker:
  size <- nrow(diff_mats[[1]])
  a_subsets_numeric <- utils::combn(x = size, m = select)
  ## prepare to loop through all A-speaker subsets:
  n <- ncol(a_subsets_numeric)
  a_best <- ""
  b_best <- ""
  statistic <- 0
  pval <- 0
  diff_best <- Inf
  ## begin looping
  for (i in 1:n) {
    ## extract sentence ids from the numbers:
    subset_a <- a_sentence_ids[a_subsets_numeric[, i]]
    ## extract the relevant portion of the difference matrices:
    dms <- diff_mats$mat_statistics[subset_a, ]
    ## Now comes the Hungarian Algorithm ...
    solution <- clue::solve_LSAP(x = dms, maximum = FALSE)
    ## extract B-speaker sentence-ids from the solution:
    subset_b <- b_sentence_ids[solution]
    ## get statistics and P-values for each pair in the
    ## best matching:
    sentence_pair_statistics <- numeric(select)
    sentence_pair_pvals <- numeric(select)
    for (j in 1:select) {
      a_location <- subset_a[j]
      b_location <- subset_b[j]
      sentence_pair_statistics[j] <- diff_mats$mat_statistics[a_location, b_location]
      sentence_pair_pvals[j] <- diff_mats$mat_pvals[a_location, b_location]
    }
    ## check to see whether this best match is better than the best match
    ## from previously-analysed A-subsets:
    sum_statistics <- sum(sentence_pair_statistics)
    if (sum_statistics < diff_best) {
      diff_best <- sum_statistics
      a_best <- subset_a
      b_best <- subset_b
      statistic <- sentence_pair_statistics
      pval <- sentence_pair_pvals
    }
  }
  ## return data frame of results for the best possible matching:
  data.frame(
    foreign = a_best,
    native = b_best,
    statistic = statistic,
    pval = pval)
}
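Applied to speaker-pair 1 with select = 8 and trim = 0, get_best() should recover the optimal matching displayed earlier:

get_best(
  pair = sentences %>% filter(pair_id == 1),
  select = 8,
  trim = 0
)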
## match_sentences ----
## This is the function you'll actually use.
## data is the original data frame
## select = desired number of sentence-pairs
##
## trim is there in case researchers desire to omit
## very high or low ratings. (Setting trim = 0.1 would knock out
## the highest and the lowest of the ten differences in ratings.)
##
## Setting trace to TRUE results in a progress report to the console
## (not needed for the current small study)
##
## Result is a list of data frames, one for each speaker-pair,
## saying which sentence goes to which, and reporting the P-values
## of the Yuen T-Test. This allows the user to flag pairs where the
## ratings are "too different".
match_sentences <- function(data, select, trim = 0, trace = FALSE) {
  pair_ids <- sort(unique(data$pair_id))
  pairs <- length(unique(data$pair_id))
  lst <- vector(mode = "list", length = pairs)
  for (i in 1:pairs) {
    if (trace) {
      cat("Working on speaker pair with id", pair_ids[i], "...\n")
    }
    pair <- data %>%
      filter(pair_id == pair_ids[i])
    results <- get_best(pair, select, trim)
    lst[[i]] <- results
  }
  names(lst) <- pair_ids
  lst
}
Now we run the algorithm, using `system.time()` to record how long it takes:
system.time(
  results <- match_sentences(
    data = sentences,
    select = 8,
    trim = 0,   ## the default, actually
    trace = FALSE
  )
)
user system elapsed
7.034 0.075 7.164
Thanks to the Hungarian Algorithm the routine finishes in a satisfyingly small amount of time.
`results` is a list, each element of which is a data frame showing how to match the sentences for a given pair of speakers. We can view the matching for the speaker-pair with id 17 as follows:
results[["17"]]
| foreign | native | statistic | pval |
|---|---|---|---|
| 17AD10 | 17BD8 | 0.1290179 | 0.9001807 |
| 17AD12 | 17BD12 | 0.0017601 | 0.9986340 |
| 17AD9 | 17BE4 | 0.0989698 | 0.9233316 |
| 17AE1 | 17BE1 | 0.2760973 | 0.7887085 |
| 17AE2 | 17BD10 | 0.1696885 | 0.8690081 |
| 17AE3 | 17BD11 | 0.1310586 | 0.8986118 |
| 17AE4 | 17BD9 | 0.0401374 | 0.9688599 |
| 17AE6 | 17BE3 | 0.1955255 | 0.8493247 |
Note that we have retained the P-values for each paired t-test. We can use them as a check on the similarity of sentences. If we find sentence pairs for which the P-values are very small then we may choose not to include them in the Corpus.
Let’s gather all of the P-values:
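One way to do so is to pull the pval column out of each data frame in results (a minimal sketch):

pvals <- unlist(lapply(results, function(df) df$pval))
length(pvals)  ## 8 pairs for each of the 40 speaker-pairs: 320 values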
The smallest P-value is:
min(pvals)
[1] 0.4440327
Most of the P-values are quite high, as we see from the following density plot:
ggplot(data = NULL, aes(x = pvals)) +
  geom_density(fill = "burlywood") +
  geom_rug() +
  labs(x = "P-values")
Figure 1: t-test P-values for all 320 sentence pairs
It appears that the matching routine has succeeded in identifying sentence-pairs that are similar—at least with respect to mean difficulty!
We worked with 955 sentences, each of which was rated by ten study participants. Here is a density plot of all 9550 individual difficulty-ratings:
Figure 2: Density plot of all 9550 difficulty-ratings recorded in the study.
Next we see a violin plot of mean difficulty-ratings. Each dot is a single sentence; its vertical height is the mean of the ratings for the ten participants who rated it. We have separated the sentences into the 315 that were not selected as members of an optimal matching and the 640 sentences that were selected.
Figure 3: Violin plot of mean difficulty ratings, by status of sentence (selected for Corpus vs. not selected).
The sentences that were not selected vary a bit more in mean difficulty than sentences that were selected. This makes sense, for when a sentence is of unusually high or low difficulty there is less likelihood of finding—among the sentences of the other speaker in the speaker-pair—a sentence similar to it in difficulty.
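The violin plot (and the plots that follow) can be built from a per-sentence summary. Here is a sketch of one way to construct it from the results list and the sentences data frame; the names selected_ids and sentence_means are illustrative, and this is not necessarily the code that produced the figures.

## sentence ids that appear in some optimal matching:
selected_ids <- unlist(
  lapply(results, function(df) c(as.character(df$foreign), as.character(df$native)))
)
## per-sentence mean rating, plus a flag for selection status:
sentence_means <-
  sentences %>%
  mutate(
    mean_rating = rowMeans(select(., ends_with("rating"))),
    selected = if_else(sentence_id %in% selected_ids, "selected", "not selected")
  )
## violin plot of mean difficulty by selection status:
ggplot(sentence_means, aes(x = selected, y = mean_rating)) +
  geom_violin(fill = "burlywood") +
  geom_jitter(width = 0.1, alpha = 0.4) +
  labs(x = NULL, y = "mean difficulty rating")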
The following box plots compare the mean difficulty-ratings of sentences spoken by foreign and native speakers, including those sentences that were not selected as members of an optimal pairing:
Figure 4: Boxplots of mean ratings for sentences by foreign and native speakers.
There appears to be essentially no difference in the centers of the two distributions; see also the following table of summary statistics for the per-sentence mean ratings:
| speaker_accent | mean | median | n |
|---|---|---|---|
| Foreign | -14.36572 | -14.6020 | 479 |
| Native | -15.00132 | -15.0975 | 476 |