Exercise 9

Complete all Exercises, and submit answers to VtopBeta

Introduction

Locality Sensitive Hashing is a method of performing probabilistic dimension reduction of high-dimensional data.

Dataset

A minhash function converts tokenized text into a set of hash integers, then selects the minimum value. This is the equivalent of randomly selecting a token.

The function then does the same thing repeatedly with different hashing functions, in effect selecting n random shingles. The additional hashing functions come from a bitwise XOR with random integers. That is why the minhash_generator() accepts a seed, so that we can re-create the same minhash function again. In other words, a minhash function converts a set of tokens of any length into n randomly selected and hashed tokens.

library(textreuse)
minhash <- minhash_generator(n = 240, seed = 3552)
head(minhash(c("turn tokens into", "tokens into hashes", "into hashes fast")))

## [1] -1067902788  -349477925 -1306490031  -926753052 -1222296305 -1443723653

Minhash

Now when we load our text corpus, we will tokenize our texts as usual, but we will use our generated minhash() function to compute the hashes. We specify that we want to create a minhash signature by passing our minhash function to the minhash_func = parameter.

dir <- system.file("extdata/ats", package = "textreuse")
corpus <- TextReuseCorpus(dir = dir, tokenizer = tokenize_ngrams, n = 5,
                          minhash_func = minhash, keep_tokens = TRUE,
                          progress = FALSE)

# verify minhash exists in the corpus
corpus

## TextReuseCorpus
## Number of documents: 8 
## hash_func : hash_string 
## minhash_func : minhash 
## tokenizer : tokenize_ngrams

head(minhashes(corpus[[1]])) # minhash signature of the document

## [1] -2147421589 -2147475954 -2147470602 -2147373716 -2147464099 -2147477501

length(minhashes(corpus[[1]]))

## [1] 240

Locality Sensitive Hashing

Now all our documents are represented by n = 240 randomly selected and hashed shingles. Comparing those shingles should be the equivalent of finding the Jaccard similarity of the two documents. However, we still have the problem of pairwise comparison.

The locality-sensitive hashing algorithm, provided in this package by the lsh() function, solves this problem. LSH breaks the minhashes into a series of bands comprised of rows. For example, 200 minhashes might broken into 50 bands of 4 rows each. Each band is hashed to a bucket. If two documents have the exact same minhashes in a band, they will be hashed to the same bucket, and so will be considered candidate pairs. Each pair of documents has as many chances to be considered a candidate as their are bands, and the fewer rows there are in each band, the more likely it is that each document will match another. We can calculate the likely threshold based on the number of minhashes and bands that we are using.

lsh_threshold(h = 200, b = 50)

## [1] 0.3760603

lsh_threshold(h = 240, b = 80)

## [1] 0.2320794

lsh_probability(h = 240, b = 80, s = 0.25) #similarity of over 0.25

## [1] 0.7163087

lsh_probability(h = 240, b =  80, s = 0.75) # similarity of over 0.75

## [1] 1

These numbers seem reasonable for our purposes, so we will set the number of minhashes at 240 and the number of bands at 80.

Now we can use the lsh() function to calculate the locality-sensitive hashes for our documents.

buckets <- lsh(corpus, bands = 80, progress = FALSE)
buckets

## # A tibble: 640 x 2
##    doc                buckets                         
##    <chr>              <chr>                           
##  1 calltounconv00baxt 47c0a03caefcafe1749399b967583925
##  2 calltounconv00baxt 7e0f2ae5f606bbae2085966065505057
##  3 calltounconv00baxt 4ce11e168463662547df79f0c1c52ba3
##  4 calltounconv00baxt 1197924ad432897b8f89643c35077c66
##  5 calltounconv00baxt b4fedffa4a13bc7b36c91eefb8c3c452
##  6 calltounconv00baxt 43415e07b892397e336a382c386fcad6
##  7 calltounconv00baxt 693ab22baa328a19b795278aa5e4abc2
##  8 calltounconv00baxt 339a856bb257af23432f0bd64283a021
##  9 calltounconv00baxt 3984470289474583507ee2ad1cc26118
## 10 calltounconv00baxt 3ccdda855b41b7d695169a32ec4ceded
## # ... with 630 more rows

We can extract the potential matches from the cache using lsh_query() or lsh_candidates(). The first function returns matches for only one document, specified by its ID; the second functions returns all potential pairs of matches.

baxter_matches <- lsh_query(buckets, "calltounconv00baxt") # ID of the corpus
baxter_matches

## # A tibble: 1 x 2
##   a                  b                     
##   <chr>              <chr>                 
## 1 calltounconv00baxt lifeofrevrichard00baxt

candidates <- lsh_candidates(buckets)
candidates

## # A tibble: 3 x 3
##   a                     b                      score
##   <chr>                 <chr>                  <dbl>
## 1 calltounconv00baxt    lifeofrevrichard00baxt    NA
## 2 practicalthought00nev thoughtsonpopery00nevi    NA
## 3 remember00palm        remembermeorholy00palm    NA

Inference

Notice that LSH has identified the same three pairs of documents as potential matches that we found with pairwise comparisons, but did so much faster. But we do not have similarity scores; we only know that these documents are likely to have Jaccard similarity scores above the 0.232 threshold.

Now we can use lsh_compare() to apply a similarity function to the candidate pairs of documents. Note that we only have to do 3 comparisons for all the candidates, instead of 28 pairs when comparing all 8 documents in the corpus pairwise.

lsh_compare(candidates, corpus, jaccard_similarity, progress = FALSE)

## # A tibble: 3 x 3
##   a                     b                      score
##   <chr>                 <chr>                  <dbl>
## 1 calltounconv00baxt    lifeofrevrichard00baxt 0.281
## 2 practicalthought00nev thoughtsonpopery00nevi 0.463
## 3 remember00palm        remembermeorholy00palm 0.701