We have a known problem of OCR mistakes, along with section numbers, page numbers, and the occasional stray character. If we throw away every n-gram that contains a character outside the range a-z (plus the space), what proportion of the n-grams is left? That proportion is a rough proxy for how good the OCR is.

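The helper that does the filtering is not shown in this post. A minimal sketch, assuming each element of codes_grams is a character vector of n-grams, might look like this:

filter_unreasonable_ngrams <- function(ngrams) {
  # Keep only n-grams made up entirely of lowercase letters and spaces
  ngrams[grepl("^[a-z ]+$", ngrams)]
}
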
codes_grams %>% lapply(function(x) {
  # Proportion of n-grams in each code that survive the filter
  (x %>% filter_unreasonable_ngrams %>% length) / (x %>% length)
})
## $CA1851.txt
## [1] 0.7979
## 
## $MO1849.txt
## [1] 0.7535
## 
## $NC1868.txt
## [1] 0.8104
## 
## $NY1848.txt
## [1] 0.6918
## 
## $NY1850c.txt
## [1] 0.7828
## 
## $UT1870.txt
## [1] 0.7278

Using Tesseract instead of other OCR solutions, we are keeping about 80% of the NY 1850 code, compared to roughly 70% with our older method. Since that number is the denominator when we calculate the proportion of the code that has a match, the larger denominator explains why that proportion grew less than we might have expected.
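
To make the denominator effect concrete, here is a toy calculation with made-up counts, purely for illustration and not measurements from our corpus: even if the number of matched n-grams goes up, dividing by a larger pool of surviving n-grams damps the growth in the proportion.

old_kept    <- 700   # n-grams surviving the filter under the older OCR
old_matched <- 350   # of those, the ones with a match: 350 / 700 = 0.50
new_kept    <- 800   # Tesseract keeps more n-grams...
new_matched <- 420   # ...so even with more matches, 420 / 800 = 0.525
new_matched / new_kept - old_matched / old_kept  # the proportion rises only modestly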