The OCR text has known problems: recognition mistakes, plus section numbers, page numbers, and the occasional stray character. If we throw away any n-gram that includes a character outside a-z (plus space), what proportion of the n-grams is left? This is a rough proxy for how good the OCR is.

# For each code, compute the share of n-grams that survive the filter
codes_grams %>% lapply(function(x) {
  (x %>% filter_unreasonable_ngrams %>% length) / (x %>% length)
})
## $CA1851.txt
## [1] 0.7979
##
## $MO1849.txt
## [1] 0.7535
##
## $NC1868.txt
## [1] 0.8104
##
## $NY1848.txt
## [1] 0.6918
##
## $NY1850c.txt
## [1] 0.7828
##
## $UT1870.txt
## [1] 0.7278
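
The definition of `filter_unreasonable_ngrams` is not shown here. As a rough illustration of the rule described above, here is a minimal sketch that assumes the n-grams are held in a character vector, one n-gram per element. The function name `keep_clean_ngrams` and the sample n-grams are made up for this example and are not taken from the project code.

```r
# Hypothetical stand-in for filter_unreasonable_ngrams(); the real function
# may differ. Assumes n-grams are a character vector, one n-gram per element.
keep_clean_ngrams <- function(ngrams) {
  # Keep only n-grams made up entirely of lowercase a-z and spaces.
  # perl = TRUE makes the a-z range ASCII-only regardless of locale.
  ngrams[grepl("^[a-z ]+$", ngrams, perl = TRUE)]
}

# Invented sample: two clean n-grams, one with digits, one with a stray character
sample_ngrams <- c("the court shall", "sec 12 the", "of the st@te", "in such action")

clean <- keep_clean_ngrams(sample_ngrams)

# Share of n-grams that survive the filter, as in the calculation above
kept_share <- length(clean) / length(sample_ngrams)
kept_share
```

A whitelist like this drops n-grams that touch section or page numbers as well as OCR noise, which is why the kept share is only a rough proxy for OCR quality.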

Using Tesseract instead of other OCR solutions, we’re keeping about 78% of the n-grams in the NY 1850 code, instead of roughly 70% using our older method. Since the number of kept n-grams is the denominator when we calculate the proportion of the code that has a match, this explains why that proportion grew less than we might have expected.
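
To make the denominator effect concrete, here is a small worked example. All of the counts are invented for illustration and are not measurements from these codes. The only assumption is that the match proportion is the number of matched n-grams divided by the number of kept n-grams.

```r
# Illustrative numbers only: none of these counts come from the actual corpus.
total_ngrams <- 1000

# Older OCR: roughly 70% of n-grams survive the filter
old_kept    <- 0.70 * total_ngrams
old_matched <- 350                 # hypothetical count of kept n-grams with a match
old_prop    <- old_matched / old_kept

# Newer OCR: roughly 78% of n-grams survive the filter
new_kept    <- 0.78 * total_ngrams
new_matched <- 420                 # hypothetical: more matches found in cleaner text
new_prop    <- new_matched / new_kept

# Relative growth in raw matches versus growth in the match proportion
match_growth <- new_matched / old_matched
prop_growth  <- new_prop / old_prop

c(match_growth = match_growth, prop_growth = prop_growth)
```

With these made-up counts the number of matches grows faster than the match proportion, because the denominator grows at the same time.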