Some preliminary experimentation with the OCR files from the Field Code and with various text analysis packages.

library(magrittr)
library(RWeka)

First we need to load the codes into a usable format and clean them up. Load each code into a list as a single lowercase string containing just its words, with the filenames as the names of the list elements.

codes_dir <- "text"
files <- dir(codes_dir, "*.txt")
raw <- file.path(codes_dir, files) %>%
  lapply(., scan, "character", sep = "\n")
names(raw) <- files
codes_texts <- lapply(raw, paste, collapse = " ") %>%
  lapply(., tolower) %>%
  lapply(., WordTokenizer) %>%
  lapply(., paste, collapse = " ")
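
To see what that pipeline does, here it is applied to a made-up fragment of OCR text (the sample string is invented, not from the codes):

"The Plaintiff, MAY file;  an answer" %>%
  tolower() %>%
  WordTokenizer() %>%
  paste(collapse = " ")
# Should yield "the plaintiff may file an answer": the tokenizer drops
# punctuation, and everything has been lowercased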

A helper function to create n-grams using the NGramTokenizer() function provided by RWeka. Note that the documentation says that the maximum for an n-gram is 3, but apparently that is not the case, as the check below shows.

ngrammify <- function(data, n) {
  NGramTokenizer(data, Weka_control(min = n, max = n))
}
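
A quick sanity check on a toy sentence confirms that 5-grams work despite what the documentation says:

ngrammify("the quick brown fox jumps over the lazy dog", 5)
# Should return five 5-grams, starting with "the quick brown fox jumps"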

Create the n-grams, then create a list of all the unique n-grams. We’re going to create just 5-grams, on the suspicion that five words is a good length.

codes_grams <- lapply(codes_texts, ngrammify, 5)
every_grams <- codes_grams %>% unlist() %>% unique()

How many unique n-grams are there in total, and how many n-grams (counting repeats) are there in each code?

length_all_ngrams <- every_grams %>% length()
length_codes_ngrams <- codes_grams %>% lapply(., length)

How many fewer unique n-grams are there than the total count of n-grams in all the codes? I.e., roughly how many matches are there? (Note that this difference also counts n-grams repeated within a single code.)

length_codes_ngrams %>% unlist %>% sum - length_all_ngrams
## [1] 364405

What proportion of the total do the n-grams in each code comprise?

length_codes_ngrams %>% lapply(., "/", length_all_ngrams)
## $AZ1865.txt
## [1] 0.07909
## 
## $CA1850.txt
## [1] 0.03175
## 
## $CA1851.txt
## [1] 0.08046
## 
## $CT1879.txt
## [1] 0.004903
## 
## $FL1870.txt
## [1] 0.08023
## 
## $GA1860.txt
## [1] 0.1272
## 
## $KY1851.txt
## [1] 0.07609
## 
## $MI1853.txt
## [1] 0.04872
## 
## $MO1849.txt
## [1] 0.03202
## 
## $MO1856.txt
## [1] 0.04438
## 
## $NC1868.txt
## [1] 0.09906
## 
## $NV1861.txt
## [1] 0.09538
## 
## $NV1869.txt
## [1] 0.08814
## 
## $NY1848.txt
## [1] 0.04965
## 
## $NY1849.txt
## [1] 0.07038
## 
## $NY1850.txt
## [1] 0.2833
## 
## $OH1853.txt
## [1] 0.07102
## 
## $OH1853extended.txt
## [1] 0.1135
## 
## $UT1853.txt
## [1] 0.007082
## 
## $UT1870.txt
## [1] 0.08966

Up until now we have been comparing codes with the total universe of codes. Now let’s compare one code to another. The NC 1868 code is probably borrowed from the NY 1850 code. How many n-grams in the NC code are also in the NY code? What is the ratio of matches to possible matches (i.e., to the unique list of n-grams in both sets)? What do these matches look like? (Save them to disk for Kellen to look at.)

ny_to_nc_matches <- intersect(codes_grams$NC1868.txt, codes_grams$NY1850.txt) 
ny_to_nc_possible <- unique(c(codes_grams$NC1868.txt, codes_grams$NY1850.txt))

ny_to_nc_matches %>% length
## [1] 9729
ny_to_nc_matches %>% length / ny_to_nc_possible %>% length
## [1] 0.04643
ny_to_nc_matches %>% length / length_codes_ngrams$NC1868.txt
## [1] 0.1542
sample(ny_to_nc_matches, 10)
##  [1] "and which is in the"              "residue of the attached property"
##  [3] "between the parties the cause"    "to whom the service is"          
##  [5] "to appoint and remove guardians"  "after notice thereof relieve a"  
##  [7] "or to become due or"              "defendant in respect to whom"    
##  [9] "answer has been received the"     "relating to the merits of"
write.csv(ny_to_nc_matches, file = "out/ny_to_nc.matches.csv")

What if we do the same thing for New York to Michigan?

ny_to_mi_matches <- intersect(codes_grams$MI1853.txt, codes_grams$NY1850.txt) 
ny_to_mi_possible <- unique(c(codes_grams$MI1853.txt, codes_grams$NY1850.txt))

ny_to_mi_matches %>% length
## [1] 635
ny_to_mi_matches %>% length / ny_to_mi_possible %>% length
ny_to_mi_matches %>% length / length_codes_ngrams$MI1853.txt
## [1] 0.02046
sample(ny_to_mi_matches, 10)
##  [1] "on the ground of adultery"   "discretion of the court and"
##  [3] "from his place of residence" "at the same time or"        
##  [5] "the com- mencement of the"   "be served on the attorney"  
##  [7] "the records of the court"    "if the parties do not"      
##  [9] "of the age of fourteen"      "when a defendant has been"
write.csv(ny_to_mi_matches, file = "out/ny_to_mi.matches.csv")

That kind of analysis might be generally useful, so let’s turn it into a function. First, a helper function that removes unusable n-grams. A lot of the n-grams contain non-word characters, numbers, or gibberish from the OCR. We want to know the proportion of matches that could reasonably be matched, so we’ll exclude n-grams containing any of those characters.

#' Remove unreasonable n-grams containing characters other than letters and spaces
#' @param ngrams A character vector of n-grams
#' @return The n-grams that contain only lowercase letters and spaces
filter_unreasonable_ngrams <- function(ngrams) {
  require(stringr)
  ngrams[!str_detect(ngrams, "[^a-z ]")]
}
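
To illustrate on a few invented n-grams:

filter_unreasonable_ngrams(c("of the court and the",
                             "sec- tion 2 of the",
                             "provisions of this act 18"))
# Should keep only "of the court and the"; the other two contain a hyphen
# or digits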

Now the main function for comparing two codes:

#' Compare two codes using their list of n-grams.
#' @param orig_code A vector of n-grams representing a code.
#' @param dest_code A vector of n-grams representing a code.
#' @return A list containing data that might help to identify whether codes
#'   match up.
compare_codes_by_shared_ngrams <- function(orig_code, dest_code) {
  require(magrittr)
  
  matches <- intersect(orig_code, dest_code)
  shared_ngrams <- unique(c(orig_code, dest_code))
  ratio_matches_to_possible <- length(matches) / length(filter_unreasonable_ngrams(shared_ngrams))
  ratio_matches_to_destination <- length(matches) / length(filter_unreasonable_ngrams(dest_code))
  list(matches = matches,
       shared_ngrams = shared_ngrams,
       ratio_matches_to_possible = ratio_matches_to_possible,
       ratio_matches_to_destination = ratio_matches_to_destination)
}
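
As a quick sanity check, here is the function applied to two tiny invented vectors of n-grams:

toy_a <- c("the court may order", "service of the summons", "sec- tion 2 of")
toy_b <- c("the court may order", "answer of the defendant")
str(compare_codes_by_shared_ngrams(toy_a, toy_b))
# Expect a single match ("the court may order") and a ratio_matches_to_possible
# of 1/3, since "sec- tion 2 of" is excluded from the denominator by
# filter_unreasonable_ngrams()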

It sure would be nice to be able to take an n-gram that we know is a match and find its context in the two codes. I’m going to assume that you’ve found an interesting match elsewhere, probably using the compare_codes_by_shared_ngrams() function, and want to investigate it further. This function takes the n-gram as a string and the two codes to look for it in. (Technically, I suppose, we’re just implementing a generic search function.)

#' Find an n-gram in its context (KWIC) in two codes
#' @param ngram The n-gram to search for, as a string.
#' @param code_1 The text of the first code.
#' @param code_2 The text of the second code.
#' @param disp_chars The number of characters of context to show on each side.
#' @return A data frame with the n-gram in context in both codes.
kwic_ngram_match <- function(ngram, code_1, code_2, disp_chars = 100) {
  require(stringr)
  # The fixed() function from stringr indicates that we are using literal
  # characters rather than a regex.
  match_code_1 <- str_locate_all(code_1, fixed(ngram))[[1]]
  match_code_2 <- str_locate_all(code_2, fixed(ngram))[[1]]
  
  # Widen each match by disp_chars on both sides, clamping the left edge at 1,
  # since str_sub() treats negative positions as counting from the end
  match_code_1[, 1] <- pmax(1, match_code_1[, 1] - disp_chars)
  match_code_1[, 2] <- match_code_1[, 2] + disp_chars
  match_code_2[, 1] <- pmax(1, match_code_2[, 1] - disp_chars)
  match_code_2[, 2] <- match_code_2[, 2] + disp_chars
  
  extract_code_1 <- str_sub(code_1, match_code_1[, 1], match_code_1[, 2])
  extract_code_2 <- str_sub(code_2, match_code_2[, 1], match_code_2[, 2])
  
  data.frame(ngram = ngram,
             orig_code = extract_code_1,
             dest_code = extract_code_2)
}
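
For example, we could look up the context of “residue of the attached property”, one of the NY/NC matches sampled above. (Note that the rows of the resulting data frame only line up if the n-gram occurs the same number of times in both codes.)

kwic_ngram_match("residue of the attached property",
                 codes_texts$NY1850.txt,
                 codes_texts$NC1868.txt)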

Now that we’ve written a KWIC function, we can apply it to the comparison of the NY and NC sample codes.

First we get a comparison of the two codes:

cf_ny_nc <- compare_codes_by_shared_ngrams(codes_grams$NY1850.txt,
                                           codes_grams$NC1868.txt)
## Loading required package: stringr
str(cf_ny_nc)
## List of 4
##  $ matches                     : chr [1:9729] "the code of civil pro-" "code of civil pro- cedure" "jurisdiction of the courts of" "the superior court of the" ...
##  $ shared_ngrams               : chr [1:209545] "table of contents preliminary provisions" "of contents preliminary provisions -" "contents preliminary provisions - -" "preliminary provisions - - -" ...
##  $ ratio_matches_to_possible   : num 0.0737
##  $ ratio_matches_to_destination: num 0.195

Now let’s get a random sample of matches:

sample_matches <- sample(cf_ny_nc$matches, 20)
sample_matches
##  [1] "the county where the judgment"      
##  [2] "be found according to the"          
##  [3] "in an action for the"               
##  [4] "to form a belief 2"                 
##  [5] "upon an appeal from a"              
##  [6] "the damages which he shall"         
##  [7] "avoid the service of a"             
##  [8] "there is no person in"              
##  [9] "acceptance be not given the"        
## [10] "heard at the same term"             
## [11] "judgment in the same manner"        
## [12] "be arrested in a civil"             
## [13] "him a demand for the"               
## [14] "their use an action for"            
## [15] "a defect of parties plaintiff"      
## [16] "to the sheriff who shall"           
## [17] "adultery must be tried by"          
## [18] "collect the sum deposited as"       
## [19] "for the defendant upon an"          
## [20] "interest in the controversy adverse"

Now let’s apply our KWIC function to those sample matches and get a comparison of the two codes in tabular format. (If this table has more rows than there are sample matches, it is because a match appears more than once in a code.)

# TEMPORARILY REMOVED
# comparison_of_matches <- sample_matches %>% lapply(., kwic_ngram_match,
#                           codes_texts$NY1850.txt,
#                           codes_texts$NC1868.txt,
#                           disp_chars = 300) %>%
#                             do.call(rbind.data.frame, .)
# comparison_of_matches
# write.csv(comparison_of_matches, file = "out/matches_ny_nc_kwic.csv")

Now let’s create a density map of where the matching n-grams fall within a code. It is interesting that the beginning and end of the code have fewer matches.

density_plot_color <- rgb(0.5, 0.5, 0.5, 0.1)
nc_matches <- codes_grams$NC1868.txt %in% cf_ny_nc$matches  
nc_matches %>% 
  plot(type = "h",
       col = density_plot_color,
       ann = FALSE,
       yaxt = "n",
       xaxt = "n")
title("Places in NC 1868 Code with Matches in NY 1850 Code",
      xlab = "Position in code",
      ylab = "")

(Figure: places in the NC 1868 code with matches in the NY 1850 code.)
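
The raw spikes can be hard to read, so as an alternative (a sketch only; the window size is an arbitrary choice, not something tuned to the data) we could smooth the 0/1 match vector with a moving average:

window <- 500
nc_density <- stats::filter(as.numeric(nc_matches), rep(1 / window, window))
plot(as.numeric(nc_density), type = "l", ann = FALSE, yaxt = "n")
title("Smoothed density of NY 1850 matches in the NC 1868 code",
      xlab = "Position in code",
      ylab = "")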

Let’s try this with the California 1851 code.

cf_ny_ca <- compare_codes_by_shared_ngrams(codes_grams$NY1850.txt,
                                           codes_grams$CA1851.txt)
str(cf_ny_ca)
## List of 4
##  $ matches                     : chr [1:13313] "the superior court of the" "superior court of the city" "court of the city of" "by one or more of" ...
##  $ shared_ngrams               : chr [1:193662] "table of contents preliminary provisions" "of contents preliminary provisions -" "contents preliminary provisions - -" "preliminary provisions - - -" ...
##  $ ratio_matches_to_possible   : num 0.111
##  $ ratio_matches_to_destination: num 0.315

And the plot comparing California and New York:

density_plot_color <- rgb(0.5, 0.5, 0.5, 0.1)
ca_matches <- codes_grams$CA1851.txt %in% cf_ny_ca$matches  
ca_matches %>% 
  plot(type = "h",
       col = density_plot_color,
       ann = FALSE,
       yaxt = "n",
       xaxt = "n")
title("Places in CA 1851 Code with Matches in NY 1850 Code",
      xlab = "Position in code",
      ylab = "")

(Figure: places in the CA 1851 code with matches in the NY 1850 code.)

We can do the same thing with the MI 1853 code and find that it’s not a very good match to the NY 1850 code. The matches that are there are probably just noise or common legal terms and English phrases.

cf_ny_mi <- compare_codes_by_shared_ngrams(codes_grams$NY1850.txt,
                                           codes_grams$MI1853.txt)
mi_matches <- codes_grams$MI1853.txt %in% cf_ny_mi$matches  
mi_matches %>% 
  plot(type = "h",
       col = density_plot_color,
       ann = FALSE,
       yaxt = "n",
       xaxt = "n")
title("Places in MI 1853 Code with Matches in NY 1850 Code",
      xlab = "Position in code",
      ylab = "")

(Figure: places in the MI 1853 code with matches in the NY 1850 code.)
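
We have now written essentially the same plotting code three times, so in the same spirit as before it could be wrapped in a helper function. This is only a sketch; the function name is my own.

#' Plot where the matches from a comparison fall within a code
#' @param code_grams The n-grams of the destination code, in document order.
#' @param comparison A list returned by compare_codes_by_shared_ngrams().
#' @param main A title for the plot.
plot_match_density <- function(code_grams, comparison, main) {
  matches <- code_grams %in% comparison$matches
  plot(matches, type = "h", col = rgb(0.5, 0.5, 0.5, 0.1),
       ann = FALSE, yaxt = "n", xaxt = "n")
  title(main, xlab = "Position in code", ylab = "")
}

# For example:
# plot_match_density(codes_grams$MI1853.txt, cf_ny_mi,
#                    "Places in MI 1853 Code with Matches in NY 1850 Code")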