Let’s do a quick and dirty test to see if we can get a comparison matrix.

library(magrittr)
library(RWeka)
library(stringr)

Load the codes.

Keep a random sample of three plus NY 1850 to keep computation time low for the test.

codes_sample <- c(sample(codes_texts, 3), NY1850.txt = codes_texts[["NY1850.txt"]])

Function for n-grams:

ngrammify <- function(data, n) { 
  NGramTokenizer(data, Weka_control(min = n, max = n))
  }

Create the n-grams:

codes_grams <- lapply(codes_sample, ngrammify, 5)

Function to remove unreasonable n-grams:

#' Remove unreasonable n-grams containing characters other than letters and spaces
#' @param ngrams A list of n-grams
#' @return Returns a list of filtered n-grams
filter_unreasonable_ngrams <- function(ngrams) {
  require(stringr)
  ngrams[!str_detect(ngrams, "[^a-z ]")]
}

Now apply that function to codes_grams so we get a list of clean n-grams. This might not be the best strategy: we’ll think about that later.

codes_grams <- lapply(codes_grams, filter_unreasonable_ngrams)

Function to compare two codes. Note that this function (unlike earlier versions and just for the purpose of this test) only returns one number for simplicity’s sake instead of a list. (And it no longer applies the filter for unreasonable n-grams inside this function. Instead that is applied globally earlier.)

#' Compare two codes using their list of n-grams.
#' @param orig_code A vector of n-grams representing a code.
#' @param dest_code A vectory of n-grams representing a code.
#' @return A list containing data that might help to identify whether codes
#'   match up.
compare_codes_by_shared_ngrams <- function(orig_code, dest_code) {
  require(magrittr)
  
  matches <- intersect(orig_code, dest_code)
  shared_ngrams <- unique(c(orig_code, dest_code))
  ratio_matches_to_possible <- length(matches) / length(unique(shared_ngrams))
  ratio_matches_to_destination <- length(matches) / length(unique(dest_code))
  return(ratio_matches_to_destination)
}

Now we use a function to create a matrix of comparisons (This is inefficient because it compares each code to itself. And for each possible pairing it does the comparison twice, swapping origin and destination codes. Good enough for now.)

This is to help us understand the comparisons (pretty sure I have the arrow pointing the right way!):

code_names <- str_extract(names(codes_grams), "\\w+")
names_of_comparisons <- outer(code_names, code_names, function(x, y){
  paste(y, "from", x)
}) 
colnames(names_of_comparisons) <- code_names
rownames(names_of_comparisons) <- code_names
print(names_of_comparisons)
##        NV1869               NY1849               NV1861              
## NV1869 "NV1869 from NV1869" "NY1849 from NV1869" "NV1861 from NV1869"
## NY1849 "NV1869 from NY1849" "NY1849 from NY1849" "NV1861 from NY1849"
## NV1861 "NV1869 from NV1861" "NY1849 from NV1861" "NV1861 from NV1861"
## NY1850 "NV1869 from NY1850" "NY1849 from NY1850" "NV1861 from NY1850"
##        NY1850              
## NV1869 "NY1850 from NV1869"
## NY1849 "NY1850 from NY1849"
## NV1861 "NY1850 from NV1861"
## NY1850 "NY1850 from NY1850"

These are the comparisons themselves. The way to read this matrix is “the proportion of ngrams that COLUMN borrows from ROW.” In other words the ROW is the origin code and the COLUMN is the destination code.

# From http://stackoverflow.com/questions/1719447/outer-equivalent-for-non-vector-lists-in-r
outer_for_lists <- function(a,b, fun) {
  outer(a, b, function(x,y) vapply(seq_along(x), function(i) fun(x[[i]], y[[i]]), numeric(1)))
}
comparison_of_codes <- outer_for_lists(codes_grams, codes_grams, compare_codes_by_shared_ngrams)
colnames(comparison_of_codes) <- code_names
rownames(comparison_of_codes) <- code_names
print(comparison_of_codes)
##        NV1869 NY1849 NV1861 NY1850
## NV1869 1.0000 0.1872 0.5605 0.1315
## NY1849 0.1378 1.0000 0.1300 0.1194
## NV1861 0.6389 0.2014 1.0000 0.1400
## NY1850 0.3207 0.3956 0.2996 1.0000

Now we can run this analysis on the whole shebang. First we get all the codes as n-grams:

codes_all_grams <- lapply(codes_texts, ngrammify, 5)
codes_all_grams <- lapply(codes_all_grams, filter_unreasonable_ngrams)

Here is the matrix of comparisons for all the codes:

codes_all_names <- str_extract(names(codes_all_grams), "\\w+")
comparison_of_all_codes <- outer_for_lists(codes_all_grams, codes_all_grams, compare_codes_by_shared_ngrams)
colnames(comparison_of_all_codes) <- codes_all_names
rownames(comparison_of_all_codes) <- codes_all_names

# Normalize comparison of all codes to percentages
comparison_of_all_codes <- round(comparison_of_all_codes * 100, digits = 1)
print(comparison_of_all_codes)
##                AZ1865 CA1850 CA1851 CT1879 FL1870 GA1860 KY1851 MI1853
## AZ1865          100.0   30.4   75.9    4.4   16.9    1.5    6.7    2.4
## CA1850           13.2  100.0   13.9    3.8   16.4    0.9    5.0    1.7
## CA1851           76.8   32.6  100.0    4.4   17.9    1.5    7.2    2.5
## CT1879            0.3    0.6    0.3  100.0    0.4    0.1    0.3    0.1
## FL1870           18.0   40.1   18.8    6.6  100.0    1.4    7.0    5.2
## GA1860            2.2    3.1    2.2    2.9    2.0  100.0    2.0    1.7
## KY1851            6.1   10.5    6.5    3.7    6.0    1.2  100.0    1.6
## MI1853            1.3    2.1    1.3    0.6    2.6    0.6    0.9  100.0
## MO1849            5.4   13.2    5.6    1.7    6.8    0.6    4.4    1.2
## MO1856            4.4   10.8    4.5    5.3    5.9    0.9    4.3    1.9
## NC1868           17.6   40.3   18.4    6.7   44.9    1.7    7.3    2.6
## NV1861           69.0   31.1   69.5    4.5   17.6    1.8    7.1    2.5
## NV1869           64.5   28.9   65.6    4.3   16.9    1.6    6.9    2.4
## NY1848           10.9   28.2   11.5    3.0   19.6    0.7    4.7    1.6
## NY1849           16.2   42.3   17.4    5.1   35.3    1.1    6.3    2.2
## NY1850           37.3   33.3   38.7    6.1   28.0    2.2   11.5    3.2
## OH1853            8.6   16.3    9.1    6.2    9.8    1.8   22.5    1.9
## OH1853extended   11.7   17.1   11.9    7.0   10.6    2.4   23.1    2.5
## UT1853            0.2    0.3    0.2    0.2    0.2    0.1    0.2    0.2
## UT1870           47.9   21.5   47.6    3.8   12.6    1.3    5.4    2.1
##                MO1849 MO1856 NC1868 NV1861 NV1869 NY1848 NY1849 NY1850
## AZ1865           13.8    7.8   14.0   54.4   57.9   20.0   19.8   13.7
## CA1850           14.6    8.3   13.9   10.6   11.2   22.5   22.4    5.3
## CA1851           14.5    8.1   14.8   55.4   59.6   21.4   21.6   14.4
## CT1879            0.3    0.6    0.4    0.2    0.3    0.4    0.4    0.2
## FL1870           18.6   11.1   38.1   14.8   16.1   38.3   45.8   10.9
## GA1860            2.1    2.4    2.1    2.1    2.2    2.0    1.9    1.2
## KY1851           10.2    6.9    5.3    5.1    5.6    7.9    7.0    3.9
## MI1853            1.7    1.8    1.1    1.1    1.2    1.6    1.4    0.6
## MO1849          100.0   20.0    6.2    4.4    4.6   14.7   10.1    2.5
## MO1856           28.8  100.0    5.1    3.7    4.0    8.8    6.7    2.0
## NC1868           19.7   11.2  100.0   14.5   15.7   33.2   41.4   10.5
## NV1861           14.1    8.4   14.7  100.0   63.9   20.2   20.1   14.0
## NV1869           12.9    8.0   14.0   56.1  100.0   19.0   18.7   13.2
## NY1848           20.5    8.5   14.4    8.7    9.3  100.0   36.3    7.5
## NY1849           21.1    9.8   27.0   13.0   13.8   54.5  100.0   11.9
## NY1850           17.2    9.6   22.6   30.0   32.1   37.6   39.6  100.0
## OH1853           11.0    8.3    8.7    7.3    8.0   10.8   10.9    5.0
## OH1853extended   11.5    8.9    9.6    9.8   10.6   11.6   11.8    6.9
## UT1853            0.3    0.3    0.2    0.2    0.2    0.2    0.2    0.1
## UT1870            9.6    6.3   10.3   42.0   55.9   14.2   14.0    9.8
##                OH1853 OH1853extended UT1853 UT1870
## AZ1865            9.4            8.3    2.2   41.8
## CA1850            7.7            5.3    1.5    8.1
## CA1851           10.1            8.6    2.2   42.0
## CT1879            0.4            0.3    0.1    0.2
## FL1870           11.4            8.1    2.1   11.6
## GA1860            2.9            2.5    2.4    1.7
## KY1851           22.4           15.1    1.6    4.3
## MI1853            1.1            0.9    1.1    1.0
## MO1849            4.7            3.2    1.2    3.3
## MO1856            5.1            3.6    1.7    3.1
## NC1868           11.9            8.6    2.3   11.2
## NV1861           10.2            8.9    2.7   46.5
## NV1869            9.7            8.4    2.2   54.3
## NY1848            6.5            4.5    1.1    6.7
## NY1849            9.8            6.9    1.4   10.0
## NY1850           15.0           13.4    3.5   23.2
## OH1853          100.0           65.3    2.4    6.2
## OH1853extended  100.0          100.0    3.8    8.4
## UT1853            0.2            0.3  100.0    0.2
## UT1870            7.8            6.9    2.3  100.0

Let’s write that comparison to disk for someone who know about the history of codes to make sense of it.

write.csv(comparison_of_all_codes, "out/comparison_of_all_codes.csv")

What we really care about is just the triangular comparison.

triangle <- comparison_of_all_codes
triangle[upper.tri(triangle, diag = TRUE)]  <- NA
print(triangle)
##                AZ1865 CA1850 CA1851 CT1879 FL1870 GA1860 KY1851 MI1853
## AZ1865             NA     NA     NA     NA     NA     NA     NA     NA
## CA1850           13.2     NA     NA     NA     NA     NA     NA     NA
## CA1851           76.8   32.6     NA     NA     NA     NA     NA     NA
## CT1879            0.3    0.6    0.3     NA     NA     NA     NA     NA
## FL1870           18.0   40.1   18.8    6.6     NA     NA     NA     NA
## GA1860            2.2    3.1    2.2    2.9    2.0     NA     NA     NA
## KY1851            6.1   10.5    6.5    3.7    6.0    1.2     NA     NA
## MI1853            1.3    2.1    1.3    0.6    2.6    0.6    0.9     NA
## MO1849            5.4   13.2    5.6    1.7    6.8    0.6    4.4    1.2
## MO1856            4.4   10.8    4.5    5.3    5.9    0.9    4.3    1.9
## NC1868           17.6   40.3   18.4    6.7   44.9    1.7    7.3    2.6
## NV1861           69.0   31.1   69.5    4.5   17.6    1.8    7.1    2.5
## NV1869           64.5   28.9   65.6    4.3   16.9    1.6    6.9    2.4
## NY1848           10.9   28.2   11.5    3.0   19.6    0.7    4.7    1.6
## NY1849           16.2   42.3   17.4    5.1   35.3    1.1    6.3    2.2
## NY1850           37.3   33.3   38.7    6.1   28.0    2.2   11.5    3.2
## OH1853            8.6   16.3    9.1    6.2    9.8    1.8   22.5    1.9
## OH1853extended   11.7   17.1   11.9    7.0   10.6    2.4   23.1    2.5
## UT1853            0.2    0.3    0.2    0.2    0.2    0.1    0.2    0.2
## UT1870           47.9   21.5   47.6    3.8   12.6    1.3    5.4    2.1
##                MO1849 MO1856 NC1868 NV1861 NV1869 NY1848 NY1849 NY1850
## AZ1865             NA     NA     NA     NA     NA     NA     NA     NA
## CA1850             NA     NA     NA     NA     NA     NA     NA     NA
## CA1851             NA     NA     NA     NA     NA     NA     NA     NA
## CT1879             NA     NA     NA     NA     NA     NA     NA     NA
## FL1870             NA     NA     NA     NA     NA     NA     NA     NA
## GA1860             NA     NA     NA     NA     NA     NA     NA     NA
## KY1851             NA     NA     NA     NA     NA     NA     NA     NA
## MI1853             NA     NA     NA     NA     NA     NA     NA     NA
## MO1849             NA     NA     NA     NA     NA     NA     NA     NA
## MO1856           28.8     NA     NA     NA     NA     NA     NA     NA
## NC1868           19.7   11.2     NA     NA     NA     NA     NA     NA
## NV1861           14.1    8.4   14.7     NA     NA     NA     NA     NA
## NV1869           12.9    8.0   14.0   56.1     NA     NA     NA     NA
## NY1848           20.5    8.5   14.4    8.7    9.3     NA     NA     NA
## NY1849           21.1    9.8   27.0   13.0   13.8   54.5     NA     NA
## NY1850           17.2    9.6   22.6   30.0   32.1   37.6   39.6     NA
## OH1853           11.0    8.3    8.7    7.3    8.0   10.8   10.9    5.0
## OH1853extended   11.5    8.9    9.6    9.8   10.6   11.6   11.8    6.9
## UT1853            0.3    0.3    0.2    0.2    0.2    0.2    0.2    0.1
## UT1870            9.6    6.3   10.3   42.0   55.9   14.2   14.0    9.8
##                OH1853 OH1853extended UT1853 UT1870
## AZ1865             NA             NA     NA     NA
## CA1850             NA             NA     NA     NA
## CA1851             NA             NA     NA     NA
## CT1879             NA             NA     NA     NA
## FL1870             NA             NA     NA     NA
## GA1860             NA             NA     NA     NA
## KY1851             NA             NA     NA     NA
## MI1853             NA             NA     NA     NA
## MO1849             NA             NA     NA     NA
## MO1856             NA             NA     NA     NA
## NC1868             NA             NA     NA     NA
## NV1861             NA             NA     NA     NA
## NV1869             NA             NA     NA     NA
## NY1848             NA             NA     NA     NA
## NY1849             NA             NA     NA     NA
## NY1850             NA             NA     NA     NA
## OH1853             NA             NA     NA     NA
## OH1853extended  100.0             NA     NA     NA
## UT1853            0.2            0.3     NA     NA
## UT1870            7.8            6.9    2.3     NA

What is the distribution of comparisons?

triangle %>% hist()

plot of chunk unnamed-chunk-15

For each code we can get a sorted list of the codes that it is most like.

for(i in 1:ncol(comparison_of_all_codes)) {
  x <- comparison_of_all_codes[,i]
  title <- colnames(comparison_of_all_codes)[i]
  cat("\n\n", title, " is most like:\n", sep = "")
  cat("---------\n")
  y <- sort(x, decreasing = TRUE)
  for(j in 2:length(y)) {
    cat(names(y)[j], y[j], "\n", sep = " ") 
    }
}
## 
## 
## AZ1865 is most like:
## ---------
## CA1851 76.8 
## NV1861 69 
## NV1869 64.5 
## UT1870 47.9 
## NY1850 37.3 
## FL1870 18 
## NC1868 17.6 
## NY1849 16.2 
## CA1850 13.2 
## OH1853extended 11.7 
## NY1848 10.9 
## OH1853 8.6 
## KY1851 6.1 
## MO1849 5.4 
## MO1856 4.4 
## GA1860 2.2 
## MI1853 1.3 
## CT1879 0.3 
## UT1853 0.2 
## 
## 
## CA1850 is most like:
## ---------
## NY1849 42.3 
## NC1868 40.3 
## FL1870 40.1 
## NY1850 33.3 
## CA1851 32.6 
## NV1861 31.1 
## AZ1865 30.4 
## NV1869 28.9 
## NY1848 28.2 
## UT1870 21.5 
## OH1853extended 17.1 
## OH1853 16.3 
## MO1849 13.2 
## MO1856 10.8 
## KY1851 10.5 
## GA1860 3.1 
## MI1853 2.1 
## CT1879 0.6 
## UT1853 0.3 
## 
## 
## CA1851 is most like:
## ---------
## AZ1865 75.9 
## NV1861 69.5 
## NV1869 65.6 
## UT1870 47.6 
## NY1850 38.7 
## FL1870 18.8 
## NC1868 18.4 
## NY1849 17.4 
## CA1850 13.9 
## OH1853extended 11.9 
## NY1848 11.5 
## OH1853 9.1 
## KY1851 6.5 
## MO1849 5.6 
## MO1856 4.5 
## GA1860 2.2 
## MI1853 1.3 
## CT1879 0.3 
## UT1853 0.2 
## 
## 
## CT1879 is most like:
## ---------
## OH1853extended 7 
## NC1868 6.7 
## FL1870 6.6 
## OH1853 6.2 
## NY1850 6.1 
## MO1856 5.3 
## NY1849 5.1 
## NV1861 4.5 
## AZ1865 4.4 
## CA1851 4.4 
## NV1869 4.3 
## CA1850 3.8 
## UT1870 3.8 
## KY1851 3.7 
## NY1848 3 
## GA1860 2.9 
## MO1849 1.7 
## MI1853 0.6 
## UT1853 0.2 
## 
## 
## FL1870 is most like:
## ---------
## NC1868 44.9 
## NY1849 35.3 
## NY1850 28 
## NY1848 19.6 
## CA1851 17.9 
## NV1861 17.6 
## AZ1865 16.9 
## NV1869 16.9 
## CA1850 16.4 
## UT1870 12.6 
## OH1853extended 10.6 
## OH1853 9.8 
## MO1849 6.8 
## KY1851 6 
## MO1856 5.9 
## MI1853 2.6 
## GA1860 2 
## CT1879 0.4 
## UT1853 0.2 
## 
## 
## GA1860 is most like:
## ---------
## OH1853extended 2.4 
## NY1850 2.2 
## NV1861 1.8 
## OH1853 1.8 
## NC1868 1.7 
## NV1869 1.6 
## AZ1865 1.5 
## CA1851 1.5 
## FL1870 1.4 
## UT1870 1.3 
## KY1851 1.2 
## NY1849 1.1 
## CA1850 0.9 
## MO1856 0.9 
## NY1848 0.7 
## MI1853 0.6 
## MO1849 0.6 
## CT1879 0.1 
## UT1853 0.1 
## 
## 
## KY1851 is most like:
## ---------
## OH1853extended 23.1 
## OH1853 22.5 
## NY1850 11.5 
## NC1868 7.3 
## CA1851 7.2 
## NV1861 7.1 
## FL1870 7 
## NV1869 6.9 
## AZ1865 6.7 
## NY1849 6.3 
## UT1870 5.4 
## CA1850 5 
## NY1848 4.7 
## MO1849 4.4 
## MO1856 4.3 
## GA1860 2 
## MI1853 0.9 
## CT1879 0.3 
## UT1853 0.2 
## 
## 
## MI1853 is most like:
## ---------
## FL1870 5.2 
## NY1850 3.2 
## NC1868 2.6 
## CA1851 2.5 
## NV1861 2.5 
## OH1853extended 2.5 
## AZ1865 2.4 
## NV1869 2.4 
## NY1849 2.2 
## UT1870 2.1 
## MO1856 1.9 
## OH1853 1.9 
## CA1850 1.7 
## GA1860 1.7 
## KY1851 1.6 
## NY1848 1.6 
## MO1849 1.2 
## UT1853 0.2 
## CT1879 0.1 
## 
## 
## MO1849 is most like:
## ---------
## MO1856 28.8 
## NY1849 21.1 
## NY1848 20.5 
## NC1868 19.7 
## FL1870 18.6 
## NY1850 17.2 
## CA1850 14.6 
## CA1851 14.5 
## NV1861 14.1 
## AZ1865 13.8 
## NV1869 12.9 
## OH1853extended 11.5 
## OH1853 11 
## KY1851 10.2 
## UT1870 9.6 
## GA1860 2.1 
## MI1853 1.7 
## CT1879 0.3 
## UT1853 0.3 
## 
## 
## MO1856 is most like:
## ---------
## MO1849 20 
## NC1868 11.2 
## FL1870 11.1 
## NY1849 9.8 
## NY1850 9.6 
## OH1853extended 8.9 
## NY1848 8.5 
## NV1861 8.4 
## CA1850 8.3 
## OH1853 8.3 
## CA1851 8.1 
## NV1869 8 
## AZ1865 7.8 
## KY1851 6.9 
## UT1870 6.3 
## GA1860 2.4 
## MI1853 1.8 
## CT1879 0.6 
## UT1853 0.3 
## 
## 
## NC1868 is most like:
## ---------
## FL1870 38.1 
## NY1849 27 
## NY1850 22.6 
## CA1851 14.8 
## NV1861 14.7 
## NY1848 14.4 
## AZ1865 14 
## NV1869 14 
## CA1850 13.9 
## UT1870 10.3 
## OH1853extended 9.6 
## OH1853 8.7 
## MO1849 6.2 
## KY1851 5.3 
## MO1856 5.1 
## GA1860 2.1 
## MI1853 1.1 
## CT1879 0.4 
## UT1853 0.2 
## 
## 
## NV1861 is most like:
## ---------
## NV1869 56.1 
## CA1851 55.4 
## AZ1865 54.4 
## UT1870 42 
## NY1850 30 
## FL1870 14.8 
## NC1868 14.5 
## NY1849 13 
## CA1850 10.6 
## OH1853extended 9.8 
## NY1848 8.7 
## OH1853 7.3 
## KY1851 5.1 
## MO1849 4.4 
## MO1856 3.7 
## GA1860 2.1 
## MI1853 1.1 
## CT1879 0.2 
## UT1853 0.2 
## 
## 
## NV1869 is most like:
## ---------
## NV1861 63.9 
## CA1851 59.6 
## AZ1865 57.9 
## UT1870 55.9 
## NY1850 32.1 
## FL1870 16.1 
## NC1868 15.7 
## NY1849 13.8 
## CA1850 11.2 
## OH1853extended 10.6 
## NY1848 9.3 
## OH1853 8 
## KY1851 5.6 
## MO1849 4.6 
## MO1856 4 
## GA1860 2.2 
## MI1853 1.2 
## CT1879 0.3 
## UT1853 0.2 
## 
## 
## NY1848 is most like:
## ---------
## NY1849 54.5 
## FL1870 38.3 
## NY1850 37.6 
## NC1868 33.2 
## CA1850 22.5 
## CA1851 21.4 
## NV1861 20.2 
## AZ1865 20 
## NV1869 19 
## MO1849 14.7 
## UT1870 14.2 
## OH1853extended 11.6 
## OH1853 10.8 
## MO1856 8.8 
## KY1851 7.9 
## GA1860 2 
## MI1853 1.6 
## CT1879 0.4 
## UT1853 0.2 
## 
## 
## NY1849 is most like:
## ---------
## FL1870 45.8 
## NC1868 41.4 
## NY1850 39.6 
## NY1848 36.3 
## CA1850 22.4 
## CA1851 21.6 
## NV1861 20.1 
## AZ1865 19.8 
## NV1869 18.7 
## UT1870 14 
## OH1853extended 11.8 
## OH1853 10.9 
## MO1849 10.1 
## KY1851 7 
## MO1856 6.7 
## GA1860 1.9 
## MI1853 1.4 
## CT1879 0.4 
## UT1853 0.2 
## 
## 
## NY1850 is most like:
## ---------
## CA1851 14.4 
## NV1861 14 
## AZ1865 13.7 
## NV1869 13.2 
## NY1849 11.9 
## FL1870 10.9 
## NC1868 10.5 
## UT1870 9.8 
## NY1848 7.5 
## OH1853extended 6.9 
## CA1850 5.3 
## OH1853 5 
## KY1851 3.9 
## MO1849 2.5 
## MO1856 2 
## GA1860 1.2 
## MI1853 0.6 
## CT1879 0.2 
## UT1853 0.1 
## 
## 
## OH1853 is most like:
## ---------
## OH1853extended 100 
## KY1851 22.4 
## NY1850 15 
## NC1868 11.9 
## FL1870 11.4 
## NV1861 10.2 
## CA1851 10.1 
## NY1849 9.8 
## NV1869 9.7 
## AZ1865 9.4 
## UT1870 7.8 
## CA1850 7.7 
## NY1848 6.5 
## MO1856 5.1 
## MO1849 4.7 
## GA1860 2.9 
## MI1853 1.1 
## CT1879 0.4 
## UT1853 0.2 
## 
## 
## OH1853extended is most like:
## ---------
## OH1853 65.3 
## KY1851 15.1 
## NY1850 13.4 
## NV1861 8.9 
## CA1851 8.6 
## NC1868 8.6 
## NV1869 8.4 
## AZ1865 8.3 
## FL1870 8.1 
## NY1849 6.9 
## UT1870 6.9 
## CA1850 5.3 
## NY1848 4.5 
## MO1856 3.6 
## MO1849 3.2 
## GA1860 2.5 
## MI1853 0.9 
## CT1879 0.3 
## UT1853 0.3 
## 
## 
## UT1853 is most like:
## ---------
## OH1853extended 3.8 
## NY1850 3.5 
## NV1861 2.7 
## GA1860 2.4 
## OH1853 2.4 
## NC1868 2.3 
## UT1870 2.3 
## AZ1865 2.2 
## CA1851 2.2 
## NV1869 2.2 
## FL1870 2.1 
## MO1856 1.7 
## KY1851 1.6 
## CA1850 1.5 
## NY1849 1.4 
## MO1849 1.2 
## MI1853 1.1 
## NY1848 1.1 
## CT1879 0.1 
## 
## 
## UT1870 is most like:
## ---------
## NV1869 54.3 
## NV1861 46.5 
## CA1851 42 
## AZ1865 41.8 
## NY1850 23.2 
## FL1870 11.6 
## NC1868 11.2 
## NY1849 10 
## OH1853extended 8.4 
## CA1850 8.1 
## NY1848 6.7 
## OH1853 6.2 
## KY1851 4.3 
## MO1849 3.3 
## MO1856 3.1 
## GA1860 1.7 
## MI1853 1 
## CT1879 0.2 
## UT1853 0.2

Now this is more than we can figure out meaningfully today, but it’s possible to run standard distance and clustering functions on that matrix. I’m not sure what this even means just yet.

Let’s see the standard distance function:

distance <- comparison_of_all_codes %>% dist
print(distance)
##                AZ1865 CA1850 CA1851 CT1879 FL1870 GA1860 KY1851 MI1853
## CA1850         146.83                                                 
## CA1851          33.71 147.01                                          
## CT1879         188.42 146.91 190.66                                   
## FL1870         151.20 110.28 150.93 164.89                            
## GA1860         185.26 145.25 187.40 139.54 163.74                     
## KY1851         173.60 134.85 175.10 143.07 150.82 142.13              
## MI1853         186.89 146.09 189.03 141.00 162.55 139.87 143.34       
## MO1849         170.48 124.86 171.91 143.75 140.69 142.11 133.81 142.37
## MO1856         176.08 132.30 177.76 141.61 148.66 142.30 136.47 142.41
## NC1868         152.95 111.60 152.81 164.28  83.20 162.99 150.38 164.12
## NV1861          56.06 152.68  55.01 194.07 156.18 190.65 178.95 192.49
## NV1869          57.48 150.20  56.15 189.91 153.96 186.63 175.15 188.36
## NY1848         156.74 106.90 157.08 151.59 106.62 149.87 139.90 150.30
## NY1849         153.13 105.07 152.51 165.33  86.92 163.70 150.81 164.42
## NY1850         124.01 128.73 123.41 173.77 122.11 171.38 155.34 172.82
## OH1853         176.15 140.93 177.21 156.71 152.80 155.50 121.12 157.84
## OH1853extended 185.89 158.14 186.78 174.76 167.31 173.10 139.36 175.63
## UT1853         189.85 148.82 192.09 141.21 168.00 139.92 144.71 140.59
## UT1870          85.33 141.96  86.49 173.09 149.50 170.42 161.34 171.83
##                MO1849 MO1856 NC1868 NV1861 NV1869 NY1848 NY1849 NY1850
## CA1850                                                                
## CA1851                                                                
## CT1879                                                                
## FL1870                                                                
## GA1860                                                                
## KY1851                                                                
## MI1853                                                                
## MO1849                                                                
## MO1856         107.43                                                 
## NC1868         140.05 148.35                                          
## NV1861         176.12 181.41 157.86                                   
## NV1869         172.87 177.76 155.64  57.78                            
## NY1848         122.56 135.68 113.47 162.52 159.82                     
## NY1849         136.71 148.31  96.65 158.51 156.65  83.01              
## NY1850         152.32 159.73 125.99 129.38 127.89 124.65 116.90       
## OH1853         146.44 148.45 152.32 180.84 177.63 147.49 153.88 156.68
## OH1853extended 164.44 166.19 166.78 190.07 187.28 164.19 168.69 167.95
## UT1853         144.23 144.41 167.33 195.24 191.27 153.16 167.94 175.61
## UT1870         159.83 163.71 150.80  86.49  70.69 150.89 151.77 129.91
##                OH1853 OH1853extended UT1853
## CA1850                                     
## CA1851                                     
## CT1879                                     
## FL1870                                     
## GA1860                                     
## KY1851                                     
## MI1853                                     
## MO1849                                     
## MO1856                                     
## NC1868                                     
## NV1861                                     
## NV1869                                     
## NY1848                                     
## NY1849                                     
## NY1850                                     
## OH1853                                     
## OH1853extended  35.37                      
## UT1853         159.36         176.81       
## UT1870         166.20         177.80 174.16

And we can then pass that distance matrix to hierarchical clustering or k-means clustering (obviously we need to read up on all these) and plot each.

distance %>% hclust() %>% str()
## List of 7
##  $ merge      : int [1:19, 1:2] -1 -17 -12 -13 -14 -5 -20 -2 -9 6 ...
##  $ height     : num [1:19] 33.7 35.4 56.1 57.8 83 ...
##  $ order      : int [1:20] 20 13 12 1 3 16 5 11 2 14 ...
##  $ labels     : chr [1:20] "AZ1865" "CA1850" "CA1851" "CT1879" ...
##  $ method     : chr "complete"
##  $ call       : language hclust(d = distance)
##  $ dist.method: chr "euclidean"
##  - attr(*, "class")= chr "hclust"
distance %>% hclust() %>% plot()

plot of chunk unnamed-chunk-18

I wouldn’t put any stock in that chart, but it does show what should be possible. And apart from the problem of the chronological arrow, it is interesting that western states are close to one another, the New York codes are close together, the two Ohio codes are at least close together, etc. Time to learn some statistics.

And now for k-means with a plot that I kind of get.

cluster <- kmeans(comparison_of_all_codes, centers = 5)
library(cluster)
clusplot(comparison_of_all_codes, cluster$cluster, labels = 2)

plot of chunk unnamed-chunk-19