Let’s do a quick and dirty test to see if we can get a comparison matrix.
library(magrittr)
library(RWeka)
library(stringr)
Load the codes.
Keep a random sample of three plus NY 1850 to keep computation time low for the test.
codes_sample <- c(sample(codes_texts, 3), NY1850.txt = codes_texts[["NY1850.txt"]])
Function for n-grams:
ngrammify <- function(data, n) {
NGramTokenizer(data, Weka_control(min = n, max = n))
}
Create the n-grams:
codes_grams <- lapply(codes_sample, ngrammify, 5)
Function to remove unreasonable n-grams:
#' Remove unreasonable n-grams containing characters other than letters and spaces
#' @param ngrams A list of n-grams
#' @return Returns a list of filtered n-grams
filter_unreasonable_ngrams <- function(ngrams) {
require(stringr)
ngrams[!str_detect(ngrams, "[^a-z ]")]
}
Now apply that function to codes_grams so we get a list of clean n-grams. This might not be the best strategy: we’ll think about that later.
codes_grams <- lapply(codes_grams, filter_unreasonable_ngrams)
Function to compare two codes. Note that this function (unlike earlier versions and just for the purpose of this test) only returns one number for simplicity’s sake instead of a list. (And it no longer applies the filter for unreasonable n-grams inside this function. Instead that is applied globally earlier.)
#' Compare two codes using their list of n-grams.
#' @param orig_code A vector of n-grams representing a code.
#' @param dest_code A vectory of n-grams representing a code.
#' @return A list containing data that might help to identify whether codes
#' match up.
compare_codes_by_shared_ngrams <- function(orig_code, dest_code) {
require(magrittr)
matches <- intersect(orig_code, dest_code)
shared_ngrams <- unique(c(orig_code, dest_code))
ratio_matches_to_possible <- length(matches) / length(unique(shared_ngrams))
ratio_matches_to_destination <- length(matches) / length(unique(dest_code))
return(ratio_matches_to_destination)
}
Now we use a function to create a matrix of comparisons (This is inefficient because it compares each code to itself. And for each possible pairing it does the comparison twice, swapping origin and destination codes. Good enough for now.)
This is to help us understand the comparisons (pretty sure I have the arrow pointing the right way!):
code_names <- str_extract(names(codes_grams), "\\w+")
names_of_comparisons <- outer(code_names, code_names, function(x, y){
paste(y, "from", x)
})
colnames(names_of_comparisons) <- code_names
rownames(names_of_comparisons) <- code_names
print(names_of_comparisons)
## NV1869 NY1849 NV1861
## NV1869 "NV1869 from NV1869" "NY1849 from NV1869" "NV1861 from NV1869"
## NY1849 "NV1869 from NY1849" "NY1849 from NY1849" "NV1861 from NY1849"
## NV1861 "NV1869 from NV1861" "NY1849 from NV1861" "NV1861 from NV1861"
## NY1850 "NV1869 from NY1850" "NY1849 from NY1850" "NV1861 from NY1850"
## NY1850
## NV1869 "NY1850 from NV1869"
## NY1849 "NY1850 from NY1849"
## NV1861 "NY1850 from NV1861"
## NY1850 "NY1850 from NY1850"
These are the comparisons themselves. The way to read this matrix is “the proportion of ngrams that COLUMN borrows from ROW.” In other words the ROW is the origin code and the COLUMN is the destination code.
# From http://stackoverflow.com/questions/1719447/outer-equivalent-for-non-vector-lists-in-r
outer_for_lists <- function(a,b, fun) {
outer(a, b, function(x,y) vapply(seq_along(x), function(i) fun(x[[i]], y[[i]]), numeric(1)))
}
comparison_of_codes <- outer_for_lists(codes_grams, codes_grams, compare_codes_by_shared_ngrams)
colnames(comparison_of_codes) <- code_names
rownames(comparison_of_codes) <- code_names
print(comparison_of_codes)
## NV1869 NY1849 NV1861 NY1850
## NV1869 1.0000 0.1872 0.5605 0.1315
## NY1849 0.1378 1.0000 0.1300 0.1194
## NV1861 0.6389 0.2014 1.0000 0.1400
## NY1850 0.3207 0.3956 0.2996 1.0000
Now we can run this analysis on the whole shebang. First we get all the codes as n-grams:
codes_all_grams <- lapply(codes_texts, ngrammify, 5)
codes_all_grams <- lapply(codes_all_grams, filter_unreasonable_ngrams)
Here is the matrix of comparisons for all the codes:
codes_all_names <- str_extract(names(codes_all_grams), "\\w+")
comparison_of_all_codes <- outer_for_lists(codes_all_grams, codes_all_grams, compare_codes_by_shared_ngrams)
colnames(comparison_of_all_codes) <- codes_all_names
rownames(comparison_of_all_codes) <- codes_all_names
# Normalize comparison of all codes to percentages
comparison_of_all_codes <- round(comparison_of_all_codes * 100, digits = 1)
print(comparison_of_all_codes)
## AZ1865 CA1850 CA1851 CT1879 FL1870 GA1860 KY1851 MI1853
## AZ1865 100.0 30.4 75.9 4.4 16.9 1.5 6.7 2.4
## CA1850 13.2 100.0 13.9 3.8 16.4 0.9 5.0 1.7
## CA1851 76.8 32.6 100.0 4.4 17.9 1.5 7.2 2.5
## CT1879 0.3 0.6 0.3 100.0 0.4 0.1 0.3 0.1
## FL1870 18.0 40.1 18.8 6.6 100.0 1.4 7.0 5.2
## GA1860 2.2 3.1 2.2 2.9 2.0 100.0 2.0 1.7
## KY1851 6.1 10.5 6.5 3.7 6.0 1.2 100.0 1.6
## MI1853 1.3 2.1 1.3 0.6 2.6 0.6 0.9 100.0
## MO1849 5.4 13.2 5.6 1.7 6.8 0.6 4.4 1.2
## MO1856 4.4 10.8 4.5 5.3 5.9 0.9 4.3 1.9
## NC1868 17.6 40.3 18.4 6.7 44.9 1.7 7.3 2.6
## NV1861 69.0 31.1 69.5 4.5 17.6 1.8 7.1 2.5
## NV1869 64.5 28.9 65.6 4.3 16.9 1.6 6.9 2.4
## NY1848 10.9 28.2 11.5 3.0 19.6 0.7 4.7 1.6
## NY1849 16.2 42.3 17.4 5.1 35.3 1.1 6.3 2.2
## NY1850 37.3 33.3 38.7 6.1 28.0 2.2 11.5 3.2
## OH1853 8.6 16.3 9.1 6.2 9.8 1.8 22.5 1.9
## OH1853extended 11.7 17.1 11.9 7.0 10.6 2.4 23.1 2.5
## UT1853 0.2 0.3 0.2 0.2 0.2 0.1 0.2 0.2
## UT1870 47.9 21.5 47.6 3.8 12.6 1.3 5.4 2.1
## MO1849 MO1856 NC1868 NV1861 NV1869 NY1848 NY1849 NY1850
## AZ1865 13.8 7.8 14.0 54.4 57.9 20.0 19.8 13.7
## CA1850 14.6 8.3 13.9 10.6 11.2 22.5 22.4 5.3
## CA1851 14.5 8.1 14.8 55.4 59.6 21.4 21.6 14.4
## CT1879 0.3 0.6 0.4 0.2 0.3 0.4 0.4 0.2
## FL1870 18.6 11.1 38.1 14.8 16.1 38.3 45.8 10.9
## GA1860 2.1 2.4 2.1 2.1 2.2 2.0 1.9 1.2
## KY1851 10.2 6.9 5.3 5.1 5.6 7.9 7.0 3.9
## MI1853 1.7 1.8 1.1 1.1 1.2 1.6 1.4 0.6
## MO1849 100.0 20.0 6.2 4.4 4.6 14.7 10.1 2.5
## MO1856 28.8 100.0 5.1 3.7 4.0 8.8 6.7 2.0
## NC1868 19.7 11.2 100.0 14.5 15.7 33.2 41.4 10.5
## NV1861 14.1 8.4 14.7 100.0 63.9 20.2 20.1 14.0
## NV1869 12.9 8.0 14.0 56.1 100.0 19.0 18.7 13.2
## NY1848 20.5 8.5 14.4 8.7 9.3 100.0 36.3 7.5
## NY1849 21.1 9.8 27.0 13.0 13.8 54.5 100.0 11.9
## NY1850 17.2 9.6 22.6 30.0 32.1 37.6 39.6 100.0
## OH1853 11.0 8.3 8.7 7.3 8.0 10.8 10.9 5.0
## OH1853extended 11.5 8.9 9.6 9.8 10.6 11.6 11.8 6.9
## UT1853 0.3 0.3 0.2 0.2 0.2 0.2 0.2 0.1
## UT1870 9.6 6.3 10.3 42.0 55.9 14.2 14.0 9.8
## OH1853 OH1853extended UT1853 UT1870
## AZ1865 9.4 8.3 2.2 41.8
## CA1850 7.7 5.3 1.5 8.1
## CA1851 10.1 8.6 2.2 42.0
## CT1879 0.4 0.3 0.1 0.2
## FL1870 11.4 8.1 2.1 11.6
## GA1860 2.9 2.5 2.4 1.7
## KY1851 22.4 15.1 1.6 4.3
## MI1853 1.1 0.9 1.1 1.0
## MO1849 4.7 3.2 1.2 3.3
## MO1856 5.1 3.6 1.7 3.1
## NC1868 11.9 8.6 2.3 11.2
## NV1861 10.2 8.9 2.7 46.5
## NV1869 9.7 8.4 2.2 54.3
## NY1848 6.5 4.5 1.1 6.7
## NY1849 9.8 6.9 1.4 10.0
## NY1850 15.0 13.4 3.5 23.2
## OH1853 100.0 65.3 2.4 6.2
## OH1853extended 100.0 100.0 3.8 8.4
## UT1853 0.2 0.3 100.0 0.2
## UT1870 7.8 6.9 2.3 100.0
Let’s write that comparison to disk for someone who know about the history of codes to make sense of it.
write.csv(comparison_of_all_codes, "out/comparison_of_all_codes.csv")
What we really care about is just the triangular comparison.
triangle <- comparison_of_all_codes
triangle[upper.tri(triangle, diag = TRUE)] <- NA
print(triangle)
## AZ1865 CA1850 CA1851 CT1879 FL1870 GA1860 KY1851 MI1853
## AZ1865 NA NA NA NA NA NA NA NA
## CA1850 13.2 NA NA NA NA NA NA NA
## CA1851 76.8 32.6 NA NA NA NA NA NA
## CT1879 0.3 0.6 0.3 NA NA NA NA NA
## FL1870 18.0 40.1 18.8 6.6 NA NA NA NA
## GA1860 2.2 3.1 2.2 2.9 2.0 NA NA NA
## KY1851 6.1 10.5 6.5 3.7 6.0 1.2 NA NA
## MI1853 1.3 2.1 1.3 0.6 2.6 0.6 0.9 NA
## MO1849 5.4 13.2 5.6 1.7 6.8 0.6 4.4 1.2
## MO1856 4.4 10.8 4.5 5.3 5.9 0.9 4.3 1.9
## NC1868 17.6 40.3 18.4 6.7 44.9 1.7 7.3 2.6
## NV1861 69.0 31.1 69.5 4.5 17.6 1.8 7.1 2.5
## NV1869 64.5 28.9 65.6 4.3 16.9 1.6 6.9 2.4
## NY1848 10.9 28.2 11.5 3.0 19.6 0.7 4.7 1.6
## NY1849 16.2 42.3 17.4 5.1 35.3 1.1 6.3 2.2
## NY1850 37.3 33.3 38.7 6.1 28.0 2.2 11.5 3.2
## OH1853 8.6 16.3 9.1 6.2 9.8 1.8 22.5 1.9
## OH1853extended 11.7 17.1 11.9 7.0 10.6 2.4 23.1 2.5
## UT1853 0.2 0.3 0.2 0.2 0.2 0.1 0.2 0.2
## UT1870 47.9 21.5 47.6 3.8 12.6 1.3 5.4 2.1
## MO1849 MO1856 NC1868 NV1861 NV1869 NY1848 NY1849 NY1850
## AZ1865 NA NA NA NA NA NA NA NA
## CA1850 NA NA NA NA NA NA NA NA
## CA1851 NA NA NA NA NA NA NA NA
## CT1879 NA NA NA NA NA NA NA NA
## FL1870 NA NA NA NA NA NA NA NA
## GA1860 NA NA NA NA NA NA NA NA
## KY1851 NA NA NA NA NA NA NA NA
## MI1853 NA NA NA NA NA NA NA NA
## MO1849 NA NA NA NA NA NA NA NA
## MO1856 28.8 NA NA NA NA NA NA NA
## NC1868 19.7 11.2 NA NA NA NA NA NA
## NV1861 14.1 8.4 14.7 NA NA NA NA NA
## NV1869 12.9 8.0 14.0 56.1 NA NA NA NA
## NY1848 20.5 8.5 14.4 8.7 9.3 NA NA NA
## NY1849 21.1 9.8 27.0 13.0 13.8 54.5 NA NA
## NY1850 17.2 9.6 22.6 30.0 32.1 37.6 39.6 NA
## OH1853 11.0 8.3 8.7 7.3 8.0 10.8 10.9 5.0
## OH1853extended 11.5 8.9 9.6 9.8 10.6 11.6 11.8 6.9
## UT1853 0.3 0.3 0.2 0.2 0.2 0.2 0.2 0.1
## UT1870 9.6 6.3 10.3 42.0 55.9 14.2 14.0 9.8
## OH1853 OH1853extended UT1853 UT1870
## AZ1865 NA NA NA NA
## CA1850 NA NA NA NA
## CA1851 NA NA NA NA
## CT1879 NA NA NA NA
## FL1870 NA NA NA NA
## GA1860 NA NA NA NA
## KY1851 NA NA NA NA
## MI1853 NA NA NA NA
## MO1849 NA NA NA NA
## MO1856 NA NA NA NA
## NC1868 NA NA NA NA
## NV1861 NA NA NA NA
## NV1869 NA NA NA NA
## NY1848 NA NA NA NA
## NY1849 NA NA NA NA
## NY1850 NA NA NA NA
## OH1853 NA NA NA NA
## OH1853extended 100.0 NA NA NA
## UT1853 0.2 0.3 NA NA
## UT1870 7.8 6.9 2.3 NA
What is the distribution of comparisons?
triangle %>% hist()
For each code we can get a sorted list of the codes that it is most like.
for(i in 1:ncol(comparison_of_all_codes)) {
x <- comparison_of_all_codes[,i]
title <- colnames(comparison_of_all_codes)[i]
cat("\n\n", title, " is most like:\n", sep = "")
cat("---------\n")
y <- sort(x, decreasing = TRUE)
for(j in 2:length(y)) {
cat(names(y)[j], y[j], "\n", sep = " ")
}
}
##
##
## AZ1865 is most like:
## ---------
## CA1851 76.8
## NV1861 69
## NV1869 64.5
## UT1870 47.9
## NY1850 37.3
## FL1870 18
## NC1868 17.6
## NY1849 16.2
## CA1850 13.2
## OH1853extended 11.7
## NY1848 10.9
## OH1853 8.6
## KY1851 6.1
## MO1849 5.4
## MO1856 4.4
## GA1860 2.2
## MI1853 1.3
## CT1879 0.3
## UT1853 0.2
##
##
## CA1850 is most like:
## ---------
## NY1849 42.3
## NC1868 40.3
## FL1870 40.1
## NY1850 33.3
## CA1851 32.6
## NV1861 31.1
## AZ1865 30.4
## NV1869 28.9
## NY1848 28.2
## UT1870 21.5
## OH1853extended 17.1
## OH1853 16.3
## MO1849 13.2
## MO1856 10.8
## KY1851 10.5
## GA1860 3.1
## MI1853 2.1
## CT1879 0.6
## UT1853 0.3
##
##
## CA1851 is most like:
## ---------
## AZ1865 75.9
## NV1861 69.5
## NV1869 65.6
## UT1870 47.6
## NY1850 38.7
## FL1870 18.8
## NC1868 18.4
## NY1849 17.4
## CA1850 13.9
## OH1853extended 11.9
## NY1848 11.5
## OH1853 9.1
## KY1851 6.5
## MO1849 5.6
## MO1856 4.5
## GA1860 2.2
## MI1853 1.3
## CT1879 0.3
## UT1853 0.2
##
##
## CT1879 is most like:
## ---------
## OH1853extended 7
## NC1868 6.7
## FL1870 6.6
## OH1853 6.2
## NY1850 6.1
## MO1856 5.3
## NY1849 5.1
## NV1861 4.5
## AZ1865 4.4
## CA1851 4.4
## NV1869 4.3
## CA1850 3.8
## UT1870 3.8
## KY1851 3.7
## NY1848 3
## GA1860 2.9
## MO1849 1.7
## MI1853 0.6
## UT1853 0.2
##
##
## FL1870 is most like:
## ---------
## NC1868 44.9
## NY1849 35.3
## NY1850 28
## NY1848 19.6
## CA1851 17.9
## NV1861 17.6
## AZ1865 16.9
## NV1869 16.9
## CA1850 16.4
## UT1870 12.6
## OH1853extended 10.6
## OH1853 9.8
## MO1849 6.8
## KY1851 6
## MO1856 5.9
## MI1853 2.6
## GA1860 2
## CT1879 0.4
## UT1853 0.2
##
##
## GA1860 is most like:
## ---------
## OH1853extended 2.4
## NY1850 2.2
## NV1861 1.8
## OH1853 1.8
## NC1868 1.7
## NV1869 1.6
## AZ1865 1.5
## CA1851 1.5
## FL1870 1.4
## UT1870 1.3
## KY1851 1.2
## NY1849 1.1
## CA1850 0.9
## MO1856 0.9
## NY1848 0.7
## MI1853 0.6
## MO1849 0.6
## CT1879 0.1
## UT1853 0.1
##
##
## KY1851 is most like:
## ---------
## OH1853extended 23.1
## OH1853 22.5
## NY1850 11.5
## NC1868 7.3
## CA1851 7.2
## NV1861 7.1
## FL1870 7
## NV1869 6.9
## AZ1865 6.7
## NY1849 6.3
## UT1870 5.4
## CA1850 5
## NY1848 4.7
## MO1849 4.4
## MO1856 4.3
## GA1860 2
## MI1853 0.9
## CT1879 0.3
## UT1853 0.2
##
##
## MI1853 is most like:
## ---------
## FL1870 5.2
## NY1850 3.2
## NC1868 2.6
## CA1851 2.5
## NV1861 2.5
## OH1853extended 2.5
## AZ1865 2.4
## NV1869 2.4
## NY1849 2.2
## UT1870 2.1
## MO1856 1.9
## OH1853 1.9
## CA1850 1.7
## GA1860 1.7
## KY1851 1.6
## NY1848 1.6
## MO1849 1.2
## UT1853 0.2
## CT1879 0.1
##
##
## MO1849 is most like:
## ---------
## MO1856 28.8
## NY1849 21.1
## NY1848 20.5
## NC1868 19.7
## FL1870 18.6
## NY1850 17.2
## CA1850 14.6
## CA1851 14.5
## NV1861 14.1
## AZ1865 13.8
## NV1869 12.9
## OH1853extended 11.5
## OH1853 11
## KY1851 10.2
## UT1870 9.6
## GA1860 2.1
## MI1853 1.7
## CT1879 0.3
## UT1853 0.3
##
##
## MO1856 is most like:
## ---------
## MO1849 20
## NC1868 11.2
## FL1870 11.1
## NY1849 9.8
## NY1850 9.6
## OH1853extended 8.9
## NY1848 8.5
## NV1861 8.4
## CA1850 8.3
## OH1853 8.3
## CA1851 8.1
## NV1869 8
## AZ1865 7.8
## KY1851 6.9
## UT1870 6.3
## GA1860 2.4
## MI1853 1.8
## CT1879 0.6
## UT1853 0.3
##
##
## NC1868 is most like:
## ---------
## FL1870 38.1
## NY1849 27
## NY1850 22.6
## CA1851 14.8
## NV1861 14.7
## NY1848 14.4
## AZ1865 14
## NV1869 14
## CA1850 13.9
## UT1870 10.3
## OH1853extended 9.6
## OH1853 8.7
## MO1849 6.2
## KY1851 5.3
## MO1856 5.1
## GA1860 2.1
## MI1853 1.1
## CT1879 0.4
## UT1853 0.2
##
##
## NV1861 is most like:
## ---------
## NV1869 56.1
## CA1851 55.4
## AZ1865 54.4
## UT1870 42
## NY1850 30
## FL1870 14.8
## NC1868 14.5
## NY1849 13
## CA1850 10.6
## OH1853extended 9.8
## NY1848 8.7
## OH1853 7.3
## KY1851 5.1
## MO1849 4.4
## MO1856 3.7
## GA1860 2.1
## MI1853 1.1
## CT1879 0.2
## UT1853 0.2
##
##
## NV1869 is most like:
## ---------
## NV1861 63.9
## CA1851 59.6
## AZ1865 57.9
## UT1870 55.9
## NY1850 32.1
## FL1870 16.1
## NC1868 15.7
## NY1849 13.8
## CA1850 11.2
## OH1853extended 10.6
## NY1848 9.3
## OH1853 8
## KY1851 5.6
## MO1849 4.6
## MO1856 4
## GA1860 2.2
## MI1853 1.2
## CT1879 0.3
## UT1853 0.2
##
##
## NY1848 is most like:
## ---------
## NY1849 54.5
## FL1870 38.3
## NY1850 37.6
## NC1868 33.2
## CA1850 22.5
## CA1851 21.4
## NV1861 20.2
## AZ1865 20
## NV1869 19
## MO1849 14.7
## UT1870 14.2
## OH1853extended 11.6
## OH1853 10.8
## MO1856 8.8
## KY1851 7.9
## GA1860 2
## MI1853 1.6
## CT1879 0.4
## UT1853 0.2
##
##
## NY1849 is most like:
## ---------
## FL1870 45.8
## NC1868 41.4
## NY1850 39.6
## NY1848 36.3
## CA1850 22.4
## CA1851 21.6
## NV1861 20.1
## AZ1865 19.8
## NV1869 18.7
## UT1870 14
## OH1853extended 11.8
## OH1853 10.9
## MO1849 10.1
## KY1851 7
## MO1856 6.7
## GA1860 1.9
## MI1853 1.4
## CT1879 0.4
## UT1853 0.2
##
##
## NY1850 is most like:
## ---------
## CA1851 14.4
## NV1861 14
## AZ1865 13.7
## NV1869 13.2
## NY1849 11.9
## FL1870 10.9
## NC1868 10.5
## UT1870 9.8
## NY1848 7.5
## OH1853extended 6.9
## CA1850 5.3
## OH1853 5
## KY1851 3.9
## MO1849 2.5
## MO1856 2
## GA1860 1.2
## MI1853 0.6
## CT1879 0.2
## UT1853 0.1
##
##
## OH1853 is most like:
## ---------
## OH1853extended 100
## KY1851 22.4
## NY1850 15
## NC1868 11.9
## FL1870 11.4
## NV1861 10.2
## CA1851 10.1
## NY1849 9.8
## NV1869 9.7
## AZ1865 9.4
## UT1870 7.8
## CA1850 7.7
## NY1848 6.5
## MO1856 5.1
## MO1849 4.7
## GA1860 2.9
## MI1853 1.1
## CT1879 0.4
## UT1853 0.2
##
##
## OH1853extended is most like:
## ---------
## OH1853 65.3
## KY1851 15.1
## NY1850 13.4
## NV1861 8.9
## CA1851 8.6
## NC1868 8.6
## NV1869 8.4
## AZ1865 8.3
## FL1870 8.1
## NY1849 6.9
## UT1870 6.9
## CA1850 5.3
## NY1848 4.5
## MO1856 3.6
## MO1849 3.2
## GA1860 2.5
## MI1853 0.9
## CT1879 0.3
## UT1853 0.3
##
##
## UT1853 is most like:
## ---------
## OH1853extended 3.8
## NY1850 3.5
## NV1861 2.7
## GA1860 2.4
## OH1853 2.4
## NC1868 2.3
## UT1870 2.3
## AZ1865 2.2
## CA1851 2.2
## NV1869 2.2
## FL1870 2.1
## MO1856 1.7
## KY1851 1.6
## CA1850 1.5
## NY1849 1.4
## MO1849 1.2
## MI1853 1.1
## NY1848 1.1
## CT1879 0.1
##
##
## UT1870 is most like:
## ---------
## NV1869 54.3
## NV1861 46.5
## CA1851 42
## AZ1865 41.8
## NY1850 23.2
## FL1870 11.6
## NC1868 11.2
## NY1849 10
## OH1853extended 8.4
## CA1850 8.1
## NY1848 6.7
## OH1853 6.2
## KY1851 4.3
## MO1849 3.3
## MO1856 3.1
## GA1860 1.7
## MI1853 1
## CT1879 0.2
## UT1853 0.2
Now this is more than we can figure out meaningfully today, but it’s possible to run standard distance and clustering functions on that matrix. I’m not sure what this even means just yet.
Let’s see the standard distance function:
distance <- comparison_of_all_codes %>% dist
print(distance)
## AZ1865 CA1850 CA1851 CT1879 FL1870 GA1860 KY1851 MI1853
## CA1850 146.83
## CA1851 33.71 147.01
## CT1879 188.42 146.91 190.66
## FL1870 151.20 110.28 150.93 164.89
## GA1860 185.26 145.25 187.40 139.54 163.74
## KY1851 173.60 134.85 175.10 143.07 150.82 142.13
## MI1853 186.89 146.09 189.03 141.00 162.55 139.87 143.34
## MO1849 170.48 124.86 171.91 143.75 140.69 142.11 133.81 142.37
## MO1856 176.08 132.30 177.76 141.61 148.66 142.30 136.47 142.41
## NC1868 152.95 111.60 152.81 164.28 83.20 162.99 150.38 164.12
## NV1861 56.06 152.68 55.01 194.07 156.18 190.65 178.95 192.49
## NV1869 57.48 150.20 56.15 189.91 153.96 186.63 175.15 188.36
## NY1848 156.74 106.90 157.08 151.59 106.62 149.87 139.90 150.30
## NY1849 153.13 105.07 152.51 165.33 86.92 163.70 150.81 164.42
## NY1850 124.01 128.73 123.41 173.77 122.11 171.38 155.34 172.82
## OH1853 176.15 140.93 177.21 156.71 152.80 155.50 121.12 157.84
## OH1853extended 185.89 158.14 186.78 174.76 167.31 173.10 139.36 175.63
## UT1853 189.85 148.82 192.09 141.21 168.00 139.92 144.71 140.59
## UT1870 85.33 141.96 86.49 173.09 149.50 170.42 161.34 171.83
## MO1849 MO1856 NC1868 NV1861 NV1869 NY1848 NY1849 NY1850
## CA1850
## CA1851
## CT1879
## FL1870
## GA1860
## KY1851
## MI1853
## MO1849
## MO1856 107.43
## NC1868 140.05 148.35
## NV1861 176.12 181.41 157.86
## NV1869 172.87 177.76 155.64 57.78
## NY1848 122.56 135.68 113.47 162.52 159.82
## NY1849 136.71 148.31 96.65 158.51 156.65 83.01
## NY1850 152.32 159.73 125.99 129.38 127.89 124.65 116.90
## OH1853 146.44 148.45 152.32 180.84 177.63 147.49 153.88 156.68
## OH1853extended 164.44 166.19 166.78 190.07 187.28 164.19 168.69 167.95
## UT1853 144.23 144.41 167.33 195.24 191.27 153.16 167.94 175.61
## UT1870 159.83 163.71 150.80 86.49 70.69 150.89 151.77 129.91
## OH1853 OH1853extended UT1853
## CA1850
## CA1851
## CT1879
## FL1870
## GA1860
## KY1851
## MI1853
## MO1849
## MO1856
## NC1868
## NV1861
## NV1869
## NY1848
## NY1849
## NY1850
## OH1853
## OH1853extended 35.37
## UT1853 159.36 176.81
## UT1870 166.20 177.80 174.16
And we can then pass that distance matrix to hierarchical clustering or k-means clustering (obviously we need to read up on all these) and plot each.
distance %>% hclust() %>% str()
## List of 7
## $ merge : int [1:19, 1:2] -1 -17 -12 -13 -14 -5 -20 -2 -9 6 ...
## $ height : num [1:19] 33.7 35.4 56.1 57.8 83 ...
## $ order : int [1:20] 20 13 12 1 3 16 5 11 2 14 ...
## $ labels : chr [1:20] "AZ1865" "CA1850" "CA1851" "CT1879" ...
## $ method : chr "complete"
## $ call : language hclust(d = distance)
## $ dist.method: chr "euclidean"
## - attr(*, "class")= chr "hclust"
distance %>% hclust() %>% plot()
I wouldn’t put any stock in that chart, but it does show what should be possible. And apart from the problem of the chronological arrow, it is interesting that western states are close to one another, the New York codes are close together, the two Ohio codes are at least close together, etc. Time to learn some statistics.
And now for k-means with a plot that I kind of get.
cluster <- kmeans(comparison_of_all_codes, centers = 5)
library(cluster)
clusplot(comparison_of_all_codes, cluster$cluster, labels = 2)