I think that a probabilistic, neural-network-based classifier can usefully find errors in the data.

Here’s the data from the classifier. In my terminology, guess is the class output, and real is the classification in the data according to MARC field 008. Importantly, that field doesn’t allow multiple languages; it forces every document to be just one.

Prob_of_guess and prob_of_real are the probabilities from the softmax layer of the NN.
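For concreteness, here’s a minimal sketch of what a softmax does; this is nothing specific to the actual network, just the standard definition:

# a sketch: softmax turns a vector of raw scores (logits) into
# probabilities that sum to one
softmax = function(logits) {
  exps = exp(logits - max(logits))  # subtract the max for numerical stability
  exps / sum(exps)
}
# prob_of_guess is the largest entry of that vector; prob_of_real is the
# entry for the language the MARC metadata claims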

All of these first ten books have been classified together.

The first pass at the real data lookup seems to have excluded ids with a + or = in the bookworm filename. Oops. That’s about 451,000 ids.

The overall rate of agreement between the two classifications, neural net and catalog metadata, is 96.5%.

The basic thing I’m going to be saying below is: the true accuracy is actually higher, in ways we can disentangle.

languages %>% group_by(correct) %>% summarize(n()/nrow(languages))
## Source: local data frame [2 x 2]
## 
##   correct n()/nrow(languages)
##     <chr>               <dbl>
## 1   False          0.03513312
## 2    True          0.96486688

Accuracy differs across languages. I told the classifier to learn only the 69 most common languages, so rarer languages are not shown in the precision charts.
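As a sketch of how that list might be reconstructed (an assumption on my part; the actual training setup isn’t shown here), take the 69 most frequent codes in the metadata:

# a sketch (assumption): the 69 most common language codes in the catalog
top_69 = languages %>% count(real, sort = TRUE) %>% head(69)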

Almost nothing is classed as und, for example, which is good.

The classifier also punts on some of the distinctions among the Serbo-Croatian languages, whose codes seem to have been changed at some point. It also doesn’t do well at distinguishing Modern from Ancient Greek (gre from grc). Also note that we’re losing some classifier accuracy by never predicting anything as “und.”

get_precision_and_recall = function(languages) {
  lang_recall = languages %>% 
    group_by(real) %>% 
    summarize(real_count=n(),recall=sum(as.logical(correct))/n())
    
  lang_precision = languages %>%
    group_by(guess) %>%
    summarize(guess_count=n(),precision=sum(as.logical(correct))/n())
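  # nb: the "f1" computed below is the geometric mean of precision and
  # recall, not the usual harmonic-mean F1; it ranks languages similarly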
  
  acc_prec = lang_recall %>% inner_join(lang_precision, by=c("real"="guess")) %>%
    mutate(f1 = sqrt(recall*precision))
  
  acc_prec %>% arrange(f1)
}

acc_prec = get_precision_and_recall(languages)

Here’s a plot of f1 by frequency. There are a lot of languages that appear only about 100 times in the training corpus. The worst performance on a largish language, one with more than 500 samples, is Indonesian at about .3.

acc_prec %>% ggplot() + geom_text(aes(label = real, x=real_count,y=f1)) + scale_x_log10() + labs(title="Accuracy by corpus size")
## Warning: Removed 1 rows containing missing values (geom_text).

I’m a little worried the librarians’ Indonesian labels are sometimes simply wrong, perhaps confused with Hindi. Before I get to the point, let me just dig in there as one example of what the misclassifications look like for a single language.

total = nrow(languages)
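# the "expected" column below is the count you'd see by chance if guess
# and real were independent: total * P(guess) * P(real)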

confusion = languages %>% group_by(real) %>% mutate(real_count = n()) %>% group_by(guess) %>%
  mutate(guess_count = n()) %>% group_by(guess,real,guess_count,real_count) %>% summarize(count=n(),expected = total * (guess_count[1]/total) * (real_count[1]/total))


confusion %>% filter(real=="ind") %>% arrange(-count)
## Source: local data frame [19 x 6]
## Groups: guess, real, guess_count [19]
## 
##    guess  real guess_count real_count count     expected
##    <chr> <chr>       <int>      <int> <int>        <dbl>
## 1    dut   ind       40862        579   270 5.833190e+00
## 2    eng   ind     2338274        579   118 3.337966e+02
## 3    ind   ind          85        579    70 1.213404e-02
## 4    jav   ind         160        579    53 2.284055e-02
## 5    fre   ind      481205        579    23 6.869366e+01
## 6    ger   ind      547119        579    16 7.810311e+01
## 7    spa   ind      131695        579     7 1.879991e+01
## 8    rus   ind       55621        579     6 7.940088e+00
## 9    jpn   ind       48605        579     3 6.938530e+00
## 10   ota   ind        5125        579     3 7.316113e-01
## 11   ara   ind        6034        579     2 8.613741e-01
## 12   dan   ind       19352        579     1 2.762564e+00
## 13   gre   ind        4007        579     1 5.720129e-01
## 14   ita   ind      107931        579     1 1.540752e+01
## 15   lat   ind       96912        579     1 1.383452e+01
## 16   map   ind           2        579     1 2.855068e-04
## 17   sun   ind           4        579     1 5.710137e-04
## 18   swe   ind       18537        579     1 2.646220e+00
## 19    NA   ind        1823        579     1 2.602395e-01

So this is interesting. Out of 579 books, Indonesian is most often misclassed as Dutch: 270 times, where random chance would predict about 6. That is certainly something other than chance; it must be that there’s a lot of Dutch text in the books cataloged as Indonesian. Another 53 are classed as Javanese, which I don’t trust catalogers to always get right.

Getting to the point.

But here’s the real point. We can do much better than the raw accuracy by using the classifier’s probabilities. Every classifier estimate comes with a probability. Look how different those are for Indonesian and Thai, a non-western language of similar frequency with good f1 scores. Although the classifier is terrible at identifying Indonesian, it knows it’s making wild guesses.

languages %>% filter(real %in% c("ind","tha")) %>% ggplot() + geom_density(aes(x=prob_of_guess,fill=real),alpha=.3)

So, some cleanup. I’m getting rid of smaller languages and things like “undetermined” that shouldn’t be counted against an accuracy score. I’m also going to eliminate the obsolete codes for Serbian and Croatian, as well as “map” (Austronesian, other). I’m tempted to kill Ancient Greek, since catalogers may be inconsistent in applying it, but won’t.

This ups our baseline accuracy from 96.5% to 97.1%.

top_langs = languages %>% count(guess) %>% transmute(real = guess)
cleaner = languages %>% inner_join(top_langs) %>% filter(!real %in% c("und","zxx","srp","hrv","map")) %>% mutate(correct = as.logical(correct))
## Joining, by = "real"
sum(cleaner$correct)/nrow(cleaner)
## [1] 0.9713856

So let’s use this to cut out the least trustworthy classifications.

We can break the documents into groups of certainty and see how the accuracy differs; the cut below makes nine equal-sized groups on prob_of_guess.

grouped_acc = 
  cleaner %>% mutate(group = cut(prob_of_guess,quantile(prob_of_guess,probs=seq(0,1,1/9)),include.lowest = T)) %>%
  group_by(group) %>% 
  summarize(accuracy = sum(as.logical(correct))/n(),errors = sum(!as.logical(correct))) %>% mutate(share_of_errors = errors/sum(errors))
grouped_acc
## Source: local data frame [9 x 4]
## 
##              group  accuracy errors share_of_errors
##             <fctr>     <dbl>  <int>           <dbl>
## 1 [0.04437,0.9079] 0.8029689  88198     0.765082973
## 2  (0.9079,0.9741] 0.9729736  12098     0.104945393
## 3  (0.9741,0.9883] 0.9868442   5889     0.051084760
## 4  (0.9883,0.9938] 0.9916851   3722     0.032286887
## 5  (0.9938,0.9965] 0.9946900   2377     0.020619540
## 6  (0.9965,0.9979] 0.9968881   1393     0.012083727
## 7  (0.9979,0.9988] 0.9981191    842     0.007304019
## 8  (0.9988,0.9993] 0.9988875    498     0.004319954
## 9       (0.9993,1] 0.9994147    262     0.002272747

Accuracy is only 80% in the least-certain group, but over 97% even in the second group of nine. In the top group, accuracy is 99.9%. More than three-quarters of all errors are in the bottom group.

So what’s in that bottom group? I don’t know. From what I’ve seen, sheet music, handwriting, and the like. Plus a lot of bilingual stuff.

Taking six random items:

set.seed(pi)
cleaner %>% filter(prob_of_guess < .9,!correct) %>% sample_n(6)
## Source: local data frame [6 x 6]
## 
##   correct guess                htid prob_of_guess prob_of_real  real
##     <lgl> <chr>               <chr>         <dbl>        <dbl> <chr>
## 1   FALSE   eng  mdp.39015035576704     0.3527418   0.29048321   ger
## 2   FALSE   lat          hvd.hnyzka     0.2566990   0.09871665   fre
## 3   FALSE   fre        uc1.b3732666     0.5922508   0.05679197   ita
## 4   FALSE   lat umn.31951001256574i     0.4161624   0.07866232   eng
## 5   FALSE   chi    keio.10811916286     0.3200729   0.22133885   jpn
## 6   FALSE   chi    keio.10812742303     0.4859906   0.46747908   jpn
  1. A German-language serial that is mostly in English for this issue. Confusion warranted. It got some French, too.
  2. A French-language bibliography of Greek titles. Bibliographies are always hard, but there doesn’t seem to be much Latin here, although I’m sure there’s some. French is an accurate language choice.
  3. This journal is listed as something like five different languages in the metadata. Who knows.
  4. Actually neither English nor Latin, but Icelandic.
  5-6. The model hedges its bets between Chinese and Japanese pretty closely as a rule, since most documents from Keio are handwritten and have crummy OCR, but it seems usually to rate Chinese a little more likely.

Now, ideally, there would be a firm cutoff here where accuracy stopped improving.
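One way to look for such a cutoff (a sketch in base R, assuming the same tidyverse session as above) is to sweep candidate thresholds on prob_of_guess and plot accuracy among only the documents above each one:

# sweep thresholds; compute accuracy among documents above each one
thresholds = seq(.5, .99, by = .01)
accuracy_above = sapply(thresholds, function(t) {
  kept = cleaner$prob_of_guess > t
  sum(cleaner$correct[kept]) / sum(kept)
})
data.frame(threshold = thresholds, accuracy = accuracy_above) %>%
  ggplot() + geom_line(aes(x = threshold, y = accuracy))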

We can also cut in the opposite direction: by the documents that appear least likely to be in the attributed language. I haven’t checked all of these, but I bet they’re all wrong.

cleaner %>% arrange(prob_of_real)
## Source: local data frame [4,028,710 x 6]
## 
##    correct guess               htid prob_of_guess prob_of_real  real
##      <lgl> <chr>              <chr>         <dbl>        <dbl> <chr>
## 1    FALSE   por mdp.39015079026673     0.9442438 3.754634e-10   haw
## 2    FALSE   ger    nnc1.0037104187     0.9952642 9.436549e-10   chi
## 3    FALSE   ita         hvd.hx5t2r     0.9884364 1.176679e-09   arm
## 4    FALSE   eng uc1.31175033912513     0.9990213 1.757927e-09   haw
## 5    FALSE   ita inu.32000003301183     0.9960058 2.679585e-09   nor
## 6    FALSE   eng mdp.39015065629183     0.9983971 2.879796e-09   haw
## 7    FALSE   hun mdp.39015024455613     0.9985279 3.170034e-09   yid
## 8    FALSE   eng mdp.39015070483618     0.9990039 3.758782e-09   cze
## 9    FALSE   eng hvd.32044051131019     0.9980510 5.360801e-09   por
## 10   FALSE   ita         hvd.hnpn4g     0.9980879 7.190128e-09   ota
## ..     ...   ...                ...           ...          ...   ...

The space that these occupy is relatively evenly populated.

set.seed(pi)
cleaner %>% filter(!correct) %>% sample_n(10000) %>% ggplot() + geom_point(aes(x=prob_of_guess,y=prob_of_real,color=cut(prob_of_guess/prob_of_real,quantile(prob_of_guess/prob_of_real)))) + scale_color_discrete("ratio of guess\nto prob")

The most likely to be correct are those in the lower right of that triangle. The whole purple slice has some advantages. A .5-.5 split is basically useless. So I’ll rank goodness by nearness to the lower-right corner, that is, by how far the probability of the guess exceeds the probability of the cataloged language.

A threshold of .66 divides up the triangle like so: it flags for metadata improvement about 36% of the data marked incorrect, or about 1% of the full corpus.

keep_threshold = .66
cleaner %>% filter(!correct) %>% mutate(score = prob_of_guess - prob_of_real) %>% mutate(keep = score>keep_threshold) %>%  sample_n(10000) %>% ggplot() + geom_point(aes(x=prob_of_guess,y=prob_of_real,color=keep))
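As a quick check on those shares (a sketch, using the same score as in the plot above):

flagged = cleaner %>% filter(!correct) %>%
  mutate(score = prob_of_guess - prob_of_real) %>%
  filter(score > keep_threshold)
nrow(flagged) / sum(!cleaner$correct)  # share of the disagreements flagged
nrow(flagged) / nrow(cleaner)          # share of the full corpus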

So, let’s say we trust the classifier on documents right along the line: what do they look like? Here are the ten books closest to the line.

cleaner %>% filter(!correct) %>% mutate(score = prob_of_guess - prob_of_real) %>% arrange(abs(score-keep_threshold))
## Source: local data frame [115,279 x 7]
## 
##    correct guess                htid prob_of_guess prob_of_real  real
##      <lgl> <chr>               <chr>         <dbl>        <dbl> <chr>
## 1    FALSE   chi    keio.10811738256     0.7796289   0.11960491   jpn
## 2    FALSE   fre  hvd.32044102855764     0.7892624   0.12929015   eng
## 3    FALSE   swe uiug.30112111043250     0.6804764   0.02044670   dan
## 4    FALSE   ger  hvd.32044098633035     0.7135932   0.05362957   eng
## 5    FALSE   nor          hvd.hnktsk     0.7898297   0.12978232   dan
## 6    FALSE   chi  mdp.39015085456450     0.8006454   0.14057972   jpn
## 7    FALSE   ger  coo.31924092555899     0.7846931   0.12462401   eng
## 8    FALSE   dan      wu.89099518003     0.7885745   0.12850158   nor
## 9    FALSE   fre        uc1.$b614729     0.7429235   0.08284323   ger
## 10   FALSE   chi  mdp.39015031104360     0.8108868   0.15080568   jpn
## ..     ...   ...                 ...           ...          ...   ...
## Variables not shown: score <dbl>.
  1. Chinese for Japanese. I’m going to drop this whole category.
  2. Classifier correct: almost entirely in French, but with an English language title (“Progressive French reader”). Scholarly apparatus in English.
  3. Classifier correct: Sure looks more like Swedish to me. I find a .68 to .02 split more convincing than a .8-.2 one, so I may change the cutoff around to slope down a bit more.
  4. Classifier correct: Metadata says bilingual, but this looks like German. There’s a fair amount of Greek and Latin in the examples, as well.
  5. Classifier correct? I can’t tell the difference between Danish and Norwegian, but the author is listed online as being Norwegian, and the publication place is Kristiania.
  6. Chinese-Japanese again.
  7. Classifier correct. German language volume in multilingual series, it seems.
  8. Classifier probably wrong. I don’t even know if Danish and Norwegian are different languages at this point. Google Translate recognizes the title as Danish, but based on the place of publication and one ambiguous reference in the literature I think this is probably Norwegian with Danish spelling, or something like that. So the bibliographers probably have it right.
  9. Bilingual, although the metadata calls it just German. A German play with (larger amounts of?) French in the introduction and conclusion.
  10. Chinese-Japanese again.

Redoing it with a slightly different balance on the triangle. This keeps roughly the same number of books: maybe a thousand or so fewer.

Note this solves the Chinese-Japanese problem, in part, without needing more help.

keep_threshold = .66
cleaner %>% filter(!correct) %>% mutate(score = .6 + .2 * prob_of_guess - 1.5 * prob_of_real) %>% mutate(keep = score>keep_threshold) %>% sample_n(10000) %>% ggplot() + geom_point(aes(x=prob_of_guess,y=prob_of_real,color=keep))

cleaner %>% filter(!correct) %>% mutate(score = .6 + .2 * prob_of_guess - 1.5 * prob_of_real) %>% arrange(abs(score-keep_threshold))
## Source: local data frame [115,279 x 7]
## 
##    correct guess               htid prob_of_guess prob_of_real  real
##      <lgl> <chr>              <chr>         <dbl>        <dbl> <chr>
## 1    FALSE   ger hvd.32044040725210     0.6386160   0.04515094    NA
## 2    FALSE   ger hvd.32044098633027     0.7061159   0.05415489   eng
## 3    FALSE   chi mdp.39015076511016     0.5194751   0.02925719   eng
## 4    FALSE   fre mdp.39015074806434     0.5750073   0.03665995   ita
## 5    FALSE   rus      chi.100890189     0.5026760   0.02701118   eng
## 6    FALSE   ger mdp.39015028559287     0.8131678   0.06840931   grc
## 7    FALSE   ger        uc1.$b65194     0.8434854   0.07244946   lat
## 8    FALSE   eng inu.30000121192318     0.3807266   0.01074773   hun
## 9    FALSE   ger mdp.39015037047753     0.8941310   0.07923603   eng
## 10   FALSE   lat     ucm.5316852786     0.5404590   0.03204010   spa
## ..     ...   ...                ...           ...          ...   ...
## Variables not shown: score <dbl>.
  1. Classifier correct. But ugh: sheet music, in German. I didn’t think there were supposed to be any NA values here.
  2. Classifier correct. Another issue of the same German philological journal, misplaced into English.
  3. Classifier correct. This is Chinese, not English.
  4. Classifier correct. This is French, not Italian. mdp.39015074806434
  5. Classifier is one-third right, metadata is entirely wrong. Seems to be partly in Russian (published in Moscow), but this is really a highly multilingual text; mostly French and German that I could see.
  6. Split decision. Actually mostly German, but there is a lot of Greek. (It’s an annotated Iliad.)
  7. Split decision: a mostly German annotated copy of a shorter original text in Latin. Probably a little overconfident.
  8. Classifier wrong; this is Hungarian. Atrocious OCR is the cause of the problem, I think: a typical line reads “Ηοεγ ιιιιιιάΘτι· ΚιτειεειεΙοττ ειΚειτειτ Ιιο22έιτιΚ ΙιειιοΙιοτι,” even though Hungarian is written in Roman characters. The .38 prob of guess for English, specifically, may be too low a probability to act on.
  9. Split decision. Edition of Heine in German with English apparatus.
  10. Classifier correct. This is a Latin edition of a Hebrew text. No Spanish that I see.

Conclusions

The classifier isn’t stupid. There’s some GIGO with OCR, and a bigger question of how to treat authentically bilingual texts. It may be too quick to put them in one bin or the other.

Working with training data that labeled multiple languages, where present, would help: (1) for training, and (2) for evaluation, because I could just filter the multilingual books out.
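A hypothetical sketch of the evaluation half: if each record carried its full MARC 041 language list (an assumed marc_041 column of space-separated codes, which this data does not have), the multilingual books could be set aside before scoring.

# hypothetical: marc_041 is NOT in the current data
monolingual = cleaner %>% filter(!grepl(" ", trimws(marc_041)))
sum(monolingual$correct) / nrow(monolingual)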

One last question

One last question is whether this can identify rare languages at all. I’ll check.

I’ll call a language rare if the classifier sees it fewer than 100 times.

rare_langs = cleaner %>% count(real) %>% filter(n<100)

cleaner %>% filter(!correct) %>%
  mutate(score = .6 + .2 * prob_of_guess - 1.5 * prob_of_real) %>%
  arrange(-score) %>% inner_join(rare_langs,by=c("guess" = "real"))
## Source: local data frame [16 x 8]
## 
##    correct guess                htid prob_of_guess prob_of_real  real
##      <lgl> <chr>               <chr>         <dbl>        <dbl> <chr>
## 1    FALSE   wen  mdp.39015051333717    0.63915646   0.02930553   pol
## 2    FALSE   wen umn.319510017473174    0.42223975   0.03541834   pol
## 3    FALSE   zul          hvd.hwsirs    0.19844788   0.01017261   xho
## 4    FALSE   zul          hvd.hnjlpg    0.21070401   0.01273876   xho
## 5    FALSE   zul          hvd.hnjlpf    0.19353068   0.01793450   xho
## 6    FALSE   zul          hvd.hwsir4    0.12830934   0.01328732   xho
## 7    FALSE   zul  nyp.33433081990966    0.20354944   0.02348088   xho
## 8    FALSE   haw  mdp.39015063636750    0.14722157   0.01608531   dut
## 9    FALSE   wen  mdp.39015070301570    0.32012188   0.04486233   pol
## 10   FALSE   sun  coo.31924024024949    0.14970712   0.04796201   ind
## 11   FALSE   cho uiug.30112064601765    0.13041344   0.04819134   eng
## 12   FALSE   cho  mdp.39015019906687    0.15389207   0.05984684   eng
## 13   FALSE   haw  mdp.39015073382700    0.09886745   0.06240880   eng
## 14   FALSE   cho  mdp.39015019906695    0.11666972   0.08354449   eng
## 15   FALSE   mar     nnc1.cu58912746    0.24460500   0.20618893   eng
## 16   FALSE   wen          hvd.hneb6d    0.29476866   0.23436643   cze
## Variables not shown: score <dbl>, n <int>.

Only 16 cases in the whole set where it thinks things are wrong. But in every one that I’ve looked at, it has a case; these do seem to be Zulu rather than Xhosa, Sorbian rather than Polish, and Choctaw rather than English.

The thing that makes these decisions stick out, presumably, is that these are all languages where the classifier was decent initially (f1 around .5 in the accuracy scores). That there are so many errors readily visible (three Choctaw bibles mislabeled as English, for example) suggests the true success of that initial classification is higher than measured. (There are only 26 books in Choctaw in the original set; that the classifier is confident in finding three more is kind of remarkable, or would be if they weren’t bibles.)

I didn’t initially save the full prediction matrices, but it might be worthwhile to use the classifier to find other volumes, just below the threshold, that might be in Choctaw. This could be framed as a cultural recovery project.
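With only the top guess saved, a sketch of that search (using the score and threshold from above): volumes where the classifier’s best guess was “cho” but the score fell just under the cutoff, so they were never flagged.

cleaner %>%
  filter(guess == "cho", !correct) %>%
  mutate(score = .6 + .2 * prob_of_guess - 1.5 * prob_of_real) %>%
  filter(score <= keep_threshold) %>%
  arrange(-score)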