I think that a probabilistic, neural-network-based classifier can usefully find errors in the data.
Here’s the data from the classifier. In my terminology, guess is the class output, and real is the classification in the data according to MARC field 008. Importantly, that doesn’t include multiple languages; it forces every document to be just one.
prob_of_guess and prob_of_real are the probabilities from the softmax layer of the NN.
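For reference, here’s a minimal sketch of the data frame I’m working with, just peeking at the first ten rows (the column names match the tables printed below):

library(dplyr)

# One row per HathiTrust volume: the catalog language (`real`), the network's
# guess, the softmax probabilities for each, and whether the two agree.
languages %>%
  select(htid, real, guess, prob_of_real, prob_of_guess, correct) %>%
  head(10)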
For all of these first ten books, the classifier and the catalog agree.
The first pass at the real-data lookup seems to have excluded ids with a + or = in the bookworm filename. Oops. That’s about 451,000 volumes.
The overall rate of agreement between the two classifiers (the neural net and the catalogers) is 96.5%.
The basic thing I’m going to be saying below is: it’s actually higher, in ways we can disentangle.
languages %>% group_by(correct) %>% summarize(n()/nrow(languages))
## Source: local data frame [2 x 2]
##
## correct n()/nrow(languages)
## <chr> <dbl>
## 1 False 0.03513312
## 2 True 0.96486688
Accuracy differs across languages. I told the classifier to learn only the 69 most common languages, so languages below that cutoff are not shown in the precision charts.
Almost nothing is classed as und, for example, which is good.
The classifier also punts on some of the distinctions among the Serbo-Croatian languages, whose codes seem to have been changed over time. It also doesn’t do well at distinguishing Modern from Ancient Greek (grc from gre). Also note that we’re losing some classifier accuracy by never predicting anything as “und.”
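To put a rough number on that last point (a sketch against the languages frame above): every book the catalog marks as undetermined is an automatic miss, since the classifier never guesses “und.”

# Share of all books whose catalog language is "und" -- an upper bound on the
# accuracy given up by never predicting it.
languages %>% summarize(share_und = mean(real == "und", na.rm = TRUE))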
get_precision_and_recall = function(languages) {
  # Recall: of the books the catalog says are in a language, how many the
  # classifier also assigns to that language.
  lang_recall = languages %>%
    group_by(real) %>%
    summarize(real_count = n(), recall = sum(as.logical(correct)) / n())
  # Precision: of the books the classifier assigns to a language, how many the
  # catalog agrees with.
  lang_precision = languages %>%
    group_by(guess) %>%
    summarize(guess_count = n(), precision = sum(as.logical(correct)) / n())
  # "f1" here is the geometric mean of recall and precision.
  acc_prec = lang_recall %>%
    inner_join(lang_precision, by = c("real" = "guess")) %>%
    mutate(f1 = sqrt(recall * precision))
  acc_prec %>% arrange(f1)
}
acc_prec = get_precision_and_recall(languages)
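One note on the metric: the f1 computed above is the geometric mean of precision and recall rather than the textbook F1, which is the harmonic mean. The two track each other closely here, but if you want the conventional version it’s a one-line change:

# Conventional F1: the harmonic mean of precision and recall.
acc_prec %>% mutate(f1_harmonic = 2 * precision * recall / (precision + recall))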
Here’s a plot of f1 by frequency. There are a lot of languages that appear only around 100 times in the training corpus. The worst performance on a largish language, one with about 500 samples, is Indonesian at an f1 of about .3.
acc_prec %>% ggplot() + geom_text(aes(label = real, x=real_count,y=f1)) + scale_x_log10() + labs(title="Accuracy by corpus size")
## Warning: Removed 1 rows containing missing values (geom_text).
I’m a little worried that the librarians sometimes get Indonesian wrong and confuse it with Hindi. Before I get to the point, let me just dig in there as one example of what the misclassifications look like for a single language.
total = nrow(languages)
# For each (guess, real) pair: the observed count, plus the count you'd expect
# if the guesses and the catalog labels were independent.
confusion = languages %>%
  group_by(real) %>% mutate(real_count = n()) %>%
  group_by(guess) %>% mutate(guess_count = n()) %>%
  group_by(guess, real, guess_count, real_count) %>%
  summarize(count = n(), expected = total * (guess_count[1]/total) * (real_count[1]/total))
confusion %>% filter(real=="ind") %>% arrange(-count)
## Source: local data frame [19 x 6]
## Groups: guess, real, guess_count [19]
##
## guess real guess_count real_count count expected
## <chr> <chr> <int> <int> <int> <dbl>
## 1 dut ind 40862 579 270 5.833190e+00
## 2 eng ind 2338274 579 118 3.337966e+02
## 3 ind ind 85 579 70 1.213404e-02
## 4 jav ind 160 579 53 2.284055e-02
## 5 fre ind 481205 579 23 6.869366e+01
## 6 ger ind 547119 579 16 7.810311e+01
## 7 spa ind 131695 579 7 1.879991e+01
## 8 rus ind 55621 579 6 7.940088e+00
## 9 jpn ind 48605 579 3 6.938530e+00
## 10 ota ind 5125 579 3 7.316113e-01
## 11 ara ind 6034 579 2 8.613741e-01
## 12 dan ind 19352 579 1 2.762564e+00
## 13 gre ind 4007 579 1 5.720129e-01
## 14 ita ind 107931 579 1 1.540752e+01
## 15 lat ind 96912 579 1 1.383452e+01
## 16 map ind 2 579 1 2.855068e-04
## 17 sun ind 4 579 1 5.710137e-04
## 18 swe ind 18537 579 1 2.646220e+00
## 19 NA ind 1823 579 1 2.602395e-01
So this is interesting. Out of 579 books, Indonesian is most often misclassed as Dutch, which is certainly something other than random chance; it must be that there’s a lot of Dutch text in the Indonesian volumes (not surprising, given Indonesia’s Dutch colonial history). Another 53 are classed as Javanese, which I don’t trust catalogers to always get right.
But here’s the real point. We can do much better on accuracy by using the classifier’s own probabilities. Every classification comes with a probability. Look how different those probabilities are for Indonesian and Thai (another non-western language of similar frequency, but with good f1 scores). Although the classifier is terrible at identifying Indonesian, it knows it’s making wild guesses.
languages %>% filter(real %in% c("ind","tha")) %>% ggplot() + geom_density(aes(x=prob_of_guess,fill=real),alpha=.3)
So, some cleanup. I’m getting rid of the smaller languages and of codes like “undetermined” that shouldn’t be counted against an accuracy score. I’m also eliminating the obsolete versions of Serbian and Croatian, as well as “map.” I’m tempted to drop Ancient Greek too, since catalogers may be inconsistent in applying it, but I won’t.
This ups our baseline accuracy from 96.5 to 97.1%.
# Keep only languages the classifier actually predicts (the common ones), and
# drop codes that shouldn't count against accuracy.
top_langs = languages %>% count(guess) %>% transmute(real = guess)
cleaner = languages %>%
  inner_join(top_langs) %>%
  filter(!real %in% c("und","zxx","srp","hrv","map")) %>%
  mutate(correct = as.logical(correct))
## Joining, by = "real"
sum(cleaner$correct)/nrow(cleaner)
## [1] 0.9713856
So let’s use these probabilities to cut the data. We can break it into different groups of certainty and see how the accuracy differs.
# Nine equal-sized bins of the corpus, by the classifier's confidence in its guess.
grouped_acc = cleaner %>%
  mutate(group = cut(prob_of_guess, quantile(prob_of_guess, probs = seq(0, 1, 1/9)), include.lowest = T)) %>%
  group_by(group) %>%
  summarize(accuracy = sum(as.logical(correct))/n(), errors = sum(!as.logical(correct))) %>%
  mutate(share_of_errors = errors/sum(errors))
grouped_acc
## Source: local data frame [9 x 4]
##
## group accuracy errors share_of_errors
## <fctr> <dbl> <int> <dbl>
## 1 [0.04437,0.9079] 0.8029689 88198 0.765082973
## 2 (0.9079,0.9741] 0.9729736 12098 0.104945393
## 3 (0.9741,0.9883] 0.9868442 5889 0.051084760
## 4 (0.9883,0.9938] 0.9916851 3722 0.032286887
## 5 (0.9938,0.9965] 0.9946900 2377 0.020619540
## 6 (0.9965,0.9979] 0.9968881 1393 0.012083727
## 7 (0.9979,0.9988] 0.9981191 842 0.007304019
## 8 (0.9988,0.9993] 0.9988875 498 0.004319954
## 9 (0.9993,1] 0.9994147 262 0.002272747
Accuracy is only about 80% in the least certain of the nine bins, but over 97% even in the second bin; in the top bin it is 99.9%. More than three-quarters of all errors fall in the bottom bin.
So what’s in that bottom bin? I don’t know. From what I’ve seen, sheet music, handwriting, and the like. Plus a lot of bilingual stuff.
Taking six random items:
set.seed(pi)
cleaner %>% filter(prob_of_guess < .9,!correct) %>% sample_n(6)
## Source: local data frame [6 x 6]
##
## correct guess htid prob_of_guess prob_of_real real
## <lgl> <chr> <chr> <dbl> <dbl> <chr>
## 1 FALSE eng mdp.39015035576704 0.3527418 0.29048321 ger
## 2 FALSE lat hvd.hnyzka 0.2566990 0.09871665 fre
## 3 FALSE fre uc1.b3732666 0.5922508 0.05679197 ita
## 4 FALSE lat umn.31951001256574i 0.4161624 0.07866232 eng
## 5 FALSE chi keio.10811916286 0.3200729 0.22133885 jpn
## 6 FALSE chi keio.10812742303 0.4859906 0.46747908 jpn
Now, ideally, there would be a firm cutoff here where accuracy stopped improving.
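One hedged way to look for such a cutoff (a sketch, reusing the cleaner frame from above): sweep a grid of confidence thresholds and watch where the accuracy of the retained documents levels off.

# For each cutoff, the share of the corpus we would keep and the accuracy among
# the kept documents.
cutoffs = seq(0.5, 0.99, by = 0.05)
data.frame(
  cutoff   = cutoffs,
  kept     = sapply(cutoffs, function(p) mean(cleaner$prob_of_guess >= p)),
  accuracy = sapply(cutoffs, function(p) mean(cleaner$correct[cleaner$prob_of_guess >= p]))
)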
We can also cut in the opposite direction: by the documents that appear least likely to be in the attributed language. I haven’t checked all of these, but I bet they’re all wrong.
cleaner %>% arrange(prob_of_real)
## Source: local data frame [4,028,710 x 6]
##
## correct guess htid prob_of_guess prob_of_real real
## <lgl> <chr> <chr> <dbl> <dbl> <chr>
## 1 FALSE por mdp.39015079026673 0.9442438 3.754634e-10 haw
## 2 FALSE ger nnc1.0037104187 0.9952642 9.436549e-10 chi
## 3 FALSE ita hvd.hx5t2r 0.9884364 1.176679e-09 arm
## 4 FALSE eng uc1.31175033912513 0.9990213 1.757927e-09 haw
## 5 FALSE ita inu.32000003301183 0.9960058 2.679585e-09 nor
## 6 FALSE eng mdp.39015065629183 0.9983971 2.879796e-09 haw
## 7 FALSE hun mdp.39015024455613 0.9985279 3.170034e-09 yid
## 8 FALSE eng mdp.39015070483618 0.9990039 3.758782e-09 cze
## 9 FALSE eng hvd.32044051131019 0.9980510 5.360801e-09 por
## 10 FALSE ita hvd.hnpn4g 0.9980879 7.190128e-09 ota
## .. ... ... ... ... ... ...
The space that these errors occupy (probability of the guess plotted against probability of the cataloged language) is fairly evenly populated.
set.seed(pi)
cleaner %>% filter(!correct) %>% sample_n(10000) %>% ggplot() + geom_point(aes(x=prob_of_guess,y=prob_of_real,color=cut(prob_of_guess/prob_of_real,quantile(prob_of_guess/prob_of_real)))) + scale_color_discrete("ratio of guess\nto prob")
The most likely to be correct are those in the lower right of that triangle. The whole purple slice has some advantages. A .5-.5 split is basically useless. So I’ll rank goodness by nearness to the lower-right corner; that is, by how far the probability of the guess exceeds the probability of the cataloged language.
A threshold of .66 divides up the triangle like so: it flags for metadata correction about 36% of the books where the two disagree, or about 1% of the full corpus.
keep_threshold = .66
cleaner %>% filter(!correct) %>% mutate(score = prob_of_guess - prob_of_real) %>% mutate(keep = score>keep_threshold) %>% sample_n(10000) %>% ggplot() + geom_point(aes(x=prob_of_guess,y=prob_of_real,color=keep))
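As a sanity check on the shares quoted above, a small sketch reusing cleaner and keep_threshold from the chunks above:

# Books where the two disagree and the score clears the threshold: these are
# the candidates for metadata correction.
flagged = cleaner %>%
  filter(!correct) %>%
  mutate(score = prob_of_guess - prob_of_real) %>%
  filter(score > keep_threshold)
nrow(flagged) / sum(!cleaner$correct)  # share of the disagreements (~36% in the text)
nrow(flagged) / nrow(cleaner)          # share of the whole cleaned corpus (~1%)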
So, let’s say we trust the classifier on documents right along the line: what do they look like? Here are the ten books closest to the line.
cleaner %>% filter(!correct) %>% mutate(score = prob_of_guess - prob_of_real) %>% arrange(abs(score-keep_threshold))
## Source: local data frame [115,279 x 7]
##
## correct guess htid prob_of_guess prob_of_real real
## <lgl> <chr> <chr> <dbl> <dbl> <chr>
## 1 FALSE chi keio.10811738256 0.7796289 0.11960491 jpn
## 2 FALSE fre hvd.32044102855764 0.7892624 0.12929015 eng
## 3 FALSE swe uiug.30112111043250 0.6804764 0.02044670 dan
## 4 FALSE ger hvd.32044098633035 0.7135932 0.05362957 eng
## 5 FALSE nor hvd.hnktsk 0.7898297 0.12978232 dan
## 6 FALSE chi mdp.39015085456450 0.8006454 0.14057972 jpn
## 7 FALSE ger coo.31924092555899 0.7846931 0.12462401 eng
## 8 FALSE dan wu.89099518003 0.7885745 0.12850158 nor
## 9 FALSE fre uc1.$b614729 0.7429235 0.08284323 ger
## 10 FALSE chi mdp.39015031104360 0.8108868 0.15080568 jpn
## .. ... ... ... ... ... ...
## Variables not shown: score <dbl>.
Redoing it with a slightly different balance on the triangle. This keeps roughly the same number of books: maybe a thousand or so fewer.
Note this solves the Chinese-Japanese problem, in part, without needing more help.
keep_threshold = .66
cleaner %>% filter(!correct) %>% mutate(score = .6 + .2 * prob_of_guess - 1.5 * prob_of_real) %>% mutate(keep = score>keep_threshold) %>% sample_n(10000) %>% ggplot() + geom_point(aes(x=prob_of_guess,y=prob_of_real,color=keep))
cleaner %>% filter(!correct) %>% mutate(score = .6 + .2 * prob_of_guess - 1.5 * prob_of_real) %>% arrange(abs(score-keep_threshold))
## Source: local data frame [115,279 x 7]
##
## correct guess htid prob_of_guess prob_of_real real
## <lgl> <chr> <chr> <dbl> <dbl> <chr>
## 1 FALSE ger hvd.32044040725210 0.6386160 0.04515094 NA
## 2 FALSE ger hvd.32044098633027 0.7061159 0.05415489 eng
## 3 FALSE chi mdp.39015076511016 0.5194751 0.02925719 eng
## 4 FALSE fre mdp.39015074806434 0.5750073 0.03665995 ita
## 5 FALSE rus chi.100890189 0.5026760 0.02701118 eng
## 6 FALSE ger mdp.39015028559287 0.8131678 0.06840931 grc
## 7 FALSE ger uc1.$b65194 0.8434854 0.07244946 lat
## 8 FALSE eng inu.30000121192318 0.3807266 0.01074773 hun
## 9 FALSE ger mdp.39015037047753 0.8941310 0.07923603 eng
## 10 FALSE lat ucm.5316852786 0.5404590 0.03204010 spa
## .. ... ... ... ... ... ...
## Variables not shown: score <dbl>.
The classifier isn’t stupid. There’s some GIGO with OCR, and a bigger question of how to treat authentically bilingual texts. It may be too quick to put them in one bin or the other.
Working with training data that labeled multiple languages where they are present would help in two ways: for training, and for evaluation, because I could just filter those volumes out.
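That part is hypothetical for now, since the records here were collapsed to a single language; but assuming a list-column all_languages parsed from something like MARC field 041, the evaluation-side filter would be one line:

# Hypothetical: `all_languages` doesn't exist in this data; it stands in for a
# parsed field listing every language a cataloger recorded for a volume.
monolingual = cleaner %>% filter(lengths(all_languages) == 1)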
One last question is whether this can identify rare languages at all. I’ll check.
I’ll call a language rare if the classifier sees it fewer than 100 times.
# Rare = fewer than 100 volumes with that catalog label.
rare_langs = cleaner %>% count(real) %>% filter(n < 100)
# Disagreements where the classifier's guess is one of those rare languages.
cleaner %>% filter(!correct) %>%
  mutate(score = .6 + .2 * prob_of_guess - 1.5 * prob_of_real) %>%
  arrange(-score) %>%
  inner_join(rare_langs, by = c("guess" = "real"))
## Source: local data frame [16 x 8]
##
## correct guess htid prob_of_guess prob_of_real real
## <lgl> <chr> <chr> <dbl> <dbl> <chr>
## 1 FALSE wen mdp.39015051333717 0.63915646 0.02930553 pol
## 2 FALSE wen umn.319510017473174 0.42223975 0.03541834 pol
## 3 FALSE zul hvd.hwsirs 0.19844788 0.01017261 xho
## 4 FALSE zul hvd.hnjlpg 0.21070401 0.01273876 xho
## 5 FALSE zul hvd.hnjlpf 0.19353068 0.01793450 xho
## 6 FALSE zul hvd.hwsir4 0.12830934 0.01328732 xho
## 7 FALSE zul nyp.33433081990966 0.20354944 0.02348088 xho
## 8 FALSE haw mdp.39015063636750 0.14722157 0.01608531 dut
## 9 FALSE wen mdp.39015070301570 0.32012188 0.04486233 pol
## 10 FALSE sun coo.31924024024949 0.14970712 0.04796201 ind
## 11 FALSE cho uiug.30112064601765 0.13041344 0.04819134 eng
## 12 FALSE cho mdp.39015019906687 0.15389207 0.05984684 eng
## 13 FALSE haw mdp.39015073382700 0.09886745 0.06240880 eng
## 14 FALSE cho mdp.39015019906695 0.11666972 0.08354449 eng
## 15 FALSE mar nnc1.cu58912746 0.24460500 0.20618893 eng
## 16 FALSE wen hvd.hneb6d 0.29476866 0.23436643 cze
## Variables not shown: score <dbl>, n <int>.
Only 16 cases in the whole set where it overrides the catalog in favor of one of these rare languages. But in every one that I’ve looked at, it has a case; these do seem to be Zulu rather than Xhosa, Sorbian rather than Polish, and Choctaw rather than English.
The thing that makes these decisions stick out, presumably, is that these are all languages where the classifier was at least decent to begin with (f1 around .5 in the accuracy scores). That there are so many errors readily visible (three Choctaw bibles mislabeled as English, for example) makes the success of that initial classification all the more impressive. (There are only 26 books in Choctaw in the original set; that the classifier is confident in finding three more is kind of remarkable, or would be if they weren’t bibles.)
I didn’t initially save the full prediction matrices, but it might be worthwhile to use the classifier to find other volumes that slip just below the threshold and might be in Choctaw. This could be framed as a cultural recovery project.
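Even with only the saved top guesses, a first pass is possible. Here’s a sketch (reusing cleaner, the score, and keep_threshold from above) of the volumes the classifier leans toward Choctaw that didn’t clear the threshold:

# Disagreements guessed as Choctaw that fell below the keep threshold --
# candidates for a manual look.
cleaner %>%
  filter(guess == "cho", !correct) %>%
  mutate(score = .6 + .2 * prob_of_guess - 1.5 * prob_of_real) %>%
  filter(score <= keep_threshold) %>%
  arrange(-score)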