This is a description of how I’m improving the year quality in the Hathi Trust Bookworm using Hathi MARC records. It’s mostly of internal interest to the project, but some of the details about parsing dates from Hathi records may be useful for others using MARC records as a source of dates for text mining, or for anyone interested in the performance of the Google Ngrams “serial killer” algorithm. It also includes the first comparison of the Bookworm to the Google Ngram Viewer.
The major takeaways are:

- The new MARC-derived dates produce trend lines about as smooth as Google Ngrams, and considerably smoother than the old Bookworm dates.
- The Hathi Bookworm corpus is three to seven times larger than Google Ngrams for the same years before 1922.
- The Ngrams “serial killer” algorithm tags about a third of the Hathi corpus as serials; dropping them is a mixed blessing, since well-dated serials can respond to events more quickly than books.
knitr::opts_chunk$set(cache=TRUE)
library(bookworm)
library(ngramr)
library(dplyr)
library(tidyr)
As a first test, I’m going to look at changing patterns in the use of the word “evolution,” because it’s a word we know to have trended in particular directions. I’m limiting all charts to the period 1800-1922, because that is when we have the most books.
First I define some functions. I’ll be looking at four different sources:
1. ngrams: the Google Ngram Viewer
2. old_date: the old Bookworm date assignments
3. old_date_without_serials: the old date assignments, with serials excluded
4. new_date: the new date assignments derived from the Hathi MARC records

In all of these, I look only at books identified as being in English.
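For reference, here is a rough sketch of what my four_methods helper might look like; this is a sketch under assumptions, not the helper I actually use. It assumes the bookworm() query API shown later in this post and ngramr’s ngram(), and the real helper also builds the old_date and old_date_without_serials series from their respective date fields.

# A rough sketch, not the actual helper: pull the new_date series from
# the Bookworm API and the ngrams series from ngramr, then stack them
# with a `method` column.
four_methods_sketch = function(words) {
  bw = bookworm(query=list(
    database="hathipd",
    counttype=list("WordCount","WordsPerMillion"),
    words_collation="Case_Sensitive",
    search_limits=list("word"=as.list(words),
                       "new_date"=list("$lte"=1922,"$gte"=1800),
                       "languages"=list("English")),
    groups=list("new_date"))) %>%
    mutate(year=new_date, method="new_date")
  # ngramr reports each word's share of all words; rescale to per million.
  ng = ngram(words, corpus="eng_2012", year_start=1800, year_end=1922) %>%
    group_by(Year) %>%
    summarize(WordsPerMillion=sum(Frequency)*1e6) %>%
    transmute(year=Year, WordsPerMillion, method="ngrams")
  bind_rows(bw, ng)
}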
Here are the trend lines for each one of these.
a = four_methods("evolution")
library(ggplot2)
ggplot(a) + geom_line(aes(x=year,y=WordsPerMillion,color=method)) + facet_wrap(~method)
The new line is less noisy. We can quantify this by rescaling each trend line into standard-deviation units and fitting a loess curve that captures the overall trend. The intuition here is that word usage shouldn’t swing year to year, but should instead follow an overall trend; the farther a series strays from its loess fit, the more suspect it is. Here’s a plot with the trend lines:
scaled = a %>% group_by(method) %>% mutate(scaled = (WordsPerMillion - mean(WordsPerMillion))/sd(WordsPerMillion))
ggplot(scaled) + aes(x=year,y=scaled,color=method) + geom_line() + geom_smooth(method="loess",span=.2) + facet_wrap(~method)
The average residual is 12.2% on the old data with serials excluded (35% if you don’t exclude serials, which the main plot currently doesn’t): it’s 8.33% in Google Ngrams, and 8.26% in the new date assignments.
scaled %>%
  do(model = loess(scaled ~ year, data = ., span = .2)) %>%
  mutate(noise = model %>% summary %>% residuals %>% abs %>% mean %>% unlist) %>%
  ggplot() +
  geom_bar(aes(y = noise, x = reorder(method, noise)), stat = "identity") +
  coord_flip() +
  labs(title = "Average residual size from long-term trend\n(lower is better)")
By this metric, the new dates and Google Ngrams are roughly tied for the best system. But there’s a striking difference between the two: new_date shows a usage rate about half that of Ngrams. Moreover, Ngrams keeps rising after 1900, while new_date levels off.
best = a %>% filter(method %in% c("ngrams","new_date"))
ggplot(best) + geom_line(aes(x=year,y=WordsPerMillion,color=method))
There are a few reasons the Hathi rates might be lower.
One is that the Hathi Bookworm corpus is simply different from the Ngrams corpus. Some major contributors to Google Books, such as the Bodleian Library, are not Hathi consortium members; conversely, Google Books does not include the Hathi books scanned by the Internet Archive or Cornell (about 10% of the total).
More significantly, the Google Ngrams paper excludes large numbers of books that Hathi includes, for three reasons in particular:

1. It drops books it cannot confidently identify as being in English.
2. It drops books with low-quality OCR.
3. It drops serials, using its “serial killer” algorithm.
The net result of all these effects is that the Hathi Bookworm corpus is much larger than Google Ngrams for the period before 1922. Google Ngrams is around 1.25 billion words per year in the 1910s; the Hathi Bookworm is around 7.9 billion.
(You see a huge spike in the Hathi Bookworm around 1815; I believe that’s the Congressional Register, still mostly misdated. There is room for improvement.)
best = a %>% filter(method %in% c("ngrams","new_date")) %>%
mutate(rate=WordsPerMillion/1000000) %>%
mutate(totalWords = WordCount/rate)
best %>%
  select(method, year, totalWords) %>%
  spread(method, totalWords) %>%
  mutate(ratio = new_date/ngrams) %>%
  ggplot() +
  geom_line(aes(x = year, y = ratio)) +
  scale_y_log10("Ratio of Hathi Bookworm total words to Google Ngrams total words",
                breaks = c(3, 4, 5, 6, 7, 10, 15)) +
  labs(title = "The Hathi Trust Bookworm corpus ranges from three\nto seven times larger than Google Ngrams for the same years")
best %>% filter(year > 1910) %>% group_by(method) %>% summarize(averageBillionsOfWords=mean(totalWords/1e9))
## Source: local data frame [2 x 2]
##
## method averageBillionsOfWords
## (chr) (dbl)
## 1 new_date 7.897820
## 2 ngrams 1.269034
We can see how strong the language effect is by just putting in some foreign-language stopwords. A batch of short words common in German, French, Spanish, or Italian is about 4-6 times more common in Hathi than in Ngrams. That suggests there are more foreign-language texts, but not a truly tremendous number. All together, these words are 0.2% of the corpus; “the” is more like 8%. So it’s not as though 10% of the books are in French, or anything like that.
stopwords = four_methods(c("der","de","et","en","si","y","el","er")) %>% filter(method %in% c("ngrams","new_date"))
ggplot(stopwords) + geom_line(aes(x=year,y=WordsPerMillion,color=method)) + labs(title="Hathi Bookworm uses words like `der`, `et`, `y`, and `el`\nabout 5x more often than Google Ngrams")
I don’t have the capacity to check for bad OCR in a useful way.
A slightly less obvious question involves the restriction list in the “serial killer” algorithm that Ngrams uses to eliminate serials. As far as I can tell, that algorithm operates primarily by eliminating publications that have corporate authors. (It checks whether the author field includes any state names, words like “Association”, “Committee”, or “the”, and so forth.) This is quite a wide net: it removes about one in three of the books we have in Hathi.
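As a purely illustrative sketch (the trigger words below come from the description above, not from the actual Ngrams restriction list), the heuristic amounts to something like this:

# Illustrative only: flag an author string as a corporate/serial author
# if it contains any trigger word; the real restriction list is longer.
guess_serial = function(author) {
  triggers = c("Association", "Committee", "\\bthe\\b", state.name)
  pattern = paste(triggers, collapse="|")
  ifelse(grepl(pattern, author, ignore.case=TRUE), "serial", "book")
}
guess_serial("Modern Language Association")  # "serial"
guess_serial("Trollope, Anthony")            # "book"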
split_by_guess = bookworm(query=list(
database="hathipd",
counttype = list("WordCount","WordsPerMillion"),
words_collation="Case_Sensitive",
search_limits=list("word"=list("evolution"),"new_date"=list("$lte"=1922,"$gte"=1800),
"languages"=list("English")),
groups=list("new_date","serial_killer_guess"))) %>% mutate(year=new_date)
split_by_guess %>% select(WordCount,serial_killer_guess,year) %>%
spread(serial_killer_guess,WordCount) %>%
group_by(year) %>%
summarize(percent_serial = 100*serial/(book + serial)) %>%
ggplot() + geom_line(aes(x=year,y=percent_serial)) +
labs(title="The Serial Killer algorithm tags\nabout a third of the Hathi Trust Bookworm as serials,\nwhich is a lot.")
With our new, better dates, we can see that the serial killer algorithm may be helping even here, by eliminating some articles dated before Darwin’s publication. But on the other hand, there is a significant difference in the curves after 1900: usage keeps an upward trajectory in books, but drops in serials.
ggplot(split_by_guess) + geom_line(aes(x=year,y=WordsPerMillion,color=serial_killer_guess)) +
labs(title="The titles that Ngrams' 'serial killer' algorithm eliminates use 'evolution'\nless often after 1900 than those it doesn't eliminate ",y="Uses of `evolution` per million words")
So: one obvious question is whether it’s good to lose the serials or not. If serials tend to be misdated, they should be removed. (That’s why they’re not in Ngrams.) But monographs take years to prepare and are often reprints, while serials more often print texts actually written in the year in question. So if serials are well dated, they should produce better trends.
If we put in the name of a British political figure like “Disraeli,” we can see that the curve produced by serial publications alone responds more quickly to Disraeli’s emergence as a major political figure in the late 1840s, spikes higher in his periods as prime minister between 1868 and 1880, and declines faster after his death in 1881.
split_by_guess = bookworm(query=list(
database="hathipd",
counttype = list("WordCount","WordsPerMillion"),
words_collation="Case_Sensitive",
search_limits=list("word"=list("Disraeli"),"new_date"=list("$lte"=1922,"$gte"=1800),
"languages"=list("English")),
groups=list("new_date","serial_killer_guess"))) %>% mutate(year=new_date)
ggplot(split_by_guess) + geom_line(aes(x=year,y=WordsPerMillion,color=serial_killer_guess)) +
labs(title="Serials show a dropoff for the word 'Disraeli' after he stops being politically relevant fast than books",y="Uses of `Disraeli` per million words")
All of those seem roughly like good things. But it’s more of a mixed bag for American figures.
split_by_guess = bookworm(query=list(
database="hathipd",
counttype = list("WordCount","WordsPerMillion","TotalWords","TextCount","TotalTexts"),
words_collation="Case_Sensitive",
search_limits=list("word"=list("McKinley"),"new_date"=list("$lte"=1922,"$gte"=1800),
"languages"=list("English")),
groups=list("new_date","serial_killer_guess","format")))
split_by_guess = split_by_guess %>%
  filter(format %in% c("Book","Serial")) %>%
  mutate(format = paste("MARC indicates", format),
         serial_killer_guess = paste("Serial Killer guesses", serial_killer_guess))
ggplot(split_by_guess %>%
         group_by(serial_killer_guess, new_date) %>%
         summarize(WordsPerMillion = 1000000*sum(WordCount)/sum(TotalWords))) +
  geom_line(aes(x=new_date, y=WordsPerMillion, color=serial_killer_guess)) +
  labs(title="The Serial Killer filters out many early\nmentions of 'McKinley' before 1890,\nalthough some of those are real people",
       y="Uses of `McKinley` per million words")
Just to complicate things further, below is a four-panel grid of trend lines for format (monograph/serial) as derived through two different means: the serial killer algorithm, and the elements of the MARC record that indicate whether a record is a serial or a monograph. Each quadrant is an interaction between the two: the upper left is the intersection of MARC saying monograph and the serial killer saying monograph, the lower left is MARC saying monograph but the serial killer saying serial, and so forth.
For each one, I’ve overlaid the total number of texts in each bin.
The takeaways from this chart are:
counts = split_by_guess %>% group_by(serial_killer_guess,format) %>% summarize(texts = (sum(TotalTexts)/1000) %>% round(0) %>% paste("K"))
ggplot(split_by_guess) +
  geom_line(aes(x=new_date, y=WordsPerMillion)) +
  labs(title="Trendlines for 'McKinley' across serials and monographs,\nas determined by Google Ngrams and MARC records",
       y="Uses of `McKinley` per million words") +
  facet_grid(serial_killer_guess~format) +
  geom_text(data=counts, x=1860, y=10, aes(label=texts), size=20, alpha=.2)
This suggests that we could further refine our dates by combining the serial killer guess with the MARC format information: for instance, trusting a date only when both sources agree that a text is a monograph, OR when both agree that it is a serial. But I’m not going to fully investigate this strategy right now.
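For concreteness, here is a purely hypothetical sketch of such a filter, assuming a per-record metadata frame (called `metadata` here, with one row per record) that carries both signals:

# Hypothetical sketch: trust a record's date only when the MARC format
# and the serial killer guess agree. `metadata` is an assumed frame;
# the conditions actually worth using would need more investigation.
trusted = metadata %>%
  filter((format == "Book"   & serial_killer_guess == "book") |
         (format == "Serial" & serial_killer_guess == "serial"))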