library(magrittr)
library(dplyr)
library(quanteda)
library(ggplot2)
library(gridExtra)
library(stringr)
library(stargazer)
library(forcats)
library(DT)
knitr::opts_chunk$set(echo = TRUE)
load('data/ecco_tags.RData')
load('data/hathi_tags.RData')
load('data/bl_tags.RData')
bl_tags$dataset <- 'BL'
hathi_tags$dataset <- 'HATHI'
ecco_tags$dataset <- 'ECCO'
bl_tags <- filter(bl_tags, year < 1900)
comb <- bind_rows(bl_tags, hathi_tags, ecco_tags)

Three Book Collections, over Two Hundred Thousand Book Titles

How did the structure of book titles change over time? This post presents a descriptive analysis of metadata from digital book collections, paying particular attention to questions about titles raised in Franco Moretti’s 2009 article Style, Inc. Reflections on Seven Thousand Titles (British Novels, 1740–1850)1.

Eighteenth Century Collections Online2 (ECCO), the British Library (BL) 19th Century Books corpus3, and the HathiTrust Digital Library metadata4 together provide metadata for almost a quarter of a million books, covering at least the period 1700–1900. The HathiTrust metadata is distributed from their website, separately from the full texts of their digital collection. The metadata from ECCO and the BL collection were extracted from the full-text files in the course of our research at The Concept Lab5.

A quick look at the number of titles per year in each source makes immediately apparent the difference between the collections. ECCO is a comprehensive collection, claiming to contain “every significant English-language and foreign-language title printed in the United Kingdom between the years 1701 and 1800”. At the start of the Eighteenth Century it contains about a thousand titles per year, rising to between three and four thousand per year in the 1790s.

ecco_year <- group_by(ecco_tags, year) %>% summarize(book_count = n())
ggplot(ecco_year, aes(x = year, y = book_count)) +
  geom_point() + ggtitle("Books per year (ECCO)") + ylab('books published')

The British Library and HathiTrust collections contain far fewer titles overall, rising from around a hundred to around a thousand titles per year over the course of the Nineteenth Century.

# books per year
comb19c <- bind_rows(bl_tags, hathi_tags)
comb19c_year <- group_by(comb19c, year, dataset) %>% summarize(book_count = n())
ggplot(comb19c_year, aes(x = year, y = book_count, color=dataset)) +
  geom_point() + theme(legend.position = c(0.1, 0.8)) + ggtitle("Books per year (19th Century collections)") + ylab('books published')


Volumes and Pages

All three collections include metadata for the number of pages and volumes in each book. ECCO records the number of volumes differently, with a separate book entry for each volume. Therefore, the number of pages for ECCO is the number of pages per volume, while the other two collections record the total number of pages across all volumes of the book. The outlier at 1801 may be accounted for by the 23-volume, 21,000-page work ‘The Beauties of England and Wales’. The notable rise and fall in average volumes per title in the second half of the Nineteenth Century is likely related to the popularity of ‘triple-decker’ novels.6

byYear <- bind_rows(ecco_tags, bl_tags) %>% group_by(year, dataset) %>% 
    summarize(bookCount=n(), totalPages=sum(num_pages), mean_vols = mean(num_vols))


ggplot(byYear, aes(x = year, y = mean_vols, color=dataset  )) +
  geom_point() + theme(legend.position = c(0.1, 0.8)) + labs(x = "year", y = "mean volumes per title") + ggtitle('Volumes per title') 

byYear <- group_by(bind_rows(hathi_tags, bl_tags), year, dataset) %>% 
    summarize(bookCount=n(), mean_pages=mean(num_pages), mean_vols = mean(num_vols))

pages1 <- ggplot(byYear, aes(x = year, y = mean_pages, color=dataset  )) + geom_point() +
  theme(legend.position = c(0.15, 0.7)) + ggtitle('mean pages per book (19th Century)')

byYear <- group_by(ecco_tags, year) %>% 
    summarize(bookCount=n(), mean_pages=mean(num_pages), mean_vols = mean(num_vols))
pages2 <- ggplot(byYear, aes(x = year, y = mean_pages)) + geom_point() + ggtitle('mean pages per volume (ECCO)')

grid.arrange(pages2, pages1, ncol=2)
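
Because ECCO has one row per volume while the other two collections have one row per title, page counts are only directly comparable once they are brought to the same unit. The following is a minimal sketch in base R of how volume-level rows could be rolled up to title level. It assumes a hypothetical `work_id` column linking the volumes of one work (no such column is confirmed in the ECCO metadata), and the numbers are made up.

```r
# Toy volume-level records; `work_id` is a hypothetical identifier
# linking the volumes of one work, and the numbers are invented.
volumes <- data.frame(
  work_id   = c("w1", "w1", "w1", "w2"),
  year      = c(1750, 1750, 1750, 1751),
  num_pages = c(300, 320, 280, 150)
)

# Total pages per work: sum over all volumes sharing a work_id.
per_title <- aggregate(num_pages ~ work_id + year, data = volumes, FUN = sum)

# Number of volumes per work: count the rows in each group.
# Same formula and data, so the rows come back in the same order.
vol_counts <- aggregate(num_pages ~ work_id + year, data = volumes, FUN = length)
per_title$num_vols <- vol_counts$num_pages

per_title
```

With a table like `per_title`, mean pages per title for ECCO could be plotted on the same footing as the BL and HathiTrust figures.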


Titles

Moretti (2009) analyses three sources of metadata for about seven thousand titles of British fiction 1740–1850, and notes a clear decrease in the length of titles over time. This trend is also clear in the Nineteenth Century collections here. The magnitude of the trend depends on the kind of pre-processing that has been applied to the title: the British Library collection contains a title field that includes long subtitles and meta-textual information (‘in three volumes’, ‘a classic tale’) which Moretti excludes from his analysis. The approach I have taken is to parse each title with a sentence segmenter and retain only the first sentence in the title. This method preserves sections after ellipses or semi-colons, but crops the title at the first full stop, exclamation mark, or question mark, dealing appropriately with abbreviations containing full stops.
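
The cropping rule described above can be sketched in a few lines of base R. This is only an illustration: the post does not say which sentence segmenter was used, so the function `first_sentence` and its short abbreviation list are hypothetical stand-ins, and a real segmenter would handle many more cases.

```r
# Hypothetical helper: keep a title up to its first sentence-ending mark.
# Ellipses, single-letter initials, and a small (assumed) list of
# abbreviations are not treated as sentence ends.
first_sentence <- function(title, abbrevs = c("Mr", "Mrs", "Dr", "St", "Esq")) {
  words <- strsplit(title, "\\s+")[[1]]
  for (i in seq_along(words)) {
    w <- words[i]
    stem <- sub("[.!?]+$", "", w)               # word without trailing marks
    ends_sentence <- grepl("[.!?]$", w)
    is_ellipsis <- grepl("\\.{2,}$", w)         # "..." does not end the title
    is_abbrev <- grepl("\\.$", w) && (stem %in% abbrevs || nchar(stem) == 1)
    if (ends_sentence && !is_ellipsis && !is_abbrev) {
      return(paste(words[seq_len(i)], collapse = " "))
    }
  }
  title  # no sentence-ending mark found: keep the whole title
}

titles <- c("The Wars of Mr. Smith; a tale. In three volumes.",
            "Who goes there? A novel")
short_titles <- vapply(titles, first_sentence, character(1), USE.NAMES = FALSE)
short_titles
```

In the first example the semi-colon and the abbreviation ‘Mr.’ are kept, and the meta-textual ‘In three volumes.’ is dropped.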

The HathiTrust metadata seems to have already undergone this kind of processing, and extra-textual or meta-textual notes are excluded from the title field in the original data. Even when only including the first sentence, there are many more long titles in the BL collection. This might be due to a very strict definition of what counts as part of the title in the HathiTrust collection, or due to the inclusion in the BL data of many non-fiction publications, especially geographical and historical works, with long, descriptive titles. For example, the BL data contains a work titled: The Wars of Succession of Portugal and Spain from 1826 to 1840: with résumé of the political history of Portugal and Spain to the present era. There are no such works in the Hathi collection, and the fiction, poetry, and drama titles don’t include long descriptive titles like these.

Despite the overall difference between the collections, there is a trend of steadily shortening titles in both. The trend is not as obvious in the HathiTrust data, but both the mean and the variance of the title lengths are lower at the end of the century.

comb19c <- mutate(comb19c, title_words = ntoken(short_title))
comb19c_year_dataset <- group_by(comb19c, year, dataset, Collection) %>% 
  summarize(book_count=n(), mean_pages=mean(num_pages), sd_pages=sd(num_pages),
            mean_title = mean(title_words), median_title=(median(title_words)))
hathi_year <- filter(comb19c_year_dataset, dataset == 'HATHI')
p1 <- ggplot(hathi_year, aes(x = year, y = mean_title, color=Collection)) +
  geom_point() + theme(legend.position = c(0.85, 0.9)) + ggtitle('Mean title length, Hathi data')
p2 <- ggplot(comb19c_year_dataset, aes(x = year, y = mean_title, color=dataset)) +
  geom_point() + theme(legend.position = c(0.85, 0.9)) + ggtitle('Mean title length, all 19C data')
grid.arrange(p1, p2, ncol=2)

The trend of decreasing title lengths is not apparent in ECCO. In some cases it seems that parsing errors have included a few very long texts as titles, which skews the mean for a particular year, so this plot shows median as well as mean title length:

ecco_tags <- mutate(ecco_tags, title_words = ntoken(short_title), raw_title_words=ntoken(raw_title))
ecco_year <- group_by(ecco_tags, year) %>%
    summarize(book_count=n(), median_title=median(title_words), mean_title = mean(title_words))
ggplot(ecco_year, aes(x = year, y = mean_title, color="mean words per title")) +
  geom_point() + geom_point(aes(y = median_title, color = "median words per title")) + ylim(c(10,30)) + theme(legend.position = c(0.85, 0.9)) + ggtitle('Mean and median title, ECCO')
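
The reason for plotting the median alongside the mean can be seen in a toy example. The numbers below are invented title lengths for a single year, with one value standing in for a parsing error that has swallowed a long stretch of text.

```r
# Invented title lengths (in words) for one year; the last value
# imitates a parsing error that captured far more than the title.
title_words <- c(8, 10, 12, 9, 11, 400)

mean(title_words)    # 75: pulled far upwards by the single outlier
median(title_words)  # 10.5: stays close to the typical title
```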

Perhaps the summary statistics mask a trend among very long or very short titles? This plot shows the proportion of titles in each year that have at least twenty or at most five words.

ecco_tags <- mutate(ecco_tags, is_long_title = ifelse(title_words >= 20, "yes", "no"),
                   is_short_title = ifelse(title_words <= 5, "yes", "no") )

ecco_year_extra <- group_by(ecco_tags, year) %>%
    summarize(book_count=n(), is_long_title = length(is_long_title[is_long_title=='yes']),
    is_short_title = length(is_short_title[is_short_title=='yes'] )) %>%
    mutate(long_frac = is_long_title/book_count, short_frac = is_short_title/book_count )

ggplot(ecco_year_extra, aes(x = year, y = long_frac, color = '>= 20 words')) +
  geom_point() + geom_point(aes(y = short_frac, color = '< 6 words'))  + ggtitle('Long and short titles, ECCO') + theme(legend.position = c(0.9, 0.5)) 

ECCO contains many different types of publication, while Moretti (2009) worked with a dataset of exclusively British fiction. ECCO has a ‘Collection’ field giving a rough classification of the publications, and one of the values is ‘Language and Literature’. The titles in this subset of ECCO are on average much shorter than those of the collection as a whole, but again there’s no clear declining trend.

ecco_lit <- filter(ecco_tags, str_detect(Collection, 'Literature'))

ecco_lit_year <- group_by(ecco_lit, year) %>%
    summarize(book_count=n(), median_title=median(title_words), mean_title = mean(title_words))
ggplot(ecco_lit_year, aes(x = year, y = mean_title, color="mean words per title")) +
  geom_point() + geom_point(aes(y = median_title, color = "median words per title")) + ylim(c(0,20)) + ggtitle('Title length in Language and Literature collection, ECCO')+ theme(legend.position = c(0.85, 0.9)) 


Crowded Markets

Moretti (2009) notes that fiction titles became shorter as the number of books published each year increased. We can clearly observe this correlation in the 19th Century datasets.

comb19c_year_dataset <- group_by(comb19c, year, dataset) %>% 
  summarize(book_count=n(), mean_pages=mean(num_pages), sd_pages=sd(num_pages),
            mean_title = mean(title_words), median_title=(median(title_words)))

ggplot(comb19c_year_dataset, aes(x = book_count, y = mean_title, color = dataset)) +
  geom_point() + ggtitle('Mean title length against number of books published in the year') + theme(legend.position = c(0.85, 0.9)) 

However, because the number of books, the number of distinct publishers, and the brevity of the titles are all highly correlated with time, it’s difficult to identify a simple causal relationship between the size of the market and title length. The BL data contains a metadata field specifically for the publisher, unlike the other two collections which record the imprint.

bl_tags <- mutate(bl_tags, title_words = ntoken(short_title))
bl_year <- group_by(bl_tags, year) %>% 
  summarize(book_count=n(), mean_title = mean(title_words), median_title=(median(title_words)), num_publisher = n_distinct(publisher), num_books = n()) %>% mutate(books_per_pub = num_books/num_publisher)
ggplot(bl_year, aes(x= year, y = num_publisher)) + geom_point() + ggtitle('Number of distinct publishers in each year')

This correlation plot shows the associations between title length, number of books published, number of publishers, and year.

cor_sub <- select(bl_year, year, num_books, num_publisher, mean_title)
M <- cor(cor_sub)
library(corrplot)
corrplot(M, method ='circle', cl.ratio = 0.3)
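
Since every variable in the correlation matrix trends with year, one hedged way to look past the shared time trend is a partial correlation: remove a linear trend in year from each series and correlate what is left. The sketch below uses simulated yearly values, not the BL data, and a linear trend is only an assumption. The two series are constructed to be unrelated apart from their drift over time.

```r
set.seed(1)

# Simulated yearly summaries (not values from the BL data): both
# series drift with time but are otherwise independent.
year <- 1800:1899
num_books  <- 100 + 9 * (year - 1800) + rnorm(100, sd = 40)
mean_title <- 20 - 0.1 * (year - 1800) + rnorm(100, sd = 1)

# Raw correlation: strongly negative, driven by the shared trend.
raw_cor <- cor(num_books, mean_title)

# Partial correlation given year: correlate the residuals left
# after a linear trend in year is removed from each series.
res_books <- resid(lm(num_books ~ year))
res_title <- resid(lm(mean_title ~ year))
partial_cor <- cor(res_books, res_title)

c(raw = raw_cor, partial = partial_cor)
```

If a similar gap between the raw and partial correlations appeared in the real data, it would suggest that the association between market size and title length is largely carried by time.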


Verbs and Adjectives

As titles shorten, the range of possible syntactic structures that might be used in a title narrows. Because extra- and meta-textual description has been stripped from these titles, and the genre is restricted to fiction, poetry, and drama, a manageable number of possible part-of-speech patterns remain for analysis. The titles were parsed with the Python NLP package spaCy7. The automatic parse has problems: the parser is trained on modern English, may be confused by title case and proper nouns, and is designed to work with full sentences rather than fragmentary titles. Nonetheless, the most frequent patterns are quite short, and relatively easy for the parser to deal with.
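
To make ‘part-of-speech pattern’ concrete, here is a small sketch in base R. The tags are written by hand rather than produced by a parser, and joining them with spaces is an assumption about how the `pattern` field is stored.

```r
# Hand-written Universal POS tags for three titles (not parser output).
tags <- list(
  "The Heir of Redclyffe"  = c("DET", "NOUN", "ADP", "PROPN"),
  "Emma"                   = c("PROPN"),
  "The Old Curiosity Shop" = c("DET", "ADJ", "NOUN", "NOUN")
)

# Collapse each tag sequence into a single pattern string
# (space-separated here; the real separator is an assumption).
pattern <- vapply(tags, paste, character(1), collapse = " ")

# Count how often each pattern occurs, most frequent first.
pattern_counts <- sort(table(pattern), decreasing = TRUE)

# A title contains an adjective if its pattern includes the ADJ tag.
has_adj <- grepl("\\bADJ\\b", pattern)
```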

These are the one hundred most frequent part-of-speech patterns in the HathiTrust titles, with all remaining patterns lumped together as ‘Other’ (ADP is ‘adposition’, the possible values for ‘Collection’ are ‘fiction’, ‘drama’, and ‘poetry’).

hathi_tags$pattern <- as.factor(hathi_tags$pattern)

hathi_tags <- mutate(hathi_tags, lump_pattern = fct_lump(pattern, n=100)) 
dt <- table(Pattern = hathi_tags$lump_pattern) %>% as.data.frame %>% arrange(desc(Freq))
DT::datatable(dt, options = list(dom = 'tp'))

Searching this interactive table can give a sense of the part-of-speech patterns that structure particular titles. To keep the interactive table to a manageable size, it only includes titles from the first few and last few years of the collection (up to 1802, and from 1898 onwards).

tmp <- filter(hathi_tags, pattern != 'Other', year <= 1802 | year >= 1898) %>%
  # use the same cut-off as the filter, so that 1802 is labelled "early"
  mutate(period = ifelse(year <= 1802, "early", "late")) %>% select(Collection, author, short_title, pattern,year) 

DT::datatable(tmp, options = list(dom = 'tp'), filter = 'top')

Moretti (2009) notes that titles use fewer verbs as they get shorter, and also draws attention to the nouns, adjectives, and articles that are chosen in short titles. There’s no clear decline in the proportion of titles per year that contain verbs in the HathiTrust data, though the overall proportion is low, and the variance decreases with time. The British Library data, with its long summarizing titles particularly at the start of the period, unsurprisingly shows a decrease in the proportion of titles containing verbs and adjectives over time.

hathi_tags <-
  mutate(
  hathi_tags,
  has_verb = ifelse(str_detect(pattern, "VERB"), "yes", "no"),
  has_adj = ifelse(str_detect(pattern, "ADJ"), "yes", "no")
  ) %>% mutate(title_words = ntoken(short_title))
  
byh <- group_by(hathi_tags, year) %>% summarize(book_count = n(),
  mean_title = mean(title_words),
  median_title = median(title_words),
  has_verb = length(has_verb[has_verb == 'yes']),
  has_adj = length(has_adj[has_adj == 'yes'])) %>%
  mutate(verb_frac = has_verb / book_count) %>% mutate(verb_prop = verb_frac / mean_title) %>%
  mutate(adj_frac = has_adj / book_count) %>% mutate(adj_prop = adj_frac /mean_title)
  
bl_tags <-
  mutate(
  bl_tags,
  has_verb = ifelse(str_detect(pattern, "VERB"), "yes", "no"),
  has_adj = ifelse(str_detect(pattern, "ADJ"), "yes", "no")
  ) %>% mutate(title_words = ntoken(short_title))
  
bybl <- group_by(bl_tags, year) %>% summarize(book_count = n(),
  mean_title = mean(title_words),
  median_title = median(title_words),
  has_verb = length(has_verb[has_verb == 'yes']),
  has_adj = length(has_adj[has_adj == 'yes'])) %>%
  mutate(verb_frac = has_verb / book_count) %>% mutate(verb_prop = verb_frac / mean_title) %>%
  mutate(adj_frac = has_adj / book_count) %>% mutate(adj_prop = adj_frac /mean_title)
  
vh <- ggplot(byh, aes(x = year, y = verb_frac)) + geom_point() + ggtitle('Fraction of titles containing verbs, Hathi data')
vbl <- ggplot(bybl, aes(x = year, y = verb_frac)) + geom_point() + ggtitle('Fraction of titles containing verbs, BL data')

ah <- ggplot(byh, aes(x = year, y = adj_frac)) + geom_point() + ggtitle('Fraction of titles containing adjectives, Hathi data')
abl <- ggplot(bybl, aes(x = year, y = adj_frac)) + geom_point() + ggtitle('Fraction of titles containing adjectives, BL data')
  
grid.arrange(vh, vbl, ah, abl, ncol = 2)

Is the syntactic pattern of the title related to the types of nouns in the title? The table below shows the frequency with which nouns occur in titles that contain adjectives compared to their frequency in titles without adjectives. The data is from short titles (four words or fewer) in the Hathi metadata. Meta-textual words (such as ‘story’, ‘poem’, ‘novel’, ‘tales’) have been removed from the data.
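
A comparison of this kind could be computed along the following lines. This is a sketch on invented data: the data frame `nouns`, its column names, and its values are hypothetical and are not taken from the Hathi metadata.

```r
# Toy data: one row per noun occurrence in a short title, with a flag
# for whether that title also contains an adjective (all values invented).
nouns <- data.frame(
  noun    = c("heart", "heart", "house", "house", "house", "love"),
  has_adj = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE)
)

# Cross-tabulate nouns against the presence of an adjective.
counts <- table(nouns$noun, nouns$has_adj, dnn = c("noun", "has_adj"))

# Convert counts to shares within each column, so that each noun's
# frequency is relative to titles with, or without, adjectives.
shares <- prop.table(counts, margin = 2)
round(shares, 2)
```

Nouns with a much larger share in the `TRUE` column than in the `FALSE` column would be the ones that tend to appear alongside adjectives.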

Further Questions

Here I’ve simply provided a description of the metadata associated with these collections. I hope that it suggests further questions; please get in touch if you have suggestions or ideas for further analyses.


  1. Moretti, Franco. 2009. “Style, Inc. Reflections on Seven Thousand Titles (British Novels, 1740–1850).” Critical Inquiry 36 (1). JSTOR: 134–58.

  2. Gale Cengage

  3. This corpus contains digitized books from one section in the British Library, mostly containing history, geography, and language and literature. More information on the BL website.

  4. HathiTrust Research Center, and the Understanding Genre project by Ted Underwood.

  5. The Concept Lab is an interdisciplinary project in the Centre for Research in Arts, Social Science, and Humanities (CRASSH) at the University of Cambridge. The ECCO metadata was extracted by Gabriel Recchia.

  6. The decline of the ‘triple-decker’ novel has been attributed to changes in purchasing terms by circulating libraries.

  7. spaCy Python NLP package: https://spacy.io/