This project explores the earliest century of English printed books — the 1400s — using a novel combination of generative AI, classical data science, and historical inquiry. Drawing from the Early English Books Online (EEBO) corpus, our goal is to surface new insights from texts published between 1400 and 1499, a relatively quiet but foundational period in the history of print.
While the 1400s represent a small fraction of the overall EEBO collection, these texts are invaluable: they capture a moment when the written word began shifting from manuscript culture to the early mechanics of the printing press. By applying large language models (LLMs) and automated text analysis, we aim to make this underexplored slice of literary history more searchable, quantifiable, and meaningful.
This work serves as both a case study and a proof-of-concept: showing how modern AI tools can illuminate pre-modern texts and make the distant past more legible, searchable, and surprising.
We focus on the earliest printed books in the EEBO corpus — those published between 1400 and 1499. Our approach combines large language models (LLMs), classical text mining, and data visualization to extract meaning from texts that are often linguistically and structurally challenging.
We filter the EEBO dataset to isolate books from the 1400s with valid publication dates and usable title and text fields. After removing OCR noise and empty entries, we work with a curated subset of the 15th-century titles recorded in EEBO.
We use API-connected large language models to translate both titles and full texts into modern English. Prompts are engineered to retain historical vocabulary while skipping OCR artifacts and outdated syntax. We run translations in stable batches, log errors, and save intermediate results.
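The batching, error logging, and checkpointing described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the project's actual pipeline: `translate_stub()` is a hypothetical stand-in for the real API call, and `translate_in_batches()`, the batch size, and the checkpoint file are names and values chosen only for this example.

```r
# Hypothetical stand-in for the real API call; it only tags the input.
translate_stub <- function(text) {
  if (!nzchar(text)) stop("empty input")
  paste("[modern]", text)
}

# Translate `texts` in batches, logging failures and saving a
# checkpoint after every batch so a crash does not lose finished work.
translate_in_batches <- function(texts, translate_fn, batch_size = 2,
                                 checkpoint = tempfile(fileext = ".rds")) {
  results <- rep(NA_character_, length(texts))
  errors  <- character(0)
  batches <- split(seq_along(texts), ceiling(seq_along(texts) / batch_size))
  for (idx in batches) {
    for (i in idx) {
      results[i] <- tryCatch(
        translate_fn(texts[i]),
        error = function(e) {
          # Record the failing row and keep going
          errors <<- c(errors, sprintf("row %d: %s", i, conditionMessage(e)))
          NA_character_
        }
      )
    }
    saveRDS(results, checkpoint)  # intermediate results
  }
  list(results = results, errors = errors, checkpoint = checkpoint)
}

out <- translate_in_batches(
  c("Hic incipit paruus Catho", "", "Manipulus curatorum"),
  translate_stub
)
```

Here the empty second entry fails, is logged in `out$errors`, and leaves an `NA` in `out$results`, while the other rows are translated as usual.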
After translation, we clean the texts to remove formatting noise and
tokenize the data for analysis. We apply filters for stopwords, word
length, and character types, using tidytext in R for
structured outputs.
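As a rough illustration of these filters, the sketch below applies them with base R only, so it runs without extra packages. The function name `clean_tokens()` and the five-word stopword list are assumptions made for this example; the analysis itself relies on tidytext and its `stop_words` table.

```r
# Tiny illustrative stopword list; the real analysis uses tidytext::stop_words.
toy_stopwords <- c("the", "and", "with", "that", "from")

clean_tokens <- function(text, stopwords = toy_stopwords, min_chars = 4) {
  tokens <- unlist(strsplit(tolower(text), "[^[:alnum:]']+"))
  tokens <- tokens[grepl("^[a-z]+$", tokens)]   # character types: letters only
  tokens <- tokens[nchar(tokens) >= min_chars]  # word length: drop short terms
  tokens[!tokens %in% stopwords]                # drop stopwords
}

tokens <- clean_tokens("The king and the 3 barons rode with Sir Tristram to Cornwall")
print(tokens)
```

The digit, the short words, and the stopword "with" are removed, leaving only the content-bearing tokens.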
We analyze trends in word frequency, themes, and named entities. Topic models and n-gram patterns help us trace how certain concepts evolved. We visualize the findings using bar plots, word clouds, and time-based comparisons.
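To make the n-gram step concrete, here is a small base R sketch that counts bigrams (adjacent word pairs) in a token vector. The helper `count_bigrams()` and the toy tokens are made up for illustration and are not part of the original analysis.

```r
# Count adjacent word pairs in an ordered token vector.
count_bigrams <- function(tokens) {
  if (length(tokens) < 2) return(integer(0))
  bigrams <- paste(head(tokens, -1), tail(tokens, -1))
  counts  <- table(bigrams)
  # Return a plain named integer vector, most frequent first
  sort(setNames(as.integer(counts), names(counts)), decreasing = TRUE)
}

toy_tokens    <- c("holy", "father", "holy", "father", "holy", "grace")
bigram_counts <- count_bigrams(toy_tokens)
print(bigram_counts)
```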
Our pipeline combines LLM-based translation, classical text mining, and data visualization.
This hybrid workflow allows us to explore the intellectual DNA of the 1400s and build scalable methods for future historical analysis.
Before diving into deeper interpretation, we perform an exploratory analysis of the translated 1400s EEBO texts to better understand their structure, content, and thematic focus.
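library(tidyverse)   # readr, dplyr, ggplot2, stringr
library(knitr)       # kable()
library(kableExtra)  # kable_styling()
library(plotly)      # ggplotly()
library(tidytext)    # unnest_tokens(), bind_tf_idf(), stop_words
library(wordcloud)   # wordcloud()

df <- read_csv("df_full_id_year.csv")
set.seed(12)  # reproducible sample

# Keep the 15th-century records and standardize the column names
df_1400 <- df %>%
  select(id, year, title = Title, text = Text) %>%
  filter(year >= 1400, year < 1500)

# Show 15 random books, truncating long strings to 800 characters
df_1400 %>%
  slice_sample(n = 15) %>%
  mutate(across(where(is.character), ~ substr(., 1, 800))) %>%
  kable() %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

| id | year | title | text |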
|---|---|---|---|
| 132123 | 1485 | I shal disceyue them by sum mene … | No Text |
| 132112 | 1485 | Vulgaria q[ue]da[m] abs There[n]cio in Anglicam linguam t[ra]ducta | No Text |
| 93563 | 1482 | [T]henne followyng this for Breton booke of Prolicronycon … | No Text |
| 132124 | 1485 | Thinges used ye may go surely em[on]g the peple … | No Text |
| 70992 | 1494 | [Here begynneth the table of a book entytled the book of good maners.] | No Text |
| 21283 | 1495 | KL Ianuarius habet dies .xxxi. Luna .xxx | No Text |
| 15475 | 1496 | Tenir per seruice de chiualer, est a tenir per homage foi alte … | Page [unnumbered] Page [unnumbered] Page [unnumbered] Dieu et mon droit Page [unnumbered] TEnir per seruice de chiualer: est a tenir per homage foi alte & estuage & tret a suy garder mariage & relif{is} ¶Et nota {qu} seruice de chiualer est seruice de terre ou de tene¦ment pour armes porter en guerre en defence du roial¦me. & doit garder mariage & relief {per} la raison que nul est able ne depoiar et ne puit auer conusaunce darmes porter auaunt {qu} il soit dage de .xxi. ans. Et pour taunt {qu} le seignour ne perdera ceo {qu} de droit doit auer & {qu} la poiar de la roialme de rien ne soit enfeble. ¶Le roy voet per cause de sountendre age que soun seignour luy auera en sa gard tan{que} al pleyn age de luy: cest assauoir .xxi. ans. tEnir per graunt sergeanti est si come vn home tient certei |
| 59659 | 1498 | Manipulus curatorum | No Text |
| 28047 | 1498 | In dei nomine amen nouerint vniversi cristifideles qualiter santissimi dominus felicis re | Page [unnumbered] In dei nomine Amen Nouerint vniuer•• cristifideles qualiter S•••••ssim{us} dominus noster felicis re¦cordacionis Innocentib{us} papa octau{us} co•••ssit de speciali preuilegio & gracis vt anima{rum} illo•um qui cum Caritate ab hac luce decesserint salus procuretur quod si qui parentes amici aut alii xpristifideles pietate comoti cuiusuis nacionib{us} et prouincie & vbicunq{que} fuerint. ac vbicū{que} degāt vicesimam partem vnius ducati pro anima vniuscuius{que} sic defuncti dederint aut miserint pro reedificatione hospitalis maioris apud sanctum Iacobum in cō¦postella nec non pro duarum capellarum in dicto hospitali fundacione quarum vna viris alia mulieribus tam dan∣tes & mittētes {quam} defuncti predicti In omnibus Suffragiis Precibus & Elimosinis. Ieiuniis. Oracion |
| 22927 | 1481 | Incipit tabula h[uius] libri | No Text |
| 143714 | 1489 | Lere [sic] endeth the book named the dictes or sayngis of the philosophers … | No Text |
| 54606 | 1498 | [A prognostication for 1498] | No Text |
| 110672 | 1499 | Incipit liber qui vocatur festiualis | No Text |
| 43803 | 1483 | Explanationes notabiles deuotissimi viri Richardi Hampole heremite sup[er] lectio[n]es illas Beati Iob … | No Text |
| 42336 | 1476 | Hic incipit paruus Catho | Page [unnumbered] Page [unnumbered] Page [unnumbered] Page [unnumbered] Hic Incipit paruus Catho •Vm aīduerterem quā hoīes grauiter errare Whanne I aduerte to my remembrance And see how sele folkes erren greuously In the wey of vertuouse gouernance I haue supposed in myn hert that I Oughte to supporte and conseille prudently Them to be vertuous in lyuing And how they shal them self in hōnour bring Igitur fili carissime docebo te quo pacto mores Therfore my leue child I shal now telle the Herken me wele the maner and the guyse How thi sowle inward shal acqueinted be With thewes good and vertues in al wyse Rede and conceiue for he is to dispise That redith ay and wote not what it ment Suche reding is no thing but wynde myspent Ita{que} deo supplica / parentes ama Pray thy god & preise hy |
[1] "Total number of books: 146323"
total_books_1400 <- df_1400 %>% summarise(n = n()) %>% pull(n)
paste("Total number of books from 1400s:", total_books_1400)

[1] "Total number of books from 1400s: 408"
Only 408 of the 146,323 books in the collection (about 0.28%) date from the 1400s.
To make publishing trends over time easier to read, we group the data into 25-year intervals. Here we use every dated record from after 1400 up to 1749, including the rows with No Text.
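The binning rule itself is simple arithmetic: round the year down to the nearest multiple of the interval width, then label the interval by its first and last year. The helper below, `bin_label()`, is a name introduced only for this sketch; it lets us check the rule on a few years.

```r
# Label a year with the interval of `width` years it falls into.
bin_label <- function(year, width = 25) {
  start <- floor(year / width) * width
  paste0(start, "-", start + width - 1)
}

labels <- bin_label(c(1476, 1499, 1500, 1649))
print(labels)
# "1475-1499" "1475-1499" "1500-1524" "1625-1649"
```

Calling `bin_label(year, width = 10)` gives the decade labels used for the 1400s-only chart.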
# Assign each record to a 25-year bin, e.g. 1475-1499
df_bins <- df %>%
  filter(!is.na(year), year > 1400, year < 1750) %>%
  mutate(
    year_bin = floor(year / 25) * 25,
    year_bin_label = paste0(year_bin, "-", year_bin + 24)
  )

# Interactive bar chart of record counts per bin
plot <- df_bins %>%
  count(year_bin_label) %>%
  ggplot(aes(x = year_bin_label, y = n, text = paste("Count:", n))) +
  geom_col(fill = "steelblue") +
  labs(
    title = "Books Published in 25-Year Intervals",
    x = "",
    y = ""
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(plot, tooltip = "text")

Despite a slow start, the final decades of the 15th century saw a dramatic surge in published books:
# Assign each 1400s record to a decade, e.g. 1480-1489
df_1400_bins <- df_1400 %>%
  mutate(
    year_bin = floor(year / 10) * 10,
    year_bin_label = paste0(year_bin, "-", year_bin + 9)
  )

# Interactive bar chart of record counts per decade
plot_1400 <- df_1400_bins %>%
  count(year_bin_label) %>%
  ggplot(aes(x = year_bin_label, y = n, text = paste("Count:", n))) +
  geom_col(fill = "steelblue") +
  labs(
    title = "1400s Books Published in 10-Year Intervals",
    x = "",
    y = ""
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplotly(plot_1400, tooltip = "text")

This sharp growth aligns closely with the introduction of the printing press to England by William Caxton in 1476. The explosion in book production during the 1480s and 1490s reflects both technological adoption and rising demand for printed texts across religious, legal, and philosophical domains.
# Character and word counts for each title and text
df_1400_summary <- df_1400 %>%
  mutate(
    title_char_count = nchar(title),
    title_word_count = str_count(title, "\\S+"),
    text_char_count = nchar(text),
    text_word_count = str_count(text, "\\S+")
  )

df_1400_summary %>%
  summarise(
    avg_title_words = mean(title_word_count, na.rm = TRUE),
    avg_text_words = mean(text_word_count, na.rm = TRUE),
    median_text_words = median(text_word_count, na.rm = TRUE),
    min_text_words = min(text_word_count, na.rm = TRUE),
    max_text_words = max(text_word_count, na.rm = TRUE),
    total_books = n()
  ) %>%
  kable(
    caption = "Summary Statistics for 1400s EEBO Books",
    digits = 1, # rounds to 1 decimal place
    format = "markdown"
  )

| avg_title_words | avg_text_words | median_text_words | min_text_words | max_text_words | total_books |
|---|---|---|---|---|---|
| 12.6 | 18220.2 | 2 | 2 | 694182 | 408 |
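As we saw earlier, many of the books have No Text in their text column. That placeholder counts as two words, which is why the median and minimum word counts above are both 2. The text can be missing for several reasons; for example, EEBO may list a title without holding its full text.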
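The effect of the placeholder on the summary statistics can be checked on toy data. In the base R sketch below, `count_words()` mirrors the `\\S+` word count used above, and the text lengths are made up purely for illustration.

```r
# Count runs of non-whitespace characters, i.e. words.
count_words <- function(x) lengths(regmatches(x, gregexpr("\\S+", x)))

# Toy column: three placeholders and two real texts (made-up lengths).
toy_text <- c("No Text", "No Text", "No Text",
              paste(rep("word", 50), collapse = " "),
              paste(rep("word", 900), collapse = " "))

counts      <- count_words(toy_text)
median_all  <- median(counts)                         # dominated by the placeholder
median_real <- median(counts[toy_text != "No Text"])  # real texts only
print(c(all = median_all, real = median_real))
```

When placeholders make up more than half of the rows, the median collapses to 2 regardless of how long the real texts are.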
Next, we narrow our analysis to books that contain actual textual content, excluding entries labeled “No Text” or those with only two words or fewer. This helps us focus on documents with meaningful substance.
# Keep only the books that have actual text
df_1400_text <- df_1400_summary %>%
  filter(text != "No Text")

df_1400_text %>%
  summarise(
    avg_title_words = mean(title_word_count, na.rm = TRUE),
    avg_text_words = mean(text_word_count, na.rm = TRUE),
    median_text_words = median(text_word_count, na.rm = TRUE),
    min_text_words = min(text_word_count, na.rm = TRUE),
    max_text_words = max(text_word_count, na.rm = TRUE),
    total_books = n()
  ) %>%
  kable(
    caption = "Summary Statistics for 1400s EEBO Books with Text",
    digits = 1, # rounds to 1 decimal place
    format = "markdown"
  )

| avg_title_words | avg_text_words | median_text_words | min_text_words | max_text_words | total_books |
|---|---|---|---|---|---|
| 17.9 | 58072.5 | 19613.5 | 13 | 694182 | 128 |
To make the content of 15th-century books accessible for modern research and analysis, we use Large Language Models (LLMs) to translate both titles and full texts into contemporary English.
The process included several key steps:
- Translation via API: Using a batch-processing pipeline, we passed each book’s original title and text through an LLM (via API), producing two new fields:
  - modern_title: a plain English version of the original title
  - modern_text: the translated full text, preserving sentence structure and meaning
- Noise tolerance: The prompt was crafted to be tolerant of OCR noise and early-modern spellings, instructing the model to focus on semantic clarity while ignoring non-linguistic artifacts.
After filtering, only books with substantive content were included in the final translated dataset.
The result is a modernized corpus of early printed texts that can now be used for downstream research, including topic modeling, keyword extraction, and content-based historical analysis.
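For readers who want to try something similar, a noise-tolerant translation prompt might be assembled along these lines. The wording, the function name `build_translation_prompt()`, and the `max_chars` limit are all assumptions for this sketch and not the prompt we actually used.

```r
# Assemble a translation prompt for one book. The wording is illustrative only.
build_translation_prompt <- function(title, text, max_chars = 4000) {
  paste0(
    "Translate the following 15th-century text into modern English.\n",
    "Keep historical vocabulary where it carries meaning, ",
    "and ignore OCR artifacts such as 'Page [unnumbered]' or stray symbols.\n\n",
    "Title: ", title, "\n",
    "Text: ", substr(text, 1, max_chars)  # truncate very long texts
  )
}

prompt <- build_translation_prompt(
  "Hic incipit paruus Catho",
  "Whanne I aduerte to my remembrance",
  max_chars = 20
)
cat(prompt)
```

The resulting string would then be sent to whichever LLM API is in use, and the response stored in the modern_title and modern_text fields.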
Below is an example of how the GenAI-powered translation pipeline transforms early printed English into modern, readable language.
set.seed(14)  # reproducible sample
df_1400_translated <- read_csv("df_translated_1400s.csv")

# Show 5 random translated books, truncating long strings to 800 characters
df_1400_translated %>%
  slice_sample(n = 5) %>%
  mutate(across(where(is.character), ~ substr(., 1, 800))) %>%
  kable() %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

| id | year | modern_title | modern_text |
|---|---|---|---|
| 54608 | 1498 | The Death of Arthur | Sir Tristram was banished from Cornwall for ten years after a council with Sir Fergus and others. He swore an oath before the king and his barons and left the country, accompanied by many barons, some friends and some foes. Meanwhile, a knight named Dinadan came looking for Sir Tristram and was directed to him as he was preparing to board his ship. Dinadan asked Tristram to joust with him before he left, and Tristram agreed, with the barons’ permission. They charged at each other, and Tristram knocked Dinadan down. Dinadan then requested to join Tristram’s company, which Tristram welcomed, and they rode together to their ships. Once at sea, Tristram sent greetings to King Mark and his enemies, promising to return when he could. He reflected on his past battles, feeling he had been well re |
| 145254 | 1497 | Here begins a treatise called Perula | Here begins a treatise called “Perula.” A depiction of a master with his students. What should you do when you need to translate English into Latin? I will first repeat my English once, twice, or three times, and look for my main verb, then ask this question: who or what? The word that answers the question will be the nominative case for the verb. But if it is an impersonal verb, as in this example (The master teaches students), “teaches” is the verb. Who teaches? The master teaches. The master will not be the nominative case for the verb. The word that comes after the verb will most commonly be the accusative case, as in (Magister docet me). When I have an adjective noun, pronoun, participle, or relative pronoun, I will ask the same question: who or what? The word that answers will be the |
| 54610 | 1498 | A prediction for 1498 | In the second quarter of the year, a king will likely embark on a journey outside his realm, but if he does, he will not return home because of his enemies. Kings and princes will face troubles and sadness this year due to conflicts among themselves and uprisings from their subjects. Therefore, it is wise for kings and princes to keep themselves away from the common people and from poisons and other dangers. It would not be advisable for them to start a war in the first three quarters of the year, nor to send their men to war from mid-August to October 20. If war does arise, they should resist it through preparation and delay, as that would be best. Ultimately, kings and princes will defeat their adversaries, as I predicted last year, which the Cornishmen could have avoided if they had bee |
| 93568 | 1482 | Little Long | As I must go to the master, I need to go to the teacher. Good students enjoy learning. However, sometimes they use impersonal verbs and have nominative cases before them. For example, “This thing pleases me” and “This matter concerns me.” Therefore, the verb “interest” is constructed with a genitive case of nouns and an ablative case of pronouns like “my,” “your,” “his,” and “our.” In this example, it is for my benefit: “It concerns me.” It is for your benefit: “It concerns you.” It is for his benefit: “It concerns him.” Also, the verb “interest” after “precisely” is constructed with an ablative case of the feminine gender of this possessive noun: “Whose.” Therefore, “It pertains to me” and not “to me.” All the impersonal verbs in the following verses will be constructed with a dative ca |
| 143565 | 1478 | The Horse, the Sheep, and the Geese | The horse, the sheep, and the geese. Disputes, pleas, and disagreements between people were often resolved by looking for evidence in records. This was the custom of ancient times. Judges were appointed who had authority to hear cases that were presented fairly between the parties. People of all ranks were allowed to present their arguments based on the law that governed their disputes. Without exception, they were called to speak in front of the president, who first commanded silence. The parties were free to express their grievances and declare their rights. To conclude this matter briefly, not long ago, I will find a fitting analogy. A scene was skillfully painted on a wall showing two figures sitting in royal status: the fierce lion, famous in all kingdoms, and the imperial eagle, pie |
To get a high-level view of recurring themes and vocabulary, we extract the most common words across all translated 1400s texts. After removing stopwords and short noise terms, we reveal the dominant lexical patterns that shaped early printed thought.
# Tokenize the translated texts and count the 20 most frequent words
top_words_cumulative <- df_1400_translated %>%
  select(id, modern_text) %>%
  unnest_tokens(word, modern_text) %>%
  filter(str_detect(word, "^[a-z]+$"), nchar(word) > 3) %>% # Clean words
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 20)

top_words_cumulative %>%
  kable(col.names = c("Word", "Frequency"), caption = "Top 20 Most Common Words Across All Translated Texts") %>%
  kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))

| Word | Frequency |
|---|---|
| lord | 125 |
| king | 102 |
| christ | 75 |
| heart | 62 |
| jesus | 62 |
| love | 61 |
| saint | 53 |
| world | 53 |
| father | 51 |
| holy | 51 |
| called | 48 |
| soul | 46 |
| time | 45 |
| death | 44 |
| people | 44 |
| body | 42 |
| grace | 40 |
| life | 40 |
| mercy | 39 |
| person | 37 |
While common word frequency highlights broad themes, TF-IDF (term frequency–inverse document frequency) helps us surface the most distinctive words from each book. Below, we calculate the top 5 unique words per book and visualize their weighted importance across the entire 1400s corpus using a word cloud.
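To show what the score rewards, here is a worked TF-IDF calculation on toy counts in base R. It uses the common definition (term frequency as a share of the book's tokens, inverse document frequency as the natural log of total books over books containing the word), which to our understanding is also what tidytext's `bind_tf_idf()` computes. The data are made up and assume one row per book and word.

```r
# Toy token counts per book (made-up data, one row per book-word pair).
toy_counts <- data.frame(
  id   = c("A", "A", "B", "B"),
  word = c("king", "grammar", "king", "mercy"),
  n    = c(3, 1, 2, 2)
)

n_docs <- length(unique(toy_counts$id))

# tf: the word's share of all tokens in its book
toy_counts$tf <- toy_counts$n / ave(toy_counts$n, toy_counts$id, FUN = sum)

# idf: natural log of (number of books / number of books containing the word)
doc_freq <- ave(toy_counts$n, toy_counts$word, FUN = length)
toy_counts$idf <- log(n_docs / doc_freq)

toy_counts$tf_idf <- toy_counts$tf * toy_counts$idf
print(toy_counts)
```

Because "king" appears in both toy books, its idf is zero and it drops out, while "grammar" and "mercy" score above zero as words that distinguish one book from the other.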
# Tokenize and calculate tf-idf
tfidf_words <- df_1400_translated %>%
  select(id, modern_text) %>%
  unnest_tokens(word, modern_text) %>%
  filter(str_detect(word, "^[a-z]+$"), nchar(word) > 3) %>% # clean words
  anti_join(stop_words, by = "word") %>%
  count(id, word, sort = TRUE) %>%
  bind_tf_idf(word, id, n) %>%
  group_by(id) %>%
  slice_max(tf_idf, n = 5) %>%
  ungroup()

# Sum tf-idf weights per word and draw the word cloud
tfidf_words %>%
  count(word, wt = tf_idf, sort = TRUE) %>%
  with(wordcloud(words = word, freq = n, max.words = 100))

We’ll simulate this by feeding GPT a bundle of book summaries or full texts from the 1400s and asking it to synthesize what it learns.
Below is the result of running the API on the full data sample of about 30 books:
The translated excerpts from 15th-century English texts reveal a rich tapestry of themes and concerns that reflect the worldview of the time. Central to these writings is a profound engagement with spiritual and moral questions—particularly the relationship between the divine and the individual, the nature of virtue and vice, and the societal structures that govern human conduct. These texts consistently emphasize faith, obedience to God, and the pursuit of virtue as essential for both personal salvation and communal harmony.
One prominent theme is the concept of spiritual marriage, especially within the consecration of religious sisters. The texts liken a woman’s commitment to Christ to a marital bond, requiring the renunciation of worldly attachments and ambitions. This metaphor signals a higher spiritual calling—one that transcends social roles and earthly desires. Such narratives reflect a deeply hierarchical worldview, where divine order is mirrored in the social order, and any moral decay among the clergy or nobility threatens the spiritual and social fabric of society.
The behavior of rulers and clergy is a recurring focus. Leaders are called upon to embody justice, humility, and wisdom, while being warned against the perils of greed and corruption. The moral health of society is seen as a reflection of its leaders’ character. Exhortations directed at priests and nobles urge them to lead by example, reinforcing the belief that the virtues—or vices—of those in power have a ripple effect throughout the entire community.
Another striking aspect is the concern for the afterlife and the fate of souls. These texts reflect the medieval preoccupation with death, judgment, and eternal salvation. Confession, penance, and the intercession of saints are portrayed as vital spiritual practices. Notably, the well-being of the dead is tied to the actions of the living—through prayers and acts of charity—highlighting a strong ethos of communal responsibility for spiritual outcomes.
In total, these 15th-century texts articulate a worldview that is deeply spiritual, hierarchical, and communal.
These documents offer more than historical insight—they provide a vivid lens into how people of the time understood virtue, duty, leadership, and the human journey toward redemption.
We then prompted the LLM API with the following:
Prompt:
Based on the following translated excerpts from 15th-century English books,
identify 5–7 specific passages or quotations that are especially revealing or insightful.
For each one, explain what it tells us about the worldview, values, or concerns of the period.
You may focus on themes like spirituality, governance, morality, death, gender, or community.
Present each excerpt followed by your commentary in 2–3 sentences.
1. Marriage to Christ
“This marriage must be desired by you with all your heart and free will, not compelled by any of your friends, nor for worldly honor or comfort, but only for the love and service you owe to the mighty prince…”
Commentary: This excerpt reveals the deep spiritual values of the 15th century, emphasizing personal choice and devotion in religious life. The metaphor of marrying Christ illustrates a profound commitment to spiritual purity and divine service, reflecting the era’s belief in fulfillment through religious dedication rather than worldly attachments.
2. Bishop John of Ely’s Address
“He emphasized the importance of recognizing the role of priests in the downfall of the community and lamented the moral decay among the clergy.”
Commentary: This passage reflects anxiety over moral decline within the Church. It underscores the essential role of clergy as ethical exemplars, and the societal risk posed by their failure. Calls for reform like this illustrate broader cultural concerns about spiritual leadership and public morality.
3. The Art of Dying
“When someone is likely to die, it is most important to have a special friend who will sincerely help and pray for them…”
Commentary: This excerpt captures the communal aspect of death in medieval society. Spiritual preparation for dying was a shared responsibility, where friends played a key role in prayer and moral support, reflecting the deeply social and religious understanding of mortality.
4. Moral Sayings of Christians
“Without justice, no land or country can thrive in good prosperity. Without faith, no creature can please God…”
Commentary: The text encapsulates the belief that societal success hinges on moral integrity and religious faith. Rather than wealth or power, virtues like justice and faith were seen as the true cornerstones of both individual and collective well-being.
5. Pope Innocent VIII’s Letter
“If any parents, friends, or other Christians… give or send one-twentieth of a ducat for the soul of any such deceased person… both the donors and the deceased will be participants in all the suffrages…”
Commentary: This passage highlights the medieval belief in indulgences and spiritual economy. Material donations were thought to secure spiritual benefits, not only for the deceased but also for the donor—an example of the era’s blend of religious, social, and economic life.
6. Governance and Restraint of Desire
“Therefore, emperors, princes, and governors of the past have wisely and politically established various laws to restrain such excessive desires…”
Commentary: This excerpt reflects a vision of governance grounded in moral responsibility. Leaders were expected to curb human vice through wise laws, upholding social harmony and divine justice—suggesting that political leadership was a sacred duty.
7. Spiritual Plea for Mercy
“I cry for God’s mercy, I cry for God’s mercy, I cry for God’s mercy…”
Commentary: A raw and emotional expression of penitence, this repeated plea for mercy shows the intensity of religious devotion. It echoes a worldview in which human frailty demands constant repentance and divine grace, core themes in medieval spiritual life.