Overview
Out of a total of 132,846 EEBO records, 60,227 (45.34%) are in
EEBO-TCP. Note that 66 EEBO records have multiple TCP ids, so the
number of EEBO-TCP records is somewhat higher.
The 60,327 EEBO-TCP records are divided into phase 1 and phase 2. In
detail, phase 1 contains 25,368 of these records (42.05%) while phase 2
contains 34,959 of these records (57.95%). In terms of EEBO, 25,304
records (19.05%) are in EEBO-TCP phase 1, while 34,931 records (26.29%)
are in EEBO-TCP phase 2.
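The shares quoted above can be re-derived from the counts. The following is a minimal base R sketch; the variable names and the `pct()` helper are illustrative and do not come from the original analysis code.

```r
# Counts quoted in the overview above.
eebo_total <- 132846
eebo_in_tcp <- 60227
tcp_total <- 60327
tcp_by_phase <- c(phase_1 = 25368, phase_2 = 34959)
eebo_by_phase <- c(phase_1 = 25304, phase_2 = 34931)

# Hypothetical helper: percentage rounded to two decimals.
pct <- function(part, whole) round(100 * part / whole, 2)

share_eebo_in_tcp <- pct(eebo_in_tcp, eebo_total)
share_tcp_by_phase <- pct(tcp_by_phase, tcp_total)
share_eebo_by_phase <- pct(eebo_by_phase, eebo_total)

print(share_eebo_in_tcp)
print(share_tcp_by_phase)
print(share_eebo_by_phase)
```

The two phase counts add up to the 60,327 EEBO-TCP records. On the EEBO side, 25,304 + 34,931 = 60,235 is slightly above 60,227, which would be consistent with a handful of EEBO records being linked to both phases, although the source does not state this explicitly.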
In terms of the ESTC, out of the 132,846 EEBO records, 6,802 (5.12%)
could not be matched to an ESTC record and will be left out of the
analysis. On the other hand, 7,373 EEBO records (5.55%) were matched to
more than one ESTC record, possibly causing bias.
Out of the 60,327 EEBO-TCP records, 1,143 (1.89%) could not be
matched to an ESTC record and will be left out of the analysis. On the
other hand, 3,269 EEBO-TCP records (5.42%) were matched to more than one
ESTC record, possibly causing bias.
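As an illustration of how unmatched, singly matched and multiply matched records can be counted, here is a small base R sketch. The link table, its column names and the ids are made up for the example; the actual matching data is not shown in this report.

```r
# Hypothetical EEBO-to-ESTC link table: one row per (eebo_id, estc_id) pair.
links <- data.frame(
  eebo_id = c("E1", "E2", "E2", "E3", "E5"),
  estc_id = c("S10", "S20", "S21", "S30", "S50"),
  stringsAsFactors = FALSE
)
all_eebo_ids <- c("E1", "E2", "E3", "E4", "E5")

# Number of distinct ESTC records per EEBO id (0 for ids with no link).
matches_per_id <- tapply(links$estc_id, links$eebo_id,
                         function(x) length(unique(x)))
n_matches <- setNames(rep(0L, length(all_eebo_ids)), all_eebo_ids)
n_matches[names(matches_per_id)] <- matches_per_id

# Bucket ids into unmatched (0), single (1) and multiple (2 or more).
match_status <- cut(n_matches, breaks = c(-Inf, 0, 1, Inf),
                    labels = c("unmatched", "single", "multiple"))
status_counts <- table(match_status)
print(status_counts)
```

In this toy data, E4 is unmatched and E2 is matched to two ESTC records, which are the two situations the paragraph above flags as left out of the analysis and as a possible source of bias, respectively.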
In the analysis, only ESTC records with publication years in the
range [1474,1700) have been included. This excludes 4,862 (4.17%) of
the ESTC records that have representation in EEBO, possibly causing
bias. Of the ESTC records with representation in EEBO-TCP, 2,119
(3.41%) are removed by this filtering condition.
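The half-open interval [1474,1700) includes 1474 but excludes 1700. A minimal sketch of such a filter in base R follows; the sample years are invented, and dropping records with a missing year is an assumption rather than something the report states.

```r
# Hypothetical publication years, including the boundary cases.
years <- c(1473, 1474, 1599, 1699, 1700, 1701, NA)

# Half-open interval [1474, 1700): lower bound included, upper bound excluded.
# Assumption: records without a year are dropped as well.
in_range <- !is.na(years) & years >= 1474 & years < 1700

kept <- years[in_range]
print(kept)
```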
In the end, our working dataset:
- From the viewpoint of EEBO, contains 121,328 (91.33%) out of the
original 132,846 EEBO ids.
- From the viewpoint of EEBO-TCP, contains 57,461 (95.25%) out of the
original 60,327 EEBO-TCP ids.
- Consists of 132,412 ESTC records, of which 111,816 (84.45%) we
estimate to have representation in EEBO, and 60,095 (45.38%) to have
representation in EEBO-TCP.
Publication type analysis
Coverage of different publication types in EEBO
library(tidyverse) # dplyr, forcats and ggplot2 functions used throughout
library(ggbeeswarm) # geom_quasirandom()
bind_rows(
df %>% mutate(group = "Editions"),
df %>% filter(edition_type!="Singular") %>% group_by(work_id,type,first_publication_year,publication_year) %>%
summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),in_eebo_tcp_phase_1=any(in_eebo_tcp_phase_1),in_eebo_tcp_phase_2=any(in_eebo_tcp_phase_2),.groups="drop") %>%
mutate(group = "Works",edition_type=if_else(publication_year==first_publication_year,"First year work","Later work"))
) %>%
mutate(edition_type=fct_relevel(edition_type,"Singular","First year work","Later work","First year edition","Later edition")) %>%
filter(type %in% c("Book","Pamphlet")) %>%
group_by(publication_year, edition_type, group, type, in_eebo) %>%
tally() %>%
mutate(prop = n / sum(n), tn = sum(n)) %>%
filter(in_eebo) %>%
ggplot(aes(x = type, y = prop, group = edition_type, color = edition_type)) +
geom_quasirandom(aes(size = tn), dodge = 1.0) +
stat_summary(aes(group = edition_type), position = position_dodge(width = 1.0), fun = median, fun.min = median, fun.max = median, geom = "crossbar", width = 0.5, color = "red") +
theme_hsci_discrete() +
xlab(NULL) +
ylab("EEBO coverage") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
scale_size(breaks = c(250, 500, 1500), range = c(0.1, 8.0)) +
theme(legend.justification = c(0, 0), legend.position = c(0.02, 0.02), legend.background = element_blank(), legend.box.just = "bottom", legend.key = element_blank(), legend.box = "horizontal") +
labs(color = "Representation type", size = "Count") +
guides(shape = "none")

In terms of coverage of ESTC’s pre-18th-century material, EEBO does
well, with a median coverage of about 95% of books at both the edition
and the work level, and only a slight drop in coverage for later-year
editions (meaning that even for later editions, EEBO often contains at
least one edition from each year, but may not contain all distinct
printings from that year).
For pamphlets, coverage is about 85% across the board, with an
interesting increase for later-year editions. This may be caused either
by reprinted pamphlets having been thought of as important to capture,
or by e.g. temporal artifacts, although overall coverage does not
appear to improve with time, as seen later.
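To make the quantity behind these plots concrete: each point is the share of one year's items that are in EEBO, and the red crossbar is the median of those yearly shares. The following base R sketch uses invented data and only mirrors the idea of the dplyr pipeline above, not its exact steps.

```r
# Hypothetical edition-level records: year, type and an EEBO flag.
editions <- data.frame(
  publication_year = c(1600, 1600, 1600, 1600, 1601, 1601,
                       1602, 1602, 1602, 1602),
  type = "Book",
  in_eebo = c(TRUE, TRUE, TRUE, FALSE, TRUE, TRUE,
              TRUE, FALSE, TRUE, FALSE)
)

# Per-year coverage: share of that year's editions flagged as in EEBO.
yearly <- aggregate(in_eebo ~ publication_year + type,
                    data = editions, FUN = mean)
names(yearly)[names(yearly) == "in_eebo"] <- "prop"

# Median of the yearly shares, the statistic drawn as a crossbar.
median_coverage <- median(yearly$prop)
print(yearly)
print(median_coverage)
```

One difference worth noting: the pipeline above keeps only the rows where the flag is true after tallying, so a year and group with no covered items produces no point at all, whereas this sketch would record it as 0.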
Coverage of different publication types in EEBO-TCP
library(ggbeeswarm)
bind_rows(
df %>% mutate(group = "Editions"),
df %>% filter(edition_type!="Singular") %>% group_by(work_id,type,first_publication_year,publication_year) %>%
summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),in_eebo_tcp_phase_1=any(in_eebo_tcp_phase_1),in_eebo_tcp_phase_2=any(in_eebo_tcp_phase_2),.groups="drop") %>%
mutate(group = "Works",edition_type=if_else(publication_year==first_publication_year,"First year work","Later work"))
) %>%
mutate(edition_type=fct_relevel(edition_type,"Singular","First year work","Later work","First year edition","Later edition")) %>%
filter(type %in% c("Book","Pamphlet")) %>%
group_by(publication_year, edition_type, group, type, in_eebo_tcp) %>%
tally() %>%
mutate(prop = n / sum(n), tn = sum(n)) %>%
filter(in_eebo_tcp) %>%
ggplot(aes(x = type, y = prop, group = edition_type, color = edition_type)) +
geom_quasirandom(aes(size = tn), dodge = 1.0) +
stat_summary(aes(group = edition_type), position = position_dodge(width = 1.0), fun = median, fun.min = median, fun.max = median, geom = "crossbar", width = 0.5, color = "red") +
theme_hsci_discrete() +
xlab(NULL) +
ylab("EEBO-TCP coverage") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
scale_size(breaks = c(250, 500, 1500), range = c(0.1, 8.0)) +
theme(legend.justification = c(0, 1), legend.position = c(0.02, 0.98), legend.background = element_blank(), legend.box.just = "bottom", legend.key = element_blank(), legend.box = "horizontal") +
labs(color = "Representation type", size = "Count") +
guides(shape = "none")

Coverage of different publication types in EEBO-TCP phase 1
library(ggbeeswarm)
bind_rows(
df %>% mutate(group = "Editions"),
df %>% filter(edition_type!="Singular") %>% group_by(work_id,type,first_publication_year,publication_year) %>%
summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),in_eebo_tcp_phase_1=any(in_eebo_tcp_phase_1),in_eebo_tcp_phase_2=any(in_eebo_tcp_phase_2),.groups="drop") %>%
mutate(group = "Works",edition_type=if_else(publication_year==first_publication_year,"First year work","Later work"))
) %>%
mutate(edition_type=fct_relevel(edition_type,"Singular","First year work","Later work","First year edition","Later edition")) %>%
filter(type %in% c("Book","Pamphlet")) %>%
group_by(publication_year, edition_type, group, type, in_eebo_tcp_phase_1) %>%
tally() %>%
mutate(prop = n / sum(n), tn = sum(n)) %>%
filter(in_eebo_tcp_phase_1) %>%
ggplot(aes(x = type, y = prop, group = edition_type, color = edition_type)) +
geom_quasirandom(aes(size = tn), dodge = 1.0) +
stat_summary(aes(group = edition_type), position = position_dodge(width = 1.0), fun = median, fun.min = median, fun.max = median, geom = "crossbar", width = 0.5, color = "red") +
theme_hsci_discrete() +
xlab(NULL) +
ylab("EEBO-TCP phase 1 coverage") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
scale_size(breaks = c(250, 500, 1500), range = c(0.1, 8.0)) +
theme(legend.justification = c(0, 1), legend.position = c(0.02, 0.98), legend.background = element_blank(), legend.box.just = "bottom", legend.key = element_blank(), legend.box = "horizontal") +
labs(color = "Representation type", size = "Count") +
guides(shape = "none")

Coverage of different publication types in EEBO-TCP phase 2
library(ggbeeswarm)
bind_rows(
df %>% mutate(group = "Editions"),
df %>% filter(edition_type!="Singular") %>% group_by(work_id,type,first_publication_year,publication_year) %>%
summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),in_eebo_tcp_phase_1=any(in_eebo_tcp_phase_1),in_eebo_tcp_phase_2=any(in_eebo_tcp_phase_2),.groups="drop") %>%
mutate(group = "Works",edition_type=if_else(publication_year==first_publication_year,"First year work","Later work"))
) %>%
mutate(edition_type=fct_relevel(edition_type,"Singular","First year work","Later work","First year edition","Later edition")) %>%
filter(type %in% c("Book","Pamphlet")) %>%
group_by(publication_year, edition_type, group, type, in_eebo_tcp_phase_2) %>%
tally() %>%
mutate(prop = n / sum(n), tn = sum(n)) %>%
filter(in_eebo_tcp_phase_2) %>%
ggplot(aes(x = type, y = prop, group = edition_type, color = edition_type)) +
geom_quasirandom(aes(size = tn), dodge = 1.0) +
stat_summary(aes(group = edition_type), position = position_dodge(width = 1.0), fun = median, fun.min = median, fun.max = median, geom = "crossbar", width = 0.5, color = "red") +
theme_hsci_discrete() +
xlab(NULL) +
ylab("EEBO-TCP phase 2 coverage") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
scale_size(breaks = c(250, 500, 1500), range = c(0.1, 8.0)) +
theme(legend.justification = c(0, 1), legend.position = c(0.02, 0.98), legend.background = element_blank(), legend.box.just = "bottom", legend.key = element_blank(), legend.box = "horizontal") +
labs(color = "Representation type", size = "Count") +
guides(shape = "none")

For coverage in EEBO-TCP, a clear pattern emerges where coverage of
singular and first editions is much better than coverage of later
editions. There are also no clear differences between EEBO-TCP phase 1
and phase 2 in this respect. This has an important bearing on all
following analyses, which in the case of EEBO-TCP should mostly
evaluate coverage on the work level. As a separate observation,
coverage of books and pamphlets also seems quite even. Another
observation is that EEBO-TCP phase 2 contains more singular works than
phase 1. This may indicate a broader collection of “non-core” works,
instead of focusing on first editions of popular (and thus later
reprinted) works.
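The difference between edition-level and work-level coverage can be illustrated with a toy example. As in the `summarize(... any(...))` steps of the code above, a work counts as covered when any of its editions is covered. The data below is invented.

```r
# Hypothetical editions: W1 has three editions, only the first is in EEBO-TCP.
editions <- data.frame(
  work_id = c("W1", "W1", "W1", "W2", "W3", "W3"),
  in_eebo_tcp = c(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE)
)

# Edition-level coverage: share of editions that are in EEBO-TCP.
edition_coverage <- mean(editions$in_eebo_tcp)

# Work-level coverage: a work is covered if any of its editions is.
work_flags <- tapply(editions$in_eebo_tcp, editions$work_id, any)
work_coverage <- mean(work_flags)

print(edition_coverage)
print(work_coverage)
```

Here two of three works are covered but only two of six editions, which is the kind of gap that makes work-level evaluation the more informative view for EEBO-TCP.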
Edition-level temporal overview
df %>% mutate(g = case_when(
!certain ~ "Uncertain dating",
in_eebo_tcp_phase_1 ~ "In EEBO-TCP phase 1",
in_eebo_tcp_phase_2 ~ "In EEBO-TCP phase 2",
in_eebo ~ "In EEBO",
TRUE ~ "ESTC total",
)) %>%
ggplot(aes(x = publication_year, fill = fct_relevel(g, "Uncertain dating", "ESTC total", "In EEBO","In EEBO-TCP phase 1","In EEBO-TCP phase 2"))) +
geom_bar(width = 1) +
theme_hsci_discrete() +
scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
scale_y_continuous(breaks = seq(0, 10000, by = 500)) +
xlab("Year") +
ylab("ESTC entries") +
theme(legend.justification = c(0, 1), legend.position = c(0.05, 0.95), legend.background = element_blank(), legend.key = element_blank()) +
labs(fill = NULL) +
guides(fill = guide_legend(reverse = TRUE))

In terms of a temporal overview, it is important to note how, in this
absolute graph, the number of entries grows significantly overall
through time, and also varies widely and spikes multiple times between
1640 and 1700 (with the larger bump between 1640 and 1660 most likely
consisting mainly of the Thomason Tracts).
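The grouping used for the bar colours assigns each entry to the first matching label in a fixed priority order, so an entry that is both in EEBO and in EEBO-TCP phase 1 is shown only as phase 1. A base R sketch of the same priority logic follows; the `classify_entry()` helper and the sample rows are illustrative and not part of the original code.

```r
# Hypothetical helper mirroring the priority order of the case_when() above:
# uncertain dating first, then phase 1, phase 2, EEBO, and finally the rest.
classify_entry <- function(certain, phase_1, phase_2, in_eebo) {
  ifelse(!certain, "Uncertain dating",
    ifelse(phase_1, "In EEBO-TCP phase 1",
      ifelse(phase_2, "In EEBO-TCP phase 2",
        ifelse(in_eebo, "In EEBO", "ESTC total"))))
}

entries <- data.frame(
  certain = c(FALSE, TRUE, TRUE, TRUE, TRUE),
  phase_1 = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  phase_2 = c(FALSE, TRUE, TRUE, FALSE, FALSE),
  in_eebo = c(TRUE, TRUE, TRUE, TRUE, FALSE)
)
entries$g <- with(entries, classify_entry(certain, phase_1, phase_2, in_eebo))
print(entries$g)
```

Because the categories are mutually exclusive, the stacked bars add up to the ESTC total for each year, and the "ESTC total" label effectively marks the entries that are in none of the other groups.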
df %>% filter(certain) %>% mutate(g = case_when(
in_eebo_tcp_phase_1 ~ "In EEBO-TCP phase 1",
in_eebo_tcp_phase_2 ~ "In EEBO-TCP phase 2",
in_eebo ~ "In EEBO",
TRUE ~ "Not in EEBO",
)) %>%
ggplot(aes(x = publication_year, fill = fct_relevel(g, "Not in EEBO", "In EEBO","In EEBO-TCP phase 2","In EEBO-TCP phase 1"))) +
geom_bar(width = 1,position='fill') +
theme_hsci_discrete() +
scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
scale_y_continuous(breaks = seq(0, 1, by = 0.1),labels=scales::percent_format(accuracy=1)) +
xlab("Year") +
ylab("Proportion of ESTC entries") +
theme(legend.position="bottom") +
labs(fill = NULL) +
guides(fill = guide_legend(reverse = TRUE))

In terms of edition-level proportional coverage, EEBO coverage is
quite balanced throughout the period, with just a slight drop at the end
of the 17th century. For EEBO-TCP, edition-level coverage is much more
varied, but as noted, edition-level coverage is not a very meaningful
measure for it.
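The proportional chart shows, for each year, the share of entries falling into each group (what `position='fill'` draws). A small base R sketch of that computation, on invented data:

```r
# Hypothetical entries with a year and a coverage group.
records <- data.frame(
  publication_year = c(1650, 1650, 1650, 1650, 1651, 1651),
  g = c("In EEBO-TCP phase 1", "In EEBO", "In EEBO", "Not in EEBO",
        "In EEBO", "Not in EEBO")
)

# Counts per year and group, then row-wise shares so each year sums to 1.
counts <- table(records$publication_year, records$g)
shares <- prop.table(counts, margin = 1)
print(round(shares, 2))
```

Normalising within each year removes the strong growth in absolute numbers seen in the previous chart, so only the balance between groups remains visible.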
Work-level temporal overview
df %>%
filter(first_publication_year>1474) %>%
group_by(work_id,first_publication_year) %>%
summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),in_eebo_tcp_phase_1=any(in_eebo_tcp_phase_1),in_eebo_tcp_phase_2=any(in_eebo_tcp_phase_2),certain=any(first_year_publication & certain),.groups="drop") %>%
mutate(g = case_when(
!certain ~ "Uncertain dating",
in_eebo_tcp_phase_1 ~ "In EEBO-TCP phase 1",
in_eebo_tcp_phase_2 ~ "In EEBO-TCP phase 2",
in_eebo ~ "In EEBO",
TRUE ~ "ESTC total",
)) %>%
ggplot(aes(x = first_publication_year, fill = fct_relevel(g, "Uncertain dating", "ESTC total", "In EEBO","In EEBO-TCP phase 2","In EEBO-TCP phase 1"))) +
geom_bar(width = 1) +
theme_hsci_discrete() +
scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
scale_y_continuous(breaks = seq(0, 10000, by = 500)) +
xlab("Year of first publication") +
ylab("ESTC works") +
theme(legend.justification = c(0, 1), legend.position = c(0.05, 0.95), legend.background = element_blank(), legend.key = element_blank()) +
labs(fill = NULL) +
guides(fill = guide_legend(reverse = TRUE))

df %>%
filter(first_publication_year>1474) %>%
group_by(work_id,first_publication_year) %>%
summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),in_eebo_tcp_phase_1=any(in_eebo_tcp_phase_1),in_eebo_tcp_phase_2=any(in_eebo_tcp_phase_2),certain=any(first_year_publication & certain),.groups="drop") %>% mutate(g = case_when(
in_eebo_tcp_phase_1 ~ "In EEBO-TCP phase 1",
in_eebo_tcp_phase_2 ~ "In EEBO-TCP phase 2",
in_eebo ~ "In EEBO",
TRUE ~ "Not in EEBO",
)) %>%
filter(certain) %>%
ggplot(aes(x = first_publication_year, fill = fct_relevel(g, "Not in EEBO", "In EEBO","In EEBO-TCP phase 2","In EEBO-TCP phase 1"))) +
geom_bar(width = 1,position='fill') +
theme_hsci_discrete() +
scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
scale_y_continuous(breaks = seq(0, 1, by = 0.1),labels=scales::percent_format(accuracy=1)) +
xlab("Year of first publication") +
ylab("Proportion of ESTC works") +
theme(legend.position="bottom") +
labs(fill = NULL) +
guides(fill = guide_legend(reverse = TRUE))

In terms of work-level coverage, EEBO-TCP also appears quite well
balanced temporally, apart from dips between 1500 and 1530. However, it
must be noted that the total amount of content is also very low for
those early years, so larger variation can be expected. The addition of
phase 2 improves the evenness of EEBO-TCP coverage a bit with regard to
phase 1, where coverage diminishes toward the end of the century.
Document type coverage through time
bind_rows(
df %>% mutate(group = "Editions",type=recode(type,"Book"="Book (edition-level)","Pamphlet"="Pamphlet (edition-level)")),
df %>% group_by(work_id,type,first_publication_year,publication_year) %>%
summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
mutate(group = "Works")
) %>%
mutate(type=fct_relevel(type,"Pamphlet (edition-level)","Book (edition-level)","Pamphlet","Book")) %>%
filter(certain) %>%
filter(!is.na(type),type!="In-between") %>%
group_by(publication_year, type, in_eebo) %>%
tally() %>%
mutate(prop = n / sum(n), tn = sum(n)) %>%
filter(in_eebo) %>%
ggplot(aes(x = publication_year, y = prop, color = type)) +
geom_smooth(aes(weight = n, fill = type), span = 0.3, method='loess',formula=y~x) +
geom_point(color = "gray", shape = 21, aes(size = tn)) +
geom_point(aes(size = n)) +
theme_hsci_discrete() +
scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.10)) +
xlab("Year") +
ylab("EEBO coverage") +
theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
scale_size(breaks = c(500, 2000, 3500), range = c(0.1, 8.0))

Drilling in and separating books and pamphlets from each other, we
can see that EEBO coverage of both is very good, apart from a noticeable
drop in pamphlet coverage in the late 17th century.
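The trend lines in these charts are loess smooths of the yearly coverage shares, weighted by the yearly counts so that sparse years pull the curve less. The following is a rough stand-alone sketch using `stats::loess` on simulated data; the data-generating numbers are arbitrary, and only the `span = 0.3` setting and the use of `n` as weight are taken from the plotting code.

```r
set.seed(1)

# Simulated yearly data: total items (tn), coverage share (prop), covered (n).
yearly <- data.frame(publication_year = 1600:1700)
yearly$tn <- round(seq(200, 1500, length.out = nrow(yearly)))
noise <- rnorm(nrow(yearly), sd = 0.03)
trend <- 0.9 - 0.001 * (yearly$publication_year - 1600)
yearly$prop <- pmin(1, pmax(0, trend + noise))
yearly$n <- round(yearly$prop * yearly$tn)

# Weighted local regression of coverage on year, as in the smoothed lines.
fit <- loess(prop ~ publication_year, data = yearly, weights = n, span = 0.3)
yearly$smoothed <- predict(fit)

print(head(yearly[, c("publication_year", "prop", "smoothed")]))
```

A span of 0.3 means each local fit uses roughly the nearest 30% of the years, which is why short-lived dips are partly smoothed out of the curves.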
bind_rows(
df %>% mutate(group = "Editions",type=recode(type,"Book"="Book (edition-level)","Pamphlet"="Pamphlet (edition-level)")),
df %>% group_by(work_id,type,first_publication_year,publication_year) %>%
summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
mutate(group = "Works")
) %>%
mutate(type=fct_relevel(type,"Pamphlet (edition-level)","Book (edition-level)","Pamphlet","Book")) %>%
filter(certain) %>%
filter(!is.na(type),type!="In-between") %>%
group_by(publication_year, type, in_eebo_tcp) %>%
tally() %>%
mutate(prop = n / sum(n), tn = sum(n)) %>%
filter(in_eebo_tcp) %>%
ggplot(aes(x = publication_year, y = prop, color = type)) +
geom_smooth(aes(weight = n, fill = type), span = 0.3, method='loess',formula=y~x) +
geom_point(color = "gray", shape = 21, aes(size = tn)) +
geom_point(aes(size = n)) +
theme_hsci_discrete() +
scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.10)) +
xlab("Year") +
ylab("EEBO-TCP coverage") +
theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
scale_size(breaks = c(500, 2000, 3500), range = c(0.1, 8.0))

For EEBO-TCP, on the work level, the same drop in coverage for
pamphlets at the end of the 17th century can be seen, but otherwise
coverage is relatively stable through time for both books and
pamphlets, except for a marked uptick between 1640 and 1660 (caused most
likely by more judicious inclusion of the Thomason Tracts). On the work
level, pamphlets are just slightly better covered than books, but on
the edition level, coverage of books is much lower. This can be seen as
a natural consequence of EEBO-TCP favouring first editions. Books
typically have more editions than pamphlets, so excluding later
editions affects edition-level coverage for books much more than it
does for pamphlets.
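The arithmetic behind this argument can be made explicit. If only one edition of each covered work is transcribed, edition-level coverage is roughly work-level coverage divided by the average number of editions per work. All numbers below are invented to show the effect and are not estimates from the data.

```r
# Hypothetical corpus: 100 works of each type, with different edition counts.
works <- c(Book = 100, Pamphlet = 100)
editions <- c(Book = 300, Pamphlet = 120)

# Assumption: 60% of works are covered, each by exactly one edition.
work_coverage <- c(Book = 0.6, Pamphlet = 0.6)
covered_editions <- works * work_coverage

# Edition-level coverage falls as the number of editions per work grows.
edition_coverage <- covered_editions / editions
print(edition_coverage)
```

With identical work-level coverage, the type with three editions per work ends up at 20% edition-level coverage, against 50% for the type with 1.2 editions per work.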
bind_rows(
df %>% mutate(group = "Editions",type=recode(type,"Book"="Book (edition-level)","Pamphlet"="Pamphlet (edition-level)")),
df %>% group_by(work_id,type,first_publication_year,publication_year) %>%
summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),in_eebo_tcp_phase_1=any(in_eebo_tcp_phase_1),certain=any(first_year_publication & certain),.groups="drop") %>%
mutate(group = "Works")
) %>%
mutate(type=fct_relevel(type,"Pamphlet (edition-level)","Book (edition-level)","Pamphlet","Book")) %>%
filter(certain) %>%
filter(!is.na(type),type!="In-between") %>%
group_by(publication_year, type, in_eebo_tcp_phase_1) %>%
tally() %>%
mutate(prop = n / sum(n), tn = sum(n)) %>%
filter(in_eebo_tcp_phase_1) %>%
ggplot(aes(x = publication_year, y = prop, color = type)) +
geom_smooth(aes(weight = n, fill = type), span = 0.3, method='loess',formula=y~x) +
geom_point(color = "gray", shape = 21, aes(size = tn)) +
geom_point(aes(size = n)) +
theme_hsci_discrete() +
scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.10)) +
xlab("Year") +
ylab("EEBO-TCP phase 1 coverage") +
theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
scale_size(breaks = c(500, 2000, 3500), range = c(0.1, 8.0))

Looking only at phase 1, there is a clear bump in the representation
of pamphlets in the 1560s, which interestingly is corrected for when
phase 2 is also taken into account. In terms of book coverage, there is
also a linear decline in coverage between about 1560 and 1650 (before
1540 the data is so sparse that reliable conclusions cannot be drawn
from it).
Topical coverage EEBO-TCP vs EEBO
EEBO work-level genre use frequencies
(subset that is in ESTC to get the work information)
eebo_ustc_genres %>%
inner_join(eebo_core,by=c("eebo_id")) %>%
inner_join(estc_core,by=c("estc_id")) %>%
mutate(ustc_genre=str_trunc(ustc_genre,65),status=case_when(
in_eebo_tcp_phase_1 ~ "EEBO-TCP phase 1",
in_eebo_tcp_phase_2 ~ "EEBO-TCP phase 2",
TRUE ~ "Not in EEBO-TCP")) %>%
group_by(ustc_genre,status) %>%
summarize(n=n_distinct(work_id),.groups="drop") %>%
group_by(ustc_genre) %>%
mutate(tn=sum(n)) %>%
ungroup() %>%
mutate(ustc_genre=fct_reorder(ustc_genre,tn)) %>%
ggplot(aes(x=ustc_genre,y=n,fill=status)) +
geom_col() +
theme_hsci_discrete() +
theme(legend.justification = c(1, 0), legend.box.just = "bottom", legend.position = c(0.98, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
labs(fill=NULL) +
xlab("USTC genre") +
ylab("Number of works") +
scale_y_continuous(labels=scales::number) +
coord_flip()

Open question: are the USTC categories usable? Is this a believable
genre distribution? If it is, the graphs below show interesting
differences and temporal shifts in the coverage of the various
categories, the interpretation of which I leave up to you.
EEBO-TCP work-level genre coverage
eebo_ustc_genres %>%
inner_join(eebo_core,by=c("eebo_id")) %>%
inner_join(estc_core,by=c("estc_id")) %>%
mutate(ustc_genre=str_trunc(ustc_genre,65)) %>%
group_by(work_id,ustc_genre) %>%
summarize(in_eebo_tcp=any(in_eebo_tcp),in_eebo_tcp_phase_1=any(in_eebo_tcp_phase_1),in_eebo_tcp_phase_2=any(in_eebo_tcp_phase_2),.groups="drop") %>%
group_by(ustc_genre) %>%
summarize(n=n(),`EEBO-TCP`=sum(in_eebo_tcp)/n(),`EEBO-TCP phase 1`=sum(in_eebo_tcp_phase_1)/n(),`EEBO-TCP phase 2`=sum(in_eebo_tcp_phase_2)/n(),.groups="drop") %>%
mutate(ustc_genre=fct_reorder(str_c(ustc_genre,' (',n,')'),n)) %>%
pivot_longer(`EEBO-TCP phase 1`:`EEBO-TCP phase 2`,names_to="part", values_to = "prop") %>%
ggplot(aes(x=ustc_genre,y=prop,fill=fct_relevel(part,'EEBO-TCP phase 2'))) +
geom_col(position='stack') +
theme_hsci_discrete() +
scale_y_continuous(labels=scales::percent_format(accuracy=1)) +
xlab("USTC genre") +
ylab("Coverage by work") +
theme(legend.position = "bottom") +
# theme(legend.justification = c(1, 0), legend.box.just = "bottom", legend.position = c(0.98, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
labs(fill=NULL) +
coord_flip()

Here, we can first see which categories have been excluded from
EEBO-TCP: almanacs, academic dissertations, astrology and cosmography,
as well as dictionaries. Apart from this, we also see how poetry and
drama are heavily overemphasized in EEBO-TCP phase 1, whereas phase 2
corrects nicely for these as well as other imbalances. What remains
notable is the low coverage of dialectics and rhetoric, linguistics and
philology, and classical authors.
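The per-genre coverage figures are computed at the work level: rows are first collapsed to one flag per work and genre, and the share of covered works is then taken per genre. A base R sketch on invented data, using the column names from the code above:

```r
# Hypothetical rows: one per EEBO record, with its work, genre and TCP flag.
rows <- data.frame(
  work_id = c("W1", "W1", "W2", "W6", "W3", "W4", "W5"),
  ustc_genre = c("Poetry", "Poetry", "Poetry", "Poetry",
                 "Almanacs", "Almanacs", "Almanacs"),
  in_eebo_tcp = c(FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
)

# Step 1: a work counts as covered in a genre if any of its records is.
per_work <- aggregate(in_eebo_tcp ~ work_id + ustc_genre,
                      data = rows, FUN = any)

# Step 2: share of covered works per genre.
coverage <- tapply(per_work$in_eebo_tcp, per_work$ustc_genre, mean)
print(coverage)
```

One caveat when reading the stacked chart: phase 1 and phase 2 shares are stacked on top of each other, so a work present in both phases would be counted in both segments.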
EEBO-TCP phase genre composition comparison
EEBO work-level language frequencies
eebo_core %>%
inner_join(estc_core,by=c("estc_id")) %>%
mutate(status=case_when(
in_eebo_tcp_phase_1 ~ "EEBO-TCP phase 1",
in_eebo_tcp_phase_2 ~ "EEBO-TCP phase 2",
TRUE ~ "Not in EEBO-TCP")) %>%
group_by(eebo_tls_language,status) %>%
summarize(n=n_distinct(work_id),.groups="drop") %>%
group_by(eebo_tls_language) %>%
mutate(tn=sum(n)) %>%
ungroup() %>%
mutate(eebo_tls_language=fct_reorder(eebo_tls_language,tn)) %>%
ggplot(aes(x=eebo_tls_language,y=n,fill=status)) +
geom_col() +
theme_hsci_discrete() +
theme(legend.justification = c(1, 0), legend.box.just = "bottom", legend.position = c(0.98, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
labs(fill=NULL) +
xlab("Language") +
ylab("Number of works (log10)") +
scale_y_continuous(labels=scales::number,trans="log10") +
coord_flip()

EEBO-TCP work-level language coverage vs EEBO
eebo_core %>%
filter(!is.na(eebo_tls_language)) %>%
inner_join(estc_core,by=c("estc_id")) %>%
group_by(work_id,eebo_tls_language) %>%
summarize(in_eebo_tcp=any(in_eebo_tcp),in_eebo_tcp_phase_1=any(in_eebo_tcp_phase_1),in_eebo_tcp_phase_2=any(in_eebo_tcp_phase_2),.groups="drop") %>%
group_by(eebo_tls_language) %>%
summarize(n=n(),`EEBO-TCP`=sum(in_eebo_tcp)/n(),`EEBO-TCP phase 1`=sum(in_eebo_tcp_phase_1)/n(),`EEBO-TCP phase 2`=sum(in_eebo_tcp_phase_2)/n(),.groups="drop") %>%
mutate(eebo_tls_language=fct_reorder(str_c(eebo_tls_language,' (',n,')'),n)) %>%
pivot_longer(`EEBO-TCP phase 1`:`EEBO-TCP phase 2`,names_to="part", values_to = "prop") %>%
ggplot(aes(x=eebo_tls_language,y=prop,fill=part)) +
geom_col(position='stack') +
theme_hsci_discrete() +
scale_y_continuous(labels=scales::percent_format(accuracy=1)) +
xlab("Language") +
ylab("Coverage by work as compared to EEBO") +
theme(legend.justification = c(1, 0), legend.box.just = "bottom", legend.position = c(0.98, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
labs(fill=NULL) +
coord_flip()

Welsh and Scottish are very well covered. Of the major languages,
Latin in particular is very poorly covered overall, and particularly in
phase 2 (which we already knew from the background info at https://textcreationpartnership.org/tcp-texts/eebo-tcp-early-english-books-online/).
French fares somewhat better, but is still not well covered.
EEBO-TCP phase language composition comparison (against EEBO,
English excluded)
EEBO-TCP phase edition type composition comparison
Genre coverage through time (EEBO-TCP against EEBO)

In many major genre categories such as religious, literature and
history and chronicles, phase 1 of EEBO-TCP shows a clearly diminishing
coverage toward the end of the century. However, when phase 2 is added
to the data, in addition to significantly improving coverage overall,
this bias disappears.
Topical coverage of EEBO vs ESTC through time
Here, we are projecting subject category information from EEBO/ECCO
throughout the whole of the ESTC in order to compare their coverage. For
the 18th century and ECCO, this seemed to work relatively well for all
8 categories. For USTC/EEBO, I was comfortable including only the
religious, history and chronicles, and economics categories.
Using projected ECCO modules
df %>%
filter(first_publication_year>1474) %>%
group_by(work_id,first_publication_year) %>%
summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
filter(certain) %>%
left_join(combined_projected_ecco_modules,by=c("work_id")) %>%
replace_na(list(projected_ecco_module="Other/Unknown")) %>%
mutate(projected_ecco_module=fct_relevel(projected_ecco_module,"Other/Unknown")) %>%
group_by(first_publication_year, projected_ecco_module, in_eebo) %>%
tally() %>%
mutate(prop = n / sum(n), tn = sum(n)) %>%
filter(in_eebo) %>%
ggplot(aes(x = first_publication_year, y = prop, color = projected_ecco_module)) +
geom_point(color = "gray", shape = 21, aes(size = tn)) +
geom_point(aes(size = n)) +
geom_smooth(aes(weight = n, fill = projected_ecco_module), span = 0.3, method='loess',formula=y~x) +
theme_hsci_discrete() +
scale_x_continuous(breaks = seq(1000, 2000, by = 40)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.10)) +
xlab("Year") +
ylab("EEBO coverage") +
theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
coord_cartesian(ylim=c(0.5,1)) +
scale_size(breaks = c(100, 500, 1000), range = c(0.1, 8.0)) +
guides(color="none",fill="none") +
facet_wrap(~projected_ecco_module,ncol=3)

Using projected USTC Religious/History and chronicles/Economics
df %>%
filter(first_publication_year>1474) %>%
group_by(work_id,first_publication_year) %>%
summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
filter(certain) %>%
left_join(estc_projected_ustc_genres %>%
filter(max_prop>=0.7,projected_ustc_genre %in% c("Religious","History and chronicles","Economics")),by=c("work_id")) %>%
replace_na(list(projected_ustc_genre="Other/Unknown")) %>%
mutate(projected_ustc_genre=fct_relevel(projected_ustc_genre,"Other/Unknown")) %>%
group_by(first_publication_year, projected_ustc_genre, in_eebo) %>%
tally() %>%
mutate(prop = n / sum(n), tn = sum(n)) %>%
filter(in_eebo) %>%
ggplot(aes(x = first_publication_year, y = prop, color = projected_ustc_genre)) +
geom_point(color = "gray", shape = 21, aes(size = tn)) +
geom_point(aes(size = n)) +
geom_smooth(aes(weight = n, fill = projected_ustc_genre), span = 0.3, method='loess',formula=y~x) +
theme_hsci_discrete() +
scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.10)) +
xlab("Year") +
ylab("EEBO coverage") +
theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
coord_cartesian(ylim=c(0,1)) +
scale_size(breaks = c(100, 500, 1000), range = c(0.1, 8.0))

Topical coverage of EEBO-TCP vs ESTC through time
Using projected ECCO modules
df %>%
filter(first_publication_year>1474) %>%
group_by(work_id,first_publication_year) %>%
summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
filter(certain) %>%
left_join(combined_projected_ecco_modules,by=c("work_id")) %>%
replace_na(list(projected_ecco_module="Other/Unknown")) %>%
mutate(projected_ecco_module=fct_relevel(projected_ecco_module,"Other/Unknown")) %>%
group_by(first_publication_year, projected_ecco_module, in_eebo_tcp) %>%
tally() %>%
mutate(prop = n / sum(n), tn = sum(n)) %>%
filter(in_eebo_tcp) %>%
ggplot(aes(x = first_publication_year, y = prop, color = projected_ecco_module)) +
geom_point(color = "gray", shape = 21, aes(size = tn)) +
geom_point(aes(size = n)) +
geom_smooth(aes(weight = n, fill = projected_ecco_module), span = 0.3, method='loess',formula=y~x) +
theme_hsci_discrete() +
scale_x_continuous(breaks = seq(1000, 2000, by = 40)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.10)) +
xlab("Year") +
ylab("EEBO-TCP coverage") +
theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
coord_cartesian(ylim=c(0,1)) +
scale_size(breaks = c(100, 500, 1000), range = c(0.1, 8.0)) +
guides(color="none",fill="none") +
facet_wrap(~projected_ecco_module,ncol=3)

Using projected USTC Religious/History and chronicles/Economics
df %>%
filter(first_publication_year>1474) %>%
group_by(work_id,first_publication_year) %>%
summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
filter(certain) %>%
left_join(combined_projected_ustc_genres %>%
filter(projected_ustc_genre %in% c("Religious","History and chronicles","Economics")),by=c("work_id")) %>%
replace_na(list(projected_ustc_genre="Other/Unknown")) %>%
mutate(projected_ustc_genre=fct_relevel(projected_ustc_genre,"Other/Unknown")) %>%
group_by(first_publication_year, projected_ustc_genre, in_eebo_tcp) %>%
tally() %>%
mutate(prop = n / sum(n), tn = sum(n)) %>%
filter(in_eebo_tcp) %>%
ggplot(aes(x = first_publication_year, y = prop, color = projected_ustc_genre)) +
geom_point(color = "gray", shape = 21, aes(size = tn)) +
geom_point(aes(size = n)) +
geom_smooth(aes(weight = n, fill = projected_ustc_genre), span = 0.3, method='loess',formula=y~x) +
theme_hsci_discrete() +
scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.10)) +
xlab("Year") +
ylab("EEBO-TCP coverage") +
theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
coord_cartesian(ylim=c(0,1)) +
scale_size(breaks = c(100, 500, 1000), range = c(0.1, 8.0))

(compare this with the raw EEBO-TCP vs EEBO coverage as well as the
ECCO module coverage graphs)
