Overview
Out of a total of 132,846 EEBO records, 60,227 (45.34%) are in
EEBO-TCP (but 66 EEBO records have multiple TCP ids).
The 60,327 EEBO-TCP records are divided into phase 1 and phase 2. In
detail, phase 1 contains 25,368 of these records (42.05%) while phase 2
contains 34,959 of these records (57.95%). In terms of EEBO, 25,304
records (19.05%) are in EEBO-TCP phase 1, while 34,931 records (26.29%)
are in EEBO-TCP phase 2.
In terms of the ESTC, out of the 132,846 EEBO records, 6,802 (5.12%)
could not be matched to an ESTC record and will be left out of the
analysis. On the other hand, 7,373 EEBO records (5.55%) were matched to
more than one ESTC record, possibly causing bias.
Out of the 60,327 EEBO-TCP records, 1,143 (1.89%) could not be
matched to an ESTC record and will be left out of the analysis. On the
other hand, 3,269 EEBO-TCP records (5.42%) were matched to more than one
ESTC record, possibly causing bias.
In the analysis, only ESTC records with publication years in the
range [1474,1700) have been included. This results in the exclusion of
4,862 (4.17%) ESTC records that have representation in EEBO, possibly
causing bias. 2,119 (3.41%) of the ESTC records with representation in
EEBO-TCP are removed due to this filtering condition.
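The half-open range above keeps 1474 and drops 1700. As a minimal base R sketch of such a filter, assuming a hypothetical vector of years rather than the actual ESTC data:

```r
# Hypothetical publication years; not taken from the ESTC data.
publication_year <- c(1473, 1474, 1600, 1699, 1700, 1701)

# Half-open interval [1474, 1700): lower bound included, upper bound excluded.
in_range <- publication_year >= 1474 & publication_year < 1700

kept <- publication_year[in_range]
print(kept)
```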
In the end, our working dataset:
- From the viewpoint of EEBO, contains 121,328 (91.33%) out of the
original 132,846 EEBO ids.
- From the viewpoint of EEBO-TCP, contains 57,461 (95.25%) out of the
original 60,327 EEBO-TCP ids.
- Consists of 132,412 ESTC records, of which 111,816 (84.45%) we
estimate to have representation in EEBO, and 60,095 (45.38%) to have
representation in EEBO-TCP.
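The percentages in this overview use different denominators (all EEBO records in some places, all EEBO-TCP records in others), which is easy to lose track of. The following base R sketch recomputes three of the quoted shares from the quoted counts; the variable names and the `share` helper are illustrative assumptions, not part of the analysis code.

```r
# Counts quoted in the overview.
n_eebo        <- 132846  # all EEBO records
n_eebo_in_tcp <- 60227   # EEBO records that are in EEBO-TCP
n_tcp         <- 60327   # all EEBO-TCP records
n_tcp_phase_1 <- 25368   # EEBO-TCP records in phase 1
n_tcp_phase_2 <- 34959   # EEBO-TCP records in phase 2

# Hypothetical helper: format part/whole as a percentage with two decimals.
share <- function(part, whole) sprintf("%.2f%%", 100 * part / whole)

share_in_tcp  <- share(n_eebo_in_tcp, n_eebo)  # denominator: EEBO
share_phase_1 <- share(n_tcp_phase_1, n_tcp)   # denominator: EEBO-TCP
share_phase_2 <- share(n_tcp_phase_2, n_tcp)   # denominator: EEBO-TCP

print(c(share_in_tcp, share_phase_1, share_phase_2))
```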
Publication type analysis
Coverage of different publication types in EEBO
library(ggbeeswarm)
bind_rows(
df %>% mutate(group = "Editions"),
df %>% filter(edition_type!="Singular") %>% group_by(work_id,type,first_publication_year,publication_year) %>%
summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),in_eebo_tcp_phase_1=any(in_eebo_tcp_phase_1),in_eebo_tcp_phase_2=any(in_eebo_tcp_phase_2),.groups="drop") %>%
mutate(group = "Works",edition_type=if_else(publication_year==first_publication_year,"First year work","Later work"))
) %>%
mutate(edition_type=fct_relevel(edition_type,"Singular","First year work","Later work","First year edition","Later edition")) %>%
filter(type %in% c("Book","Pamphlet")) %>%
group_by(publication_year, edition_type, group, type, in_eebo) %>%
tally() %>%
mutate(prop = n / sum(n), tn = sum(n)) %>%
filter(in_eebo) %>%
ggplot(aes(x = type, y = prop, group = edition_type, color = edition_type)) +
geom_quasirandom(aes(size = tn), dodge = 1.0) +
stat_summary(aes(group = edition_type), position = position_dodge(width = 1.0), fun = median, fun.min = median, fun.max = median, geom = "crossbar", width = 0.5, color = "red") +
theme_hsci_discrete() +
xlab(NULL) +
ylab("EEBO coverage") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
scale_size(breaks = c(250, 500, 1500), range = c(0.1, 8.0)) +
theme(legend.justification = c(0, 0), legend.position = c(0.02, 0.02), legend.background = element_blank(), legend.box.just = "bottom", legend.key = element_blank(), legend.box = "horizontal") +
labs(color = "Representation type", size = "Count") +
guides(shape = "none")

In terms of coverage of ESTC’s pre-18th-century material, EEBO does
well: median coverage of books is about 95% at both the edition level
and the work level, with only a slight drop in coverage for later-year
editions (meaning that even for later editions, EEBO often contains at
least one edition from each year, but may not contain all distinct
printings from that year).
For pamphlets, coverage is about 85% across the board, with an
interesting increase for later-year editions (this may be caused either
by reprinted pamphlets having been thought of as important to capture,
or by e.g. temporal artifacts, even though overall coverage does not
appear to improve with time, as seen later).
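The median coverage figures rest on per-year proportions computed with `group_by() %>% tally() %>% mutate(prop = n / sum(n))`. The sketch below, on a hypothetical toy table, shows why this yields a yearly proportion: `tally()` drops the last grouping variable, so `sum(n)` runs over the covered and uncovered rows of the same year. One side effect worth keeping in mind is that a year with nothing in EEBO has no `TRUE` row, so it disappears after `filter(in_eebo)` and does not enter the median.

```r
library(dplyr)

# Toy edition table (hypothetical values, not the ESTC data).
toy <- tibble(
  publication_year = c(1600, 1600, 1600, 1600, 1601, 1601, 1602),
  in_eebo          = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
)

coverage <- toy %>%
  group_by(publication_year, in_eebo) %>%
  tally() %>%                                  # still grouped by publication_year
  mutate(prop = n / sum(n), tn = sum(n)) %>%   # sum(n) is the yearly total
  filter(in_eebo) %>%                          # keep only the covered share
  ungroup()

print(coverage)

# Median over the years that remain; 1602 (zero coverage) is not among them.
median_coverage <- median(coverage$prop)
print(median_coverage)
```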
Coverage of different publication types in EEBO-TCP
library(ggbeeswarm)
bind_rows(
df %>% mutate(group = "Editions"),
df %>% filter(edition_type!="Singular") %>% group_by(work_id,type,first_publication_year,publication_year) %>%
summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),in_eebo_tcp_phase_1=any(in_eebo_tcp_phase_1),in_eebo_tcp_phase_2=any(in_eebo_tcp_phase_2),.groups="drop") %>%
mutate(group = "Works",edition_type=if_else(publication_year==first_publication_year,"First year work","Later work"))
) %>%
mutate(edition_type=fct_relevel(edition_type,"Singular","First year work","Later work","First year edition","Later edition")) %>%
filter(type %in% c("Book","Pamphlet")) %>%
group_by(publication_year, edition_type, group, type, in_eebo_tcp) %>%
tally() %>%
mutate(prop = n / sum(n), tn = sum(n)) %>%
filter(in_eebo_tcp) %>%
ggplot(aes(x = type, y = prop, group = edition_type, color = edition_type)) +
geom_quasirandom(aes(size = tn), dodge = 1.0) +
stat_summary(aes(group = edition_type), position = position_dodge(width = 1.0), fun = median, fun.min = median, fun.max = median, geom = "crossbar", width = 0.5, color = "red") +
theme_hsci_discrete() +
xlab(NULL) +
ylab("EEBO-TCP coverage") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
scale_size(breaks = c(250, 500, 1500), range = c(0.1, 8.0)) +
theme(legend.justification = c(0, 1), legend.position = c(0.02, 0.98), legend.background = element_blank(), legend.box.just = "bottom", legend.key = element_blank(), legend.box = "horizontal") +
labs(color = "Representation type", size = "Count") +
guides(shape = "none")

Coverage of different publication types in EEBO-TCP phase 1
library(ggbeeswarm)
bind_rows(
df %>% mutate(group = "Editions"),
df %>% filter(edition_type!="Singular") %>% group_by(work_id,type,first_publication_year,publication_year) %>%
summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),in_eebo_tcp_phase_1=any(in_eebo_tcp_phase_1),in_eebo_tcp_phase_2=any(in_eebo_tcp_phase_2),.groups="drop") %>%
mutate(group = "Works",edition_type=if_else(publication_year==first_publication_year,"First year work","Later work"))
) %>%
mutate(edition_type=fct_relevel(edition_type,"Singular","First year work","Later work","First year edition","Later edition")) %>%
filter(type %in% c("Book","Pamphlet")) %>%
group_by(publication_year, edition_type, group, type, in_eebo_tcp_phase_1) %>%
tally() %>%
mutate(prop = n / sum(n), tn = sum(n)) %>%
filter(in_eebo_tcp_phase_1) %>%
ggplot(aes(x = type, y = prop, group = edition_type, color = edition_type)) +
geom_quasirandom(aes(size = tn), dodge = 1.0) +
stat_summary(aes(group = edition_type), position = position_dodge(width = 1.0), fun = median, fun.min = median, fun.max = median, geom = "crossbar", width = 0.5, color = "red") +
theme_hsci_discrete() +
xlab(NULL) +
ylab("EEBO-TCP phase 1 coverage") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
scale_size(breaks = c(250, 500, 1500), range = c(0.1, 8.0)) +
theme(legend.justification = c(0, 1), legend.position = c(0.02, 0.98), legend.background = element_blank(), legend.box.just = "bottom", legend.key = element_blank(), legend.box = "horizontal") +
labs(color = "Representation type", size = "Count") +
guides(shape = "none")

Coverage of different publication types in EEBO-TCP phase 2
library(ggbeeswarm)
bind_rows(
df %>% mutate(group = "Editions"),
df %>% filter(edition_type!="Singular") %>% group_by(work_id,type,first_publication_year,publication_year) %>%
summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),in_eebo_tcp_phase_1=any(in_eebo_tcp_phase_1),in_eebo_tcp_phase_2=any(in_eebo_tcp_phase_2),.groups="drop") %>%
mutate(group = "Works",edition_type=if_else(publication_year==first_publication_year,"First year work","Later work"))
) %>%
mutate(edition_type=fct_relevel(edition_type,"Singular","First year work","Later work","First year edition","Later edition")) %>%
filter(type %in% c("Book","Pamphlet")) %>%
group_by(publication_year, edition_type, group, type, in_eebo_tcp_phase_2) %>%
tally() %>%
mutate(prop = n / sum(n), tn = sum(n)) %>%
filter(in_eebo_tcp_phase_2) %>%
ggplot(aes(x = type, y = prop, group = edition_type, color = edition_type)) +
geom_quasirandom(aes(size = tn), dodge = 1.0) +
stat_summary(aes(group = edition_type), position = position_dodge(width = 1.0), fun = median, fun.min = median, fun.max = median, geom = "crossbar", width = 0.5, color = "red") +
theme_hsci_discrete() +
xlab(NULL) +
ylab("EEBO-TCP phase 2 coverage") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
scale_size(breaks = c(250, 500, 1500), range = c(0.1, 8.0)) +
theme(legend.justification = c(0, 1), legend.position = c(0.02, 0.98), legend.background = element_blank(), legend.box.just = "bottom", legend.key = element_blank(), legend.box = "horizontal") +
labs(color = "Representation type", size = "Count") +
guides(shape = "none")

For coverage in EEBO-TCP, a clear pattern emerges where coverage of
singular and first editions is much better than coverage of later
editions. There are also no clear differences between EEBO-TCP phase 1
and phase 2 in this respect. This has an important bearing on all the
following analyses, which in the case of EEBO-TCP should mostly
evaluate coverage on the work level. As a separate observation,
coverage of books and pamphlets also seems quite even. Another
observation is that EEBO-TCP phase 2 contains more singular works than
phase 1. This may indicate a broader collection of “non-core” works,
instead of a focus on first editions of popular (and thus later
reprinted) works.
Edition-level temporal overview
df %>% mutate(g = case_when(
!certain ~ "Uncertain dating",
in_eebo_tcp_phase_1 ~ "In EEBO-TCP phase 1",
in_eebo_tcp_phase_2 ~ "In EEBO-TCP phase 2",
in_eebo ~ "In EEBO",
T ~ "ESTC total",
)) %>%
ggplot(aes(x = publication_year, fill = fct_relevel(g, "Uncertain dating", "ESTC total", "In EEBO","In EEBO-TCP phase 1","In EEBO-TCP phase 2"))) +
geom_bar(width = 1) +
theme_hsci_discrete() +
scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
scale_y_continuous(breaks = seq(0, 10000, by = 500)) +
xlab("Year") +
ylab("ESTC entries") +
theme(legend.justification = c(0, 1), legend.position = c(0.05, 0.95), legend.background = element_blank(), legend.key = element_blank()) +
labs(fill = NULL) +
guides(fill = guide_legend(reverse = TRUE))

In terms of a temporal overview, it is important to note how in this
absolute graph the number of entries grows significantly over time, and
also varies and spikes strongly several times between 1640 and 1700
(with the larger bump between 1640 and 1660 most likely consisting
mainly of the Thomason Tracts).
df %>% filter(certain) %>% mutate(g = case_when(
in_eebo_tcp_phase_1 ~ "In EEBO-TCP phase 1",
in_eebo_tcp_phase_2 ~ "In EEBO-TCP phase 2",
in_eebo ~ "In EEBO",
T ~ "Not in EEBO",
)) %>%
ggplot(aes(x = publication_year, fill = fct_relevel(g, "Not in EEBO", "In EEBO","In EEBO-TCP phase 2","In EEBO-TCP phase 1"))) +
geom_bar(width = 1,position='fill') +
theme_hsci_discrete() +
scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
scale_y_continuous(breaks = seq(0, 1, by = 0.1),labels=scales::percent_format(accuracy=1)) +
xlab("Year") +
ylab("Proportion of ESTC entries") +
theme(legend.position="bottom") +
labs(fill = NULL) +
guides(fill = guide_legend(reverse = TRUE))

In terms of edition-level proportional coverage, EEBO coverage is
quite balanced throughout the period, with just a slight drop at the end
of the 17th century. For EEBO-TCP, edition-level coverage is much more
varied, but as noted, edition-level coverage is not a very meaningful
measure for it, since EEBO-TCP covers later editions much less well
than singular and first editions.
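The bands in these stacked bar charts are mutually exclusive because `case_when()` assigns each row the label of the first condition that matches. A small sketch on hypothetical flags illustrates the consequence: "In EEBO" effectively means "in EEBO but not in EEBO-TCP".

```r
library(dplyr)

# Toy entries with hypothetical flags.
toy <- tibble(
  in_eebo             = c(TRUE, TRUE, TRUE, FALSE),
  in_eebo_tcp_phase_1 = c(TRUE, FALSE, FALSE, FALSE),
  in_eebo_tcp_phase_2 = c(FALSE, TRUE, FALSE, FALSE)
)

labelled <- toy %>%
  mutate(g = case_when(
    in_eebo_tcp_phase_1 ~ "In EEBO-TCP phase 1",  # checked first
    in_eebo_tcp_phase_2 ~ "In EEBO-TCP phase 2",  # checked second
    in_eebo ~ "In EEBO",                          # reached only if in neither phase
    TRUE ~ "Not in EEBO"                          # fallback for everything else
  ))

print(labelled$g)
```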
Work-level temporal overview
df %>%
filter(first_publication_year>1474) %>%
group_by(work_id,first_publication_year) %>%
summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),in_eebo_tcp_phase_1=any(in_eebo_tcp_phase_1),in_eebo_tcp_phase_2=any(in_eebo_tcp_phase_2),certain=any(first_year_publication & certain),.groups="drop") %>%
mutate(g = case_when(
!certain ~ "Uncertain dating",
in_eebo_tcp_phase_1 ~ "In EEBO-TCP phase 1",
in_eebo_tcp_phase_2 ~ "In EEBO-TCP phase 2",
in_eebo ~ "In EEBO",
T ~ "ESTC total",
)) %>%
ggplot(aes(x = first_publication_year, fill = fct_relevel(g, "Uncertain dating", "ESTC total", "In EEBO","In EEBO-TCP phase 2","In EEBO-TCP phase 1"))) +
geom_bar(width = 1) +
theme_hsci_discrete() +
scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
scale_y_continuous(breaks = seq(0, 10000, by = 500)) +
xlab("Year") +
ylab("ESTC entries") +
theme(legend.justification = c(0, 1), legend.position = c(0.05, 0.95), legend.background = element_blank(), legend.key = element_blank()) +
labs(fill = NULL) +
guides(fill = guide_legend(reverse = TRUE))

df %>%
filter(first_publication_year>1474) %>%
group_by(work_id,first_publication_year) %>%
summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),in_eebo_tcp_phase_1=any(in_eebo_tcp_phase_1),in_eebo_tcp_phase_2=any(in_eebo_tcp_phase_2),certain=any(first_year_publication & certain),.groups="drop") %>% mutate(g = case_when(
in_eebo_tcp_phase_1 ~ "In EEBO-TCP phase 1",
in_eebo_tcp_phase_2 ~ "In EEBO-TCP phase 2",
in_eebo ~ "In EEBO",
T ~ "Not in EEBO",
)) %>%
filter(certain) %>%
ggplot(aes(x = first_publication_year, fill = fct_relevel(g, "Not in EEBO", "In EEBO","In EEBO-TCP phase 2","In EEBO-TCP phase 1"))) +
geom_bar(width = 1,position='fill') +
theme_hsci_discrete() +
scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
scale_y_continuous(breaks = seq(0, 1, by = 0.1),labels=scales::percent_format(accuracy=1)) +
xlab("Year of first publication") +
ylab("Proportion of ESTC works") +
theme(legend.position="bottom") +
labs(fill = NULL) +
guides(fill = guide_legend(reverse = TRUE))

In terms of work-level coverage, EEBO-TCP also appears quite nicely
balanced temporally, apart from dips between 1500 and 1530. However,
the total amount of content is also very low for those early years, so
larger variation can be expected. The addition of phase 2 makes
EEBO-TCP coverage somewhat more even compared to phase 1 alone, where
coverage diminishes toward the end of the century.
Document type coverage through time
bind_rows(
df %>% mutate(group = "Editions",type=recode(type,"Book"="Book (edition-level)","Pamphlet"="Pamphlet (edition-level)")),
df %>% group_by(work_id,type,first_publication_year,publication_year) %>%
summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
mutate(group = "Works")
) %>%
mutate(type=fct_relevel(type,"Pamphlet (edition-level)","Book (edition-level)","Pamphlet","Book")) %>%
filter(certain) %>%
filter(!is.na(type),type!="In-between") %>%
group_by(publication_year, type, in_eebo) %>%
tally() %>%
mutate(prop = n / sum(n), tn = sum(n)) %>%
filter(in_eebo) %>%
ggplot(aes(x = publication_year, y = prop, color = type)) +
geom_smooth(aes(weight = n, fill = type), span = 0.3, method='loess',formula=y~x) +
geom_point(color = "gray", shape = 21, aes(size = tn)) +
geom_point(aes(size = n)) +
theme_hsci_discrete() +
scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.10)) +
xlab("Year") +
ylab("EEBO coverage") +
theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
scale_size(breaks = c(500, 2000, 3500), range = c(0.1, 8.0))

Drilling in and separating books and pamphlets from each other, we
can see that EEBO coverage of both is very good, apart from a noticeable
drop in pamphlet coverage in the late 17th century.
bind_rows(
df %>% mutate(group = "Editions",type=recode(type,"Book"="Book (edition-level)","Pamphlet"="Pamphlet (edition-level)")),
df %>% group_by(work_id,type,first_publication_year,publication_year) %>%
summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
mutate(group = "Works")
) %>%
mutate(type=fct_relevel(type,"Pamphlet (edition-level)","Book (edition-level)","Pamphlet","Book")) %>%
filter(certain) %>%
filter(!is.na(type),type!="In-between") %>%
group_by(publication_year, type, in_eebo_tcp) %>%
tally() %>%
mutate(prop = n / sum(n), tn = sum(n)) %>%
filter(in_eebo_tcp) %>%
ggplot(aes(x = publication_year, y = prop, color = type)) +
geom_smooth(aes(weight = n, fill = type), span = 0.3, method='loess',formula=y~x) +
geom_point(color = "gray", shape = 21, aes(size = tn)) +
geom_point(aes(size = n)) +
theme_hsci_discrete() +
scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.10)) +
xlab("Year") +
ylab("EEBO-TCP coverage") +
theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
scale_size(breaks = c(500, 2000, 3500), range = c(0.1, 8.0))

For EEBO-TCP, on the work level, the same drop in coverage for
pamphlets at the end of the 17th century can be seen, but otherwise
coverage is relatively stable through time for both books and
pamphlets, except for a marked uptick between 1640 and 1660 (caused most
likely by more judicious inclusion of the Thomason Tracts). On the work
level, pamphlets are just slightly better covered than books, but on the
edition level, coverage of books is much lower. This can be seen as a
natural consequence of EEBO-TCP favouring first editions. Books
typically have more editions than pamphlets, so excluding later
editions affects edition-level coverage for books much more than it
does for pamphlets.
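To make the edition-level versus work-level contrast concrete, here is a minimal sketch on hypothetical data: one book with four editions and two single-edition pamphlets, where only the first edition of each work is in EEBO-TCP. It simplifies the actual pipeline, which also groups by publication year.

```r
library(dplyr)

# Toy data (hypothetical): only the first edition of each work is in EEBO-TCP.
toy <- tibble(
  work_id     = c("book1", "book1", "book1", "book1", "pamphlet1", "pamphlet2"),
  type        = c("Book", "Book", "Book", "Book", "Pamphlet", "Pamphlet"),
  in_eebo_tcp = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)
)

# Edition level: every row counts.
edition_level <- toy %>%
  group_by(type) %>%
  summarize(coverage = mean(in_eebo_tcp), .groups = "drop")

# Work level: a work counts as covered if any of its editions is covered.
work_level <- toy %>%
  group_by(work_id, type) %>%
  summarize(in_eebo_tcp = any(in_eebo_tcp), .groups = "drop") %>%
  group_by(type) %>%
  summarize(coverage = mean(in_eebo_tcp), .groups = "drop")

print(edition_level)
print(work_level)
```

In this toy case the book drops to 25% coverage at the edition level while staying fully covered at the work level, and the pamphlets are unaffected.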
bind_rows(
df %>% mutate(group = "Editions",type=recode(type,"Book"="Book (edition-level)","Pamphlet"="Pamphlet (edition-level)")),
df %>% group_by(work_id,type,first_publication_year,publication_year) %>%
summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),in_eebo_tcp_phase_1=any(in_eebo_tcp_phase_1),certain=any(first_year_publication & certain),.groups="drop") %>%
mutate(group = "Works")
) %>%
mutate(type=fct_relevel(type,"Pamphlet (edition-level)","Book (edition-level)","Pamphlet","Book")) %>%
filter(certain) %>%
filter(!is.na(type),type!="In-between") %>%
group_by(publication_year, type, in_eebo_tcp_phase_1) %>%
tally() %>%
mutate(prop = n / sum(n), tn = sum(n)) %>%
filter(in_eebo_tcp_phase_1) %>%
ggplot(aes(x = publication_year, y = prop, color = type)) +
geom_smooth(aes(weight = n, fill = type), span = 0.3, method='loess',formula=y~x) +
geom_point(color = "gray", shape = 21, aes(size = tn)) +
geom_point(aes(size = n)) +
theme_hsci_discrete() +
scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.10)) +
xlab("Year") +
ylab("EEBO-TCP phase 1 coverage") +
theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
scale_size(breaks = c(500, 2000, 3500), range = c(0.1, 8.0))

Looking only at phase 1, there is a clear bump in the representation
of pamphlets in the 1560s, which interestingly is corrected for when
phase 2 is also included. In terms of book coverage, there is also a
linear decline in coverage between about 1560 and 1650 (before 1540 the
data is so sparse that reliable conclusions cannot be drawn from
it).
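The trend lines in these coverage plots come from `geom_smooth(method = 'loess', span = 0.3)` with yearly counts as weights, so sparse early years pull on the curve less than well-populated later ones. The sketch below calls `stats::loess` directly on synthetic data to show the same kind of weighted fit; the data and variable names are made up for illustration.

```r
set.seed(1)

# Synthetic yearly coverage (hypothetical), with more entries in later years.
year <- 1540:1700
n    <- round(seq(20, 2000, length.out = length(year)))  # entries per year
noise <- rnorm(length(year), sd = 2 / sqrt(n))           # noisier when n is small
prop <- pmin(1, pmax(0, 0.5 - 0.001 * (year - 1540) + noise))

# Weighted local regression; span = 0.3 uses about 30% of the points per local fit.
fit <- loess(prop ~ year, weights = n, span = 0.3)
smoothed <- predict(fit)  # fitted values at the observed years

print(head(data.frame(year, n, prop, smoothed)))
```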
Topical coverage EEBO-TCP vs EEBO
EEBO work-level genre use frequencies
(subset that is in ESTC to get the work information)
eebo_ustc_genres %>%
inner_join(eebo_core,by=c("eebo_id")) %>%
inner_join(estc_core,by=c("estc_id")) %>%
mutate(ustc_genre=str_trunc(ustc_genre,65),status=case_when(
in_eebo_tcp_phase_1 ~ "EEBO-TCP phase 1",
in_eebo_tcp_phase_2 ~ "EEBO-TCP phase 2",
T ~ "Not in EEBO-TCP")) %>%
group_by(ustc_genre,status) %>%
summarize(n=n_distinct(work_id),.groups="drop") %>%
group_by(ustc_genre) %>%
mutate(tn=sum(n)) %>%
ungroup() %>%
mutate(ustc_genre=fct_reorder(ustc_genre,tn)) %>%
ggplot(aes(x=ustc_genre,y=n,fill=status)) +
geom_col() +
theme_hsci_discrete() +
theme(legend.justification = c(1, 0), legend.box.just = "bottom", legend.position = c(0.98, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
labs(fill=NULL) +
xlab("USTC genre") +
ylab("Number of works") +
scale_y_continuous(labels=scales::number) +
coord_flip()

Open question: are the USTC categories usable? Is this a believable
genre distribution? If it is, the graphs below show interesting
differences and temporal shifts in the coverage of the various
categories, the interpretation of which I leave up to you.
EEBO-TCP work-level genre coverage
eebo_ustc_genres %>%
inner_join(eebo_core,by=c("eebo_id")) %>%
inner_join(estc_core,by=c("estc_id")) %>%
mutate(ustc_genre=str_trunc(ustc_genre,65)) %>%
group_by(work_id,ustc_genre) %>%
summarize(in_eebo_tcp=any(in_eebo_tcp),in_eebo_tcp_phase_1=any(in_eebo_tcp_phase_1),in_eebo_tcp_phase_2=any(in_eebo_tcp_phase_2),.groups="drop") %>%
group_by(ustc_genre) %>%
summarize(n=n(),`EEBO-TCP`=sum(in_eebo_tcp)/n(),`EEBO-TCP phase 1`=sum(in_eebo_tcp_phase_1)/n(),`EEBO-TCP phase 2`=sum(in_eebo_tcp_phase_2)/n(),.groups="drop") %>%
mutate(ustc_genre=fct_reorder(str_c(ustc_genre,' (',n,')'),n)) %>%
pivot_longer(`EEBO-TCP phase 1`:`EEBO-TCP phase 2`,names_to="part", values_to = "prop") %>%
ggplot(aes(x=ustc_genre,y=prop,fill=fct_relevel(part,'EEBO-TCP phase 2'))) +
geom_col(position='stack') +
theme_hsci_discrete() +
scale_y_continuous(labels=scales::percent_format(accuracy=1)) +
xlab("USTC genre") +
ylab("Coverage by work") +
theme(legend.position = "bottom") +
# theme(legend.justification = c(1, 0), legend.box.just = "bottom", legend.position = c(0.98, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
labs(fill=NULL) +
coord_flip()

Here, we can first see which categories have been excluded from
EEBO-TCP: almanacs, academic dissertations, astrology and cosmography,
as well as dictionaries. Apart from this, we also see how poetry and
drama are heavily overemphasized in EEBO-TCP phase 1, whereas phase 2
corrects nicely for these as well as other imbalances. What remains
interesting is the low coverage of dialectics and rhetoric, linguistics
and philology, and classical authors.
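One caveat when reading the stacked phase 1 and phase 2 bars: the work-level flags are computed with `any()`, so a work with editions in both phases counts toward both proportions. Whether such works exist in the actual data is not stated here; the toy example below (hypothetical values) only shows that, if they do, the stacked height can exceed the overall EEBO-TCP coverage.

```r
library(dplyr)

# Toy edition table (hypothetical). Work "w2" has one edition in each phase.
toy <- tibble(
  work_id             = c("w1", "w2", "w2", "w3", "w4"),
  in_eebo_tcp_phase_1 = c(TRUE, TRUE, FALSE, FALSE, FALSE),
  in_eebo_tcp_phase_2 = c(FALSE, FALSE, TRUE, TRUE, FALSE)
) %>%
  mutate(in_eebo_tcp = in_eebo_tcp_phase_1 | in_eebo_tcp_phase_2)

# Collapse editions to works: a flag is TRUE if any edition has it.
by_work <- toy %>%
  group_by(work_id) %>%
  summarize(
    in_eebo_tcp         = any(in_eebo_tcp),
    in_eebo_tcp_phase_1 = any(in_eebo_tcp_phase_1),
    in_eebo_tcp_phase_2 = any(in_eebo_tcp_phase_2),
    .groups = "drop"
  )

coverage <- by_work %>%
  summarize(
    total   = mean(in_eebo_tcp),
    phase_1 = mean(in_eebo_tcp_phase_1),
    phase_2 = mean(in_eebo_tcp_phase_2)
  )

print(coverage)
# Stacked height is phase_1 + phase_2, which here is larger than total.
stacked <- coverage$phase_1 + coverage$phase_2
```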
EEBO-TCP phase genre composition comparison
EEBO work-level language frequencies
eebo_core %>%
inner_join(estc_core,by=c("estc_id")) %>%
mutate(status=case_when(
in_eebo_tcp_phase_1 ~ "EEBO-TCP phase 1",
in_eebo_tcp_phase_2 ~ "EEBO-TCP phase 2",
T ~ "Not in EEBO-TCP")) %>%
group_by(eebo_tls_language,status) %>%
summarize(n=n_distinct(work_id),.groups="drop") %>%
group_by(eebo_tls_language) %>%
mutate(tn=sum(n)) %>%
ungroup() %>%
mutate(eebo_tls_language=fct_reorder(eebo_tls_language,tn)) %>%
ggplot(aes(x=eebo_tls_language,y=n,fill=status)) +
geom_col() +
theme_hsci_discrete() +
theme(legend.justification = c(1, 0), legend.box.just = "bottom", legend.position = c(0.98, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
labs(fill=NULL) +
xlab("Language") +
ylab("Number of works (log10)") +
scale_y_continuous(labels=scales::number,trans="log10") +
coord_flip()

EEBO-TCP work-level language coverage vs EEBO
eebo_core %>%
filter(!is.na(eebo_tls_language)) %>%
inner_join(estc_core,by=c("estc_id")) %>%
group_by(work_id,eebo_tls_language) %>%
summarize(in_eebo_tcp=any(in_eebo_tcp),in_eebo_tcp_phase_1=any(in_eebo_tcp_phase_1),in_eebo_tcp_phase_2=any(in_eebo_tcp_phase_2),.groups="drop") %>%
group_by(eebo_tls_language) %>%
summarize(n=n(),`EEBO-TCP`=sum(in_eebo_tcp)/n(),`EEBO-TCP phase 1`=sum(in_eebo_tcp_phase_1)/n(),`EEBO-TCP phase 2`=sum(in_eebo_tcp_phase_2)/n(),.groups="drop") %>%
mutate(eebo_tls_language=fct_reorder(str_c(eebo_tls_language,' (',n,')'),n)) %>%
pivot_longer(`EEBO-TCP phase 1`:`EEBO-TCP phase 2`,names_to="part", values_to = "prop") %>%
ggplot(aes(x=eebo_tls_language,y=prop,fill=part)) +
geom_col(position='stack') +
theme_hsci_discrete() +
scale_y_continuous(labels=scales::percent_format(accuracy=1)) +
xlab("Language") +
ylab("Coverage by work as compared to EEBO") +
theme(legend.justification = c(1, 0), legend.box.just = "bottom", legend.position = c(0.98, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
labs(fill=NULL) +
coord_flip()

Welsh and Scottish are very well covered. Of the major languages,
Latin in particular is very poorly covered overall, and particularly in
phase 2 (which we already knew from the background info at https://textcreationpartnership.org/tcp-texts/eebo-tcp-early-english-books-online/).
French fares a bit better, but is still not well covered.
EEBO-TCP phase language composition comparison (against EEBO,
English excluded)
EEBO-TCP phase edition type composition comparison
Genre coverage through time (EEBO-TCP against EEBO)

In many major genre categories, such as religious, literature, and
history and chronicles, phase 1 of EEBO-TCP shows clearly diminishing
coverage toward the end of the century. However, adding phase 2 to the
data not only improves coverage significantly overall, but also removes
this bias.
Topical coverage of EEBO vs ESTC through time
Here, we are projecting subject category information from EEBO/ECCO
throughout the whole of the ESTC in order to compare their coverage. For
the 18th century and ECCO, this seemed to work relatively well for all
the 8 categories. For USTC/EEBO, I was comfortable including only the
Religious, History and chronicles, and Economics categories.
Using projected ECCO modules
df %>%
filter(first_publication_year>1474) %>%
group_by(work_id,first_publication_year) %>%
summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
filter(certain) %>%
left_join(combined_projected_ecco_modules,by=c("work_id")) %>%
replace_na(list(projected_ecco_module="Other/Unknown")) %>%
mutate(projected_ecco_module=fct_relevel(projected_ecco_module,"Other/Unknown")) %>%
group_by(first_publication_year, projected_ecco_module, in_eebo) %>%
tally() %>%
mutate(prop = n / sum(n), tn = sum(n)) %>%
filter(in_eebo) %>%
ggplot(aes(x = first_publication_year, y = prop, color = projected_ecco_module)) +
geom_point(color = "gray", shape = 21, aes(size = tn)) +
geom_point(aes(size = n)) +
geom_smooth(aes(weight = n, fill = projected_ecco_module), span = 0.3, method='loess',formula=y~x) +
theme_hsci_discrete() +
scale_x_continuous(breaks = seq(1000, 2000, by = 40)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.10)) +
xlab("Year") +
ylab("EEBO coverage") +
theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
coord_cartesian(ylim=c(0.5,1)) +
scale_size(breaks = c(100, 500, 1000), range = c(0.1, 8.0)) +
guides(color="none",fill="none") +
facet_wrap(~projected_ecco_module,ncol=3)

Using projected USTC Religious/History and chronicles/Economics
df %>%
filter(first_publication_year>1474) %>%
group_by(work_id,first_publication_year) %>%
summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
filter(certain) %>%
left_join(estc_projected_ustc_genres %>%
filter(max_prop>=0.7,projected_ustc_genre %in% c("Religious","History and chronicles","Economics")),by=c("work_id")) %>%
replace_na(list(projected_ustc_genre="Other/Unknown")) %>%
mutate(projected_ustc_genre=fct_relevel(projected_ustc_genre,"Other/Unknown")) %>%
group_by(first_publication_year, projected_ustc_genre, in_eebo) %>%
tally() %>%
mutate(prop = n / sum(n), tn = sum(n)) %>%
filter(in_eebo) %>%
ggplot(aes(x = first_publication_year, y = prop, color = projected_ustc_genre)) +
geom_point(color = "gray", shape = 21, aes(size = tn)) +
geom_point(aes(size = n)) +
geom_smooth(aes(weight = n, fill = projected_ustc_genre), span = 0.3, method='loess',formula=y~x) +
theme_hsci_discrete() +
scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.10)) +
xlab("Year") +
ylab("EEBO coverage") +
theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
coord_cartesian(ylim=c(0,1)) +
scale_size(breaks = c(100, 500, 1000), range = c(0.1, 8.0))

Topical coverage of EEBO-TCP vs ESTC through time
Using projected ECCO modules
df %>%
filter(first_publication_year>1474) %>%
group_by(work_id,first_publication_year) %>%
summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
filter(certain) %>%
left_join(combined_projected_ecco_modules,by=c("work_id")) %>%
replace_na(list(projected_ecco_module="Other/Unknown")) %>%
mutate(projected_ecco_module=fct_relevel(projected_ecco_module,"Other/Unknown")) %>%
group_by(first_publication_year, projected_ecco_module, in_eebo_tcp) %>%
tally() %>%
mutate(prop = n / sum(n), tn = sum(n)) %>%
filter(in_eebo_tcp) %>%
ggplot(aes(x = first_publication_year, y = prop, color = projected_ecco_module)) +
geom_point(color = "gray", shape = 21, aes(size = tn)) +
geom_point(aes(size = n)) +
geom_smooth(aes(weight = n, fill = projected_ecco_module), span = 0.3, method='loess',formula=y~x) +
theme_hsci_discrete() +
scale_x_continuous(breaks = seq(1000, 2000, by = 40)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.10)) +
xlab("Year") +
ylab("EEBO-TCP coverage") +
theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
coord_cartesian(ylim=c(0,1)) +
scale_size(breaks = c(100, 500, 1000), range = c(0.1, 8.0)) +
guides(color="none",fill="none") +
facet_wrap(~projected_ecco_module,ncol=3)

Using projected USTC Religious/History and chronicles/Economics
df %>%
filter(first_publication_year>1474) %>%
group_by(work_id,first_publication_year) %>%
summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
filter(certain) %>%
left_join(combined_projected_ustc_genres %>%
filter(projected_ustc_genre %in% c("Religious","History and chronicles","Economics")),by=c("work_id")) %>%
replace_na(list(projected_ustc_genre="Other/Unknown")) %>%
mutate(projected_ustc_genre=fct_relevel(projected_ustc_genre,"Other/Unknown")) %>%
group_by(first_publication_year, projected_ustc_genre, in_eebo_tcp) %>%
tally() %>%
mutate(prop = n / sum(n), tn = sum(n)) %>%
filter(in_eebo_tcp) %>%
ggplot(aes(x = first_publication_year, y = prop, color = projected_ustc_genre)) +
geom_point(color = "gray", shape = 21, aes(size = tn)) +
geom_point(aes(size = n)) +
geom_smooth(aes(weight = n, fill = projected_ustc_genre), span = 0.3, method='loess',formula=y~x) +
theme_hsci_discrete() +
scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.10)) +
xlab("Year") +
ylab("EEBO-TCP coverage") +
theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
coord_cartesian(ylim=c(0,1)) +
scale_size(breaks = c(100, 500, 1000), range = c(0.1, 8.0))

(compare this with the raw EEBO-TCP vs EEBO coverage as well as the
ECCO module coverage graphs)
---
title: "EEBO/ESTC analysis"
output: 
  html_notebook: 
    code_folding: hide
    toc: yes
---

```{r setup,echo=F}
knitr::opts_knit$set(root.dir = here::here())
```

```{r,include=F}
source(here::here("code/load_and_prepare_data.R"), local = knitr::knit_global())
```

```{r,include=F}
library(tidyverse)
pak::pkg_install("hsci-r/gghsci")
library(gghsci)
```

```{r,include=F}
p <- function(number) {
  return(format(number, scientific = FALSE, big.mark = ","))
}
pp <- function(percentage,accuracy=0.01) {
  return(scales::percent(percentage, accuracy = accuracy))
}
```

```{r,include=F}
library(assertthat)

n_eebo_ids <- eebo_core %>%
  distinct(eebo_id) %>%
  nrow()
n_eebo_tcp_ids <- eebo_tcp_core %>%
  distinct(eebo_tcp_id) %>%
  nrow()
n_eebo_tcp_ids_phase_1 <- eebo_tcp_core %>%
  filter(eebo_tcp_phase=="EEBO-TCP phase 1") %>%
  distinct(eebo_tcp_id) %>%
  nrow()
n_eebo_tcp_ids_phase_2 <- eebo_tcp_core %>%
  filter(eebo_tcp_phase=="EEBO-TCP phase 2") %>%
  distinct(eebo_tcp_id) %>%
  nrow()
  
n_eebo_ids_in_eebo_tcp <- eebo_core %>%
  filter(!is.na(eebo_tcp_id)) %>%
  distinct(eebo_id) %>%
  nrow()
n_eebo_ids_in_eebo_tcp_phase_1 <- eebo_tcp_core %>% 
  filter(eebo_tcp_phase=="EEBO-TCP phase 1") %>%
  distinct(eebo_id) %>%
  nrow()
n_eebo_ids_in_eebo_tcp_phase_2 <- eebo_tcp_core %>% 
  filter(eebo_tcp_phase=="EEBO-TCP phase 2") %>%
  distinct(eebo_id) %>%
  nrow()

assert_that(eebo_core %>% filter(!is.na(eebo_tcp_id)) %>% distinct(eebo_id, eebo_tcp_id) %>% count(eebo_tcp_id) %>% filter(n > 1) %>% nrow() == 0)

n_eebo_ids_multimapped_to_eebo_tcp <- eebo_core %>%
  filter(!is.na(eebo_tcp_id)) %>%
  distinct(eebo_id, eebo_tcp_id) %>%
  count(eebo_id) %>%
  filter(n > 1) %>%
  nrow()

n_eebo_ids_not_in_estc <- eebo_core %>%
  filter(is.na(estc_id)) %>%
  distinct(eebo_id) %>%
  nrow()
n_eebo_tcp_ids_not_in_estc <- eebo_tcp_core %>%
  filter(is.na(estc_id)) %>%
  distinct(eebo_tcp_id) %>%
  nrow()

n_eebo_ids_multimapped_to_estc <- eebo_core %>%
  filter(!is.na(estc_id)) %>%
  distinct(eebo_id, estc_id) %>%
  count(eebo_id) %>%
  filter(n > 1) %>%
  nrow()
n_eebo_tcp_ids_multimapped_to_estc <- eebo_tcp_core %>%
  filter(!is.na(estc_id)) %>%
  distinct(eebo_tcp_id, estc_id) %>%
  count(eebo_tcp_id) %>%
  filter(n > 1) %>%
  nrow()

n_estc_ids_with_eebo_ids <- estc_core %>%
  filter(in_eebo) %>%
  nrow()
n_estc_ids_in_df_with_eebo_ids <- df %>%
  filter(in_eebo) %>%
  nrow()
n_estc_ids_with_eebo_tcp_ids <- estc_core %>%
  filter(in_eebo_tcp) %>%
  nrow()
n_estc_ids_in_df_with_eebo_tcp_ids <- df %>%
  filter(in_eebo_tcp) %>%
  nrow()

n_estc_ids_in_df <- df %>% nrow()
n_eebo_ids_in_df <- df %>%
  inner_join(eebo_core, by = c("estc_id")) %>%
  distinct(eebo_id) %>%
  nrow()
n_eebo_tcp_ids_in_df <- df %>%
  inner_join(eebo_tcp_core, by = c("estc_id")) %>%
  distinct(eebo_tcp_id) %>%
  nrow()
```

# Overview

Out of a total of `r p(n_eebo_ids)` EEBO records, `r p(n_eebo_ids_in_eebo_tcp)` (`r pp(n_eebo_ids_in_eebo_tcp/n_eebo_ids)`) are in EEBO-TCP (but `r p(n_eebo_ids_multimapped_to_eebo_tcp)` EEBO records have multiple TCP ids).

The `r p(n_eebo_tcp_ids)` EEBO-TCP records are divided into phase 1 and phase 2. In detail, phase 1 contains `r p(n_eebo_tcp_ids_phase_1)` of these records (`r pp(n_eebo_tcp_ids_phase_1/n_eebo_tcp_ids)`) while phase 2 contains `r p(n_eebo_tcp_ids_phase_2)` of these records (`r pp(n_eebo_tcp_ids_phase_2/n_eebo_tcp_ids)`). In terms of EEBO, `r p(n_eebo_ids_in_eebo_tcp_phase_1)` records (`r pp(n_eebo_ids_in_eebo_tcp_phase_1/n_eebo_ids)`) are in EEBO-TCP phase 1, while `r p(n_eebo_ids_in_eebo_tcp_phase_2)` records (`r pp(n_eebo_ids_in_eebo_tcp_phase_2/n_eebo_ids)`) are in EEBO-TCP phase 2.

In terms of the ESTC, out of the `r p(n_eebo_ids)` EEBO records, `r p(n_eebo_ids_not_in_estc)`  (`r pp(n_eebo_ids_not_in_estc/n_eebo_ids)`) could not be matched to an ESTC record and will be left out of the analysis. On the other hand, `r p(n_eebo_ids_multimapped_to_estc)` EEBO records (`r pp(n_eebo_ids_multimapped_to_estc/n_eebo_ids)`) were matched to more than one ESTC record, possibly causing bias.

Out of the `r p(n_eebo_tcp_ids)` EEBO-TCP records, `r p(n_eebo_tcp_ids_not_in_estc)` (`r pp(n_eebo_tcp_ids_not_in_estc/n_eebo_tcp_ids)`) could not be matched to an ESTC record and will be left out of the analysis. On the other hand, `r p(n_eebo_tcp_ids_multimapped_to_estc)` EEBO-TCP records (`r pp(n_eebo_tcp_ids_multimapped_to_estc/n_eebo_tcp_ids)`) were matched to more than one ESTC record, possibly causing bias.

In the analysis, only ESTC records with publication years in the range [1474,1700) have been included. This results in the exclusion of `r p(n_estc_ids_with_eebo_ids-n_estc_ids_in_df_with_eebo_ids)` (`r pp((n_estc_ids_with_eebo_ids-n_estc_ids_in_df_with_eebo_ids)/n_estc_ids_with_eebo_ids)`) ESTC records that have representation in EEBO, possibly causing bias. `r p(n_estc_ids_with_eebo_tcp_ids-n_estc_ids_in_df_with_eebo_tcp_ids)` (`r pp((n_estc_ids_with_eebo_tcp_ids-n_estc_ids_in_df_with_eebo_tcp_ids)/n_estc_ids_with_eebo_tcp_ids)`) of the ESTC records with representation in EEBO-TCP are removed due to this filtering condition.

In the end, our working dataset:

* From the viewpoint of EEBO, contains `r p(n_eebo_ids_in_df)` (`r pp(n_eebo_ids_in_df/n_eebo_ids)`) out of the original `r p(n_eebo_ids)` EEBO ids. 
* From the viewpoint of EEBO-TCP, contains `r p(n_eebo_tcp_ids_in_df)` (`r pp(n_eebo_tcp_ids_in_df/n_eebo_tcp_ids)`) out of the original `r p(n_eebo_tcp_ids)` EEBO-TCP ids. 
* Consists of `r p(n_estc_ids_in_df)` ESTC records, of which `r p(n_estc_ids_in_df_with_eebo_ids)` (`r pp(n_estc_ids_in_df_with_eebo_ids/n_estc_ids_in_df)`) we estimate to have representation in EEBO, and `r p(n_estc_ids_in_df_with_eebo_tcp_ids)` (`r pp(n_estc_ids_in_df_with_eebo_tcp_ids/n_estc_ids_in_df)`) to have representation in EEBO-TCP.
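
The unmatched and multi-mapped counts reported above all follow the same pattern: reduce the mapping table to distinct id pairs, count rows per source id, and keep the ids that occur more than once. The following is a minimal sketch of that pattern on a hypothetical toy table. `toy_core` and its values are invented for illustration; only the column names mirror `eebo_core`.

```r
library(dplyr)
library(tibble)

# Hypothetical toy mapping table; all values are invented for illustration.
toy_core <- tribble(
  ~eebo_id, ~estc_id,
  "e1",     "S1",
  "e1",     "S1",           # duplicate row, must not count as a second match
  "e2",     "S2",
  "e2",     "S3",           # e2 maps to two different ESTC records
  "e3",     NA_character_   # e3 has no ESTC match
)

# Ids matched to more than one distinct ESTC record
n_multimapped <- toy_core %>%
  filter(!is.na(estc_id)) %>%
  distinct(eebo_id, estc_id) %>%
  count(eebo_id) %>%
  filter(n > 1) %>%
  nrow()

# Ids that have a row without any ESTC match
n_unmatched <- toy_core %>%
  filter(is.na(estc_id)) %>%
  distinct(eebo_id) %>%
  nrow()
```

The `distinct()` step is what keeps repeated rows for the same pair (as for `e1` here) from being mistaken for a multi-mapping.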

# Publication type analysis

## Coverage of different publication types in EEBO

```{r,fig.width=7}
library(ggbeeswarm)
bind_rows(
  df %>% mutate(group = "Editions"),
  df %>% filter(edition_type!="Singular") %>% group_by(work_id,type,first_publication_year,publication_year) %>%
    summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),in_eebo_tcp_phase_1=any(in_eebo_tcp_phase_1),in_eebo_tcp_phase_2=any(in_eebo_tcp_phase_2),.groups="drop") %>%
    mutate(group = "Works",edition_type=if_else(publication_year==first_publication_year,"First year work","Later work"))
) %>%
  mutate(edition_type=fct_relevel(edition_type,"Singular","First year work","Later work","First year edition","Later edition")) %>%
  filter(type %in% c("Book","Pamphlet")) %>%
  group_by(publication_year, edition_type, group, type, in_eebo) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo) %>%
  ggplot(aes(x = type, y = prop, group = edition_type, color = edition_type)) +
  geom_quasirandom(aes(size = tn), dodge = 1.0) +
  stat_summary(aes(group = edition_type), position = position_dodge(width = 1.0), fun = median, fun.min = median, fun.max = median, geom = "crossbar", width = 0.5, color = "red") +
  theme_hsci_discrete() +
  xlab(NULL) +
  ylab("EEBO coverage") +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
  scale_size(breaks = c(250, 500, 1500), range = c(0.1, 8.0)) +
  theme(legend.justification = c(0, 0), legend.position = c(0.02, 0.02), legend.background = element_blank(), legend.box.just = "bottom", legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = "Representation type", size = "Count") +
  guides(shape = "none")
```

In terms of coverage of ESTC's pre-18th-century material, EEBO is quite good, with a median coverage of about 95% of books at both the edition and the work level, and only a slight drop in coverage for later-year editions (meaning that even for later editions, EEBO often contains at least one edition from each year, but may not contain all distinct printings from that year).

For pamphlets, coverage is about 85% across the board, with an interesting increase for later-year editions (this may be caused either by reprinted pamphlets having been thought of as important to capture, or by e.g. temporal artifacts, even though it does not appear that overall coverage improves with time, as seen later).
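
Each point in the plot is the share of one year's records that are in EEBO, and the red crossbar is the median of those yearly shares. The sketch below reproduces that calculation on a hypothetical toy table (`toy` and its values are invented for illustration), leaving out the `edition_type`, `group` and `type` grouping levels for brevity.

```r
library(dplyr)
library(tibble)

# Hypothetical toy records; values are invented for illustration.
toy <- tribble(
  ~publication_year, ~in_eebo,
  1600,              TRUE,
  1600,              TRUE,
  1600,              FALSE,
  1600,              TRUE,
  1601,              TRUE,
  1601,              FALSE,
  1602,              TRUE
)

yearly <- toy %>%
  group_by(publication_year, in_eebo) %>%
  tally() %>%                                  # peels off the last grouping level (in_eebo)
  mutate(prop = n / sum(n), tn = sum(n)) %>%   # so sum(n) is the per-year total
  filter(in_eebo) %>%
  ungroup()

# Median of the yearly coverage shares, as drawn by the crossbar
median_coverage <- median(yearly$prop)
```

One side effect of this pattern is worth keeping in mind: a year in which nothing is in EEBO has no `in_eebo == TRUE` row, so it drops out after the final `filter()` rather than appearing as 0%.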

## Coverage of different publication types in EEBO-TCP

```{r,fig.width=7}
library(ggbeeswarm)
bind_rows(
  df %>% mutate(group = "Editions"),
  df %>% filter(edition_type!="Singular") %>% group_by(work_id,type,first_publication_year,publication_year) %>%
    summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),in_eebo_tcp_phase_1=any(in_eebo_tcp_phase_1),in_eebo_tcp_phase_2=any(in_eebo_tcp_phase_2),.groups="drop") %>%
    mutate(group = "Works",edition_type=if_else(publication_year==first_publication_year,"First year work","Later work"))
) %>%
  mutate(edition_type=fct_relevel(edition_type,"Singular","First year work","Later work","First year edition","Later edition")) %>%
  filter(type %in% c("Book","Pamphlet")) %>%
  group_by(publication_year, edition_type, group, type, in_eebo_tcp) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo_tcp) %>%
  ggplot(aes(x = type, y = prop, group = edition_type, color = edition_type)) +
  geom_quasirandom(aes(size = tn), dodge = 1.0) +
  stat_summary(aes(group = edition_type), position = position_dodge(width = 1.0), fun = median, fun.min = median, fun.max = median, geom = "crossbar", width = 0.5, color = "red") +
  theme_hsci_discrete() +
  xlab(NULL) +
  ylab("EEBO-TCP coverage") +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
  scale_size(breaks = c(250, 500, 1500), range = c(0.1, 8.0)) +
  theme(legend.justification = c(0, 1), legend.position = c(0.02, 0.98), legend.background = element_blank(), legend.box.just = "bottom", legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = "Representation type", size = "Count") +
  guides(shape = "none")
```

### Coverage of different publication types in EEBO-TCP phase 1

```{r,fig.width=7}
library(ggbeeswarm)
bind_rows(
  df %>% mutate(group = "Editions"),
  df %>% filter(edition_type!="Singular") %>% group_by(work_id,type,first_publication_year,publication_year) %>%
    summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),in_eebo_tcp_phase_1=any(in_eebo_tcp_phase_1),in_eebo_tcp_phase_2=any(in_eebo_tcp_phase_2),.groups="drop") %>%
    mutate(group = "Works",edition_type=if_else(publication_year==first_publication_year,"First year work","Later work"))
) %>%
  mutate(edition_type=fct_relevel(edition_type,"Singular","First year work","Later work","First year edition","Later edition")) %>%
  filter(type %in% c("Book","Pamphlet")) %>%
  group_by(publication_year, edition_type, group, type, in_eebo_tcp_phase_1) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo_tcp_phase_1) %>%
  ggplot(aes(x = type, y = prop, group = edition_type, color = edition_type)) +
  geom_quasirandom(aes(size = tn), dodge = 1.0) +
  stat_summary(aes(group = edition_type), position = position_dodge(width = 1.0), fun = median, fun.min = median, fun.max = median, geom = "crossbar", width = 0.5, color = "red") +
  theme_hsci_discrete() +
  xlab(NULL) +
  ylab("EEBO-TCP phase 1 coverage") +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
  scale_size(breaks = c(250, 500, 1500), range = c(0.1, 8.0)) +
  theme(legend.justification = c(0, 1), legend.position = c(0.02, 0.98), legend.background = element_blank(), legend.box.just = "bottom", legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = "Representation type", size = "Count") +
  guides(shape = "none")
```

### Coverage of different publication types in EEBO-TCP phase 2

```{r,fig.width=7}
library(ggbeeswarm)
bind_rows(
  df %>% mutate(group = "Editions"),
  df %>% filter(edition_type!="Singular") %>% group_by(work_id,type,first_publication_year,publication_year) %>%
    summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),in_eebo_tcp_phase_1=any(in_eebo_tcp_phase_1),in_eebo_tcp_phase_2=any(in_eebo_tcp_phase_2),.groups="drop") %>%
    mutate(group = "Works",edition_type=if_else(publication_year==first_publication_year,"First year work","Later work"))
) %>%
  mutate(edition_type=fct_relevel(edition_type,"Singular","First year work","Later work","First year edition","Later edition")) %>%
  filter(type %in% c("Book","Pamphlet")) %>%
  group_by(publication_year, edition_type, group, type, in_eebo_tcp_phase_2) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo_tcp_phase_2) %>%
  ggplot(aes(x = type, y = prop, group = edition_type, color = edition_type)) +
  geom_quasirandom(aes(size = tn), dodge = 1.0) +
  stat_summary(aes(group = edition_type), position = position_dodge(width = 1.0), fun = median, fun.min = median, fun.max = median, geom = "crossbar", width = 0.5, color = "red") +
  theme_hsci_discrete() +
  xlab(NULL) +
  ylab("EEBO-TCP phase 2 coverage") +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
  scale_size(breaks = c(250, 500, 1500), range = c(0.1, 8.0)) +
  theme(legend.justification = c(0, 1), legend.position = c(0.02, 0.98), legend.background = element_blank(), legend.box.just = "bottom", legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = "Representation type", size = "Count") +
  guides(shape = "none")
```
For coverage in EEBO-TCP, a clear pattern emerges: coverage of singular and first editions is much better than coverage of later editions. EEBO-TCP phase 1 and phase 2 show no clear differences in this respect. This has an important bearing on all following analyses, which in the case of EEBO-TCP should mostly evaluate coverage on the work level. As a separate observation, coverage of books and pamphlets also seems quite even. Another observation is that EEBO-TCP phase 2 contains more singular works than phase 1. This may indicate a broader collection of "non-core" works, instead of a focus on first editions of popular (and thus later reprinted) works.

# Edition-level temporal overview

```{r,fig.width = 6, fig.height = 3}
df %>%  mutate(g = case_when(
  !certain ~ "Uncertain dating",
  in_eebo_tcp_phase_1  ~ "In EEBO-TCP phase 1",
  in_eebo_tcp_phase_2  ~ "In EEBO-TCP phase 2",
  in_eebo ~ "In EEBO",
  T ~ "ESTC total",
)) %>%
  ggplot(aes(x = publication_year, fill = fct_relevel(g, "Uncertain dating", "ESTC total", "In EEBO","In EEBO-TCP phase 1","In EEBO-TCP phase 2"))) +
  geom_bar(width = 1) +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(breaks = seq(0, 10000, by = 500)) +
  xlab("Year") +
  ylab("ESTC entries") +
  theme(legend.justification = c(0, 1), legend.position = c(0.05, 0.95), legend.background = element_blank(), legend.key = element_blank()) +
  labs(fill = NULL) +
  guides(fill = guide_legend(reverse = TRUE))
```
In terms of a temporal overview, it is important to note that in this absolute graph the number of entries grows significantly over time and shows large variations and several spikes between 1640 and 1700 (with the larger bump between 1640 and 1660 most likely consisting mainly of the Thomason Tracts).

```{r,fig.width = 6, fig.height = 3}
df %>% filter(certain) %>% mutate(g = case_when(
  in_eebo_tcp_phase_1  ~ "In EEBO-TCP phase 1",
  in_eebo_tcp_phase_2  ~ "In EEBO-TCP phase 2",
  in_eebo ~ "In EEBO",
  T ~ "Not in EEBO",
)) %>%
  ggplot(aes(x = publication_year, fill = fct_relevel(g, "Not in EEBO", "In EEBO","In EEBO-TCP phase 2","In EEBO-TCP phase 1"))) +
  geom_bar(width = 1,position='fill') +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(breaks = seq(0, 1, by = 0.1),labels=scales::percent_format(accuracy=1)) +
  xlab("Year") +
  ylab("Proportion of ESTC entries") +
  theme(legend.position="bottom") +
  labs(fill = NULL) +
  guides(fill = guide_legend(reverse = TRUE))
```
In terms of edition-level proportional coverage, EEBO coverage is quite balanced throughout the period, with just a slight drop at the end of the 17th century. For EEBO-TCP, edition-level coverage is much more varied, but as noted, it actually does not make that much sense to look at edition-level coverage with respect to it.

# Work-level temporal overview

```{r,fig.width=6,fig.height=3}
df %>% 
  filter(first_publication_year>1474) %>%
  group_by(work_id,first_publication_year) %>%
  summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),in_eebo_tcp_phase_1=any(in_eebo_tcp_phase_1),in_eebo_tcp_phase_2=any(in_eebo_tcp_phase_2),certain=any(first_year_publication & certain),.groups="drop") %>% 
  mutate(g = case_when(
    !certain ~ "Uncertain dating",
    in_eebo_tcp_phase_1  ~ "In EEBO-TCP phase 1",
    in_eebo_tcp_phase_2  ~ "In EEBO-TCP phase 2",
    in_eebo ~ "In EEBO",
    T ~ "ESTC total",
  )) %>% 
  ggplot(aes(x = first_publication_year, fill = fct_relevel(g, "Uncertain dating", "ESTC total", "In EEBO","In EEBO-TCP phase 2","In EEBO-TCP phase 1"))) +
  geom_bar(width = 1) +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(breaks = seq(0, 10000, by = 500)) +
  xlab("Year") +
  ylab("ESTC entries") +
  theme(legend.justification = c(0, 1), legend.position = c(0.05, 0.95), legend.background = element_blank(), legend.key = element_blank()) +
  labs(fill = NULL) +
  guides(fill = guide_legend(reverse = TRUE))
```


```{r,fig.width=6,fig.height=3}
df %>% 
  filter(first_publication_year>1474) %>%
  group_by(work_id,first_publication_year) %>%
  summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),in_eebo_tcp_phase_1=any(in_eebo_tcp_phase_1),in_eebo_tcp_phase_2=any(in_eebo_tcp_phase_2),certain=any(first_year_publication & certain),.groups="drop") %>% mutate(g = case_when(
  in_eebo_tcp_phase_1  ~ "In EEBO-TCP phase 1",
  in_eebo_tcp_phase_2  ~ "In EEBO-TCP phase 2",
  in_eebo ~ "In EEBO",
  T ~ "Not in EEBO",
)) %>% 
  filter(certain) %>%
  ggplot(aes(x = first_publication_year, fill = fct_relevel(g, "Not in EEBO", "In EEBO","In EEBO-TCP phase 2","In EEBO-TCP phase 1"))) +
  geom_bar(width = 1,position='fill') +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(breaks = seq(0, 1, by = 0.1),labels=scales::percent_format(accuracy=1)) +
  xlab("Year of first publication") +
  ylab("Proportion of ESTC works") +
  theme(legend.position="bottom") +
  labs(fill = NULL) +
  guides(fill = guide_legend(reverse = TRUE))  
```
In terms of work-level coverage, EEBO-TCP also appears quite nicely balanced temporally, apart from dips between 1500 and 1530. However, the total amount of content is also very low for those early years, so larger variation can be expected. The addition of phase 2 improves the evenness of EEBO-TCP coverage a bit compared to phase 1 alone, where coverage diminishes toward the end of the century.

# Document type coverage through time

```{r,fig.width=7,fig.height=4}
bind_rows(
  df %>% mutate(group = "Editions",type=recode(type,"Book"="Book (edition-level)","Pamphlet"="Pamphlet (edition-level)")),
  df %>% group_by(work_id,type,first_publication_year,publication_year) %>%
    summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
    mutate(group = "Works")
) %>%
  mutate(type=fct_relevel(type,"Pamphlet (edition-level)","Book (edition-level)","Pamphlet","Book")) %>%
  filter(certain) %>% 
  filter(!is.na(type),type!="In-between") %>% 
  group_by(publication_year, type, in_eebo) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo) %>%
  ggplot(aes(x = publication_year, y = prop, color = type)) +
  geom_smooth(aes(weight = n, fill = type), span = 0.3, method='loess',formula=y~x) +
  geom_point(color = "gray", shape = 21, aes(size = tn)) +
  geom_point(aes(size = n)) +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.10)) +
  xlab("Year") +
  ylab("EEBO coverage") +
  theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
  scale_size(breaks = c(500, 2000, 3500), range = c(0.1, 8.0))
```
Drilling in and separating books and pamphlets from each other, we can see that EEBO coverage of both is very good, apart from a noticeable drop in pamphlet coverage in the late 17th century.

```{r,fig.width=7,fig.height=4}
bind_rows(
  df %>% mutate(group = "Editions",type=recode(type,"Book"="Book (edition-level)","Pamphlet"="Pamphlet (edition-level)")),
  df %>% group_by(work_id,type,first_publication_year,publication_year) %>%
    summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
    mutate(group = "Works")
) %>%
  mutate(type=fct_relevel(type,"Pamphlet (edition-level)","Book (edition-level)","Pamphlet","Book")) %>%
  filter(certain) %>% 
  filter(!is.na(type),type!="In-between") %>% 
  group_by(publication_year, type, in_eebo_tcp) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo_tcp) %>%
  ggplot(aes(x = publication_year, y = prop, color = type)) +
  geom_smooth(aes(weight = n, fill = type), span = 0.3, method='loess',formula=y~x) +
  geom_point(color = "gray", shape = 21, aes(size = tn)) +
  geom_point(aes(size = n)) +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.10)) +
  xlab("Year") +
  ylab("EEBO-TCP coverage") +
  theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
  scale_size(breaks = c(500, 2000, 3500), range = c(0.1, 8.0))
```
For EEBO-TCP, on the work level, the same drop in coverage for pamphlets at the end of the 17th century can be seen, but otherwise coverage is relatively stable through time for both books and pamphlets, except for a marked uptick between 1640 and 1660 (most likely caused by more judicious inclusion of the Thomason Tracts). On the work level, pamphlets are just slightly better covered than books, but on the edition level, coverage of books is much lower. This can be seen as the natural consequence of EEBO-TCP favouring first editions. Books typically have more editions than pamphlets, so excluding later editions affects edition-level coverage for books much more than it does for pamphlets.
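
The mechanism can be illustrated with a small sketch. It assumes a hypothetical toy table in which every work has exactly one transcribed edition; `toy_editions` and its values are invented for illustration and do not come from the dataset.

```r
library(dplyr)
library(tibble)

# Invented toy data: two books with three editions each, and three pamphlets
# with one or two editions. Exactly one edition per work is transcribed.
toy_editions <- tribble(
  ~work_id, ~type,      ~in_eebo_tcp,
  "b1",     "Book",     TRUE,
  "b1",     "Book",     FALSE,
  "b1",     "Book",     FALSE,
  "b2",     "Book",     TRUE,
  "b2",     "Book",     FALSE,
  "b2",     "Book",     FALSE,
  "p1",     "Pamphlet", TRUE,
  "p2",     "Pamphlet", TRUE,
  "p3",     "Pamphlet", TRUE,
  "p3",     "Pamphlet", FALSE
)

# Edition level: share of all editions that are transcribed
edition_level <- toy_editions %>%
  group_by(type) %>%
  summarize(edition_coverage = mean(in_eebo_tcp), .groups = "drop")

# Work level: a work counts as covered if any of its editions is transcribed
work_level <- toy_editions %>%
  group_by(type, work_id) %>%
  summarize(in_eebo_tcp = any(in_eebo_tcp), .groups = "drop") %>%
  group_by(type) %>%
  summarize(work_coverage = mean(in_eebo_tcp), .groups = "drop")

coverage <- inner_join(edition_level, work_level, by = "type")
```

In this toy case both types are fully covered on the work level, while edition-level coverage is one third for books and three quarters for pamphlets, purely because the books have more editions per work.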

```{r,fig.width=7,fig.height=4}
bind_rows(
  df %>% mutate(group = "Editions",type=recode(type,"Book"="Book (edition-level)","Pamphlet"="Pamphlet (edition-level)")),
  df %>% group_by(work_id,type,first_publication_year,publication_year) %>%
    summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),in_eebo_tcp_phase_1=any(in_eebo_tcp_phase_1),certain=any(first_year_publication & certain),.groups="drop") %>%
    mutate(group = "Works")
) %>%
  mutate(type=fct_relevel(type,"Pamphlet (edition-level)","Book (edition-level)","Pamphlet","Book")) %>%
  filter(certain) %>% 
  filter(!is.na(type),type!="In-between") %>% 
  group_by(publication_year, type, in_eebo_tcp_phase_1) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo_tcp_phase_1) %>%
  ggplot(aes(x = publication_year, y = prop, color = type)) +
  geom_smooth(aes(weight = n, fill = type), span = 0.3, method='loess',formula=y~x) +
  geom_point(color = "gray", shape = 21, aes(size = tn)) +
  geom_point(aes(size = n)) +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.10)) +
  xlab("Year") +
  ylab("EEBO-TCP phase 1 coverage") +
  theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
  scale_size(breaks = c(500, 2000, 3500), range = c(0.1, 8.0))
```

Looking only at phase 1, there is a clear bump in the representation of pamphlets in the 1560s, which interestingly is corrected for when taking in also phase 2. In terms of book coverage, there is also a linear decline in coverage between about 1560 and 1650 (before 1540 the data is so sparse that reliable conclusions cannot be drawn from it).

# Topical coverage EEBO-TCP vs EEBO

## EEBO work-level genre use frequencies

(subset that is in ESTC to get the work information)

```{r,fig.width=7,fig.height=5}
eebo_ustc_genres %>% 
  inner_join(eebo_core,by=c("eebo_id")) %>%
  inner_join(estc_core,by=c("estc_id")) %>% 
  mutate(ustc_genre=str_trunc(ustc_genre,65),status=case_when(
    in_eebo_tcp_phase_1 ~ "EEBO-TCP phase 1",
    in_eebo_tcp_phase_2 ~ "EEBO-TCP phase 2",
    T ~ "Not in EEBO-TCP")) %>%
  group_by(ustc_genre,status) %>%
  summarize(n=n_distinct(work_id),.groups="drop") %>%
  group_by(ustc_genre) %>%
  mutate(tn=sum(n)) %>%
  ungroup() %>%
  mutate(ustc_genre=fct_reorder(ustc_genre,tn)) %>%
  ggplot(aes(x=ustc_genre,y=n,fill=status)) + 
  geom_col() +
  theme_hsci_discrete() +
  theme(legend.justification = c(1, 0), legend.box.just = "bottom", legend.position = c(0.98, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
  labs(fill=NULL) +  
  xlab("USTC genre") +
  ylab("Number of works") +
  scale_y_continuous(labels=scales::number) +
  coord_flip()
```

Open question: are the USTC categories usable? Is this a believable genre distribution? If it is, the graphs below show interesting differences and temporal shifts in the coverage of the various categories, the interpretation of which I leave up to you. 

```{r,include=FALSE}
library(googlesheets4)
set.seed(42)
eebo_ustc_genre_sample <- eebo_core %>% 
  filter(!is.na(proquest_url)) %>%
  distinct(eebo_id,proquest_url) %>%
  left_join(eebo_ustc_genres) %>% 
  group_by(ustc_genre) %>% 
  slice_sample(n=20) %>% 
  ungroup() %>%
  distinct(eebo_id,proquest_url) %>% 
  left_join(eebo_ustc_genres) %>% 
  distinct(eebo_id,proquest_url,ustc_genre) %>%
  group_by(eebo_id,proquest_url) %>% 
  summarize(ustc_genres=str_flatten(ustc_genre,collapse="|"),.groups="drop") %>%
  arrange(ustc_genres) %>%
  mutate(proquest_url=gs4_formula(str_c('=HYPERLINK("',proquest_url,'","',proquest_url,'")')))
#write_sheet(eebo_ustc_genre_sample,ss="1Hq2cva_K5JA5k0s8qHekEgwGBGt6MxCT_3cpVnLwP7w",sheet="eebo_ustc_genre_sample")
#gs4_create("eebo_ustc_genre_sample",sheets=eebo_ustc_genre_sample)
```


## EEBO-TCP work-level genre coverage

```{r,fig.width=7,fig.height=5}
eebo_ustc_genres %>% 
    inner_join(eebo_core,by=c("eebo_id")) %>%
    inner_join(estc_core,by=c("estc_id")) %>%
  mutate(ustc_genre=str_trunc(ustc_genre,65)) %>%
  group_by(work_id,ustc_genre) %>%
  summarize(in_eebo_tcp=any(in_eebo_tcp),in_eebo_tcp_phase_1=any(in_eebo_tcp_phase_1),in_eebo_tcp_phase_2=any(in_eebo_tcp_phase_2),.groups="drop") %>%
  group_by(ustc_genre) %>%
  summarize(n=n(),`EEBO-TCP`=sum(in_eebo_tcp)/n(),`EEBO-TCP phase 1`=sum(in_eebo_tcp_phase_1)/n(),`EEBO-TCP phase 2`=sum(in_eebo_tcp_phase_2)/n(),.groups="drop") %>%
  mutate(ustc_genre=fct_reorder(str_c(ustc_genre,' (',n,')'),n)) %>%
  pivot_longer(`EEBO-TCP phase 1`:`EEBO-TCP phase 2`,names_to="part", values_to = "prop") %>%
  ggplot(aes(x=ustc_genre,y=prop,fill=fct_relevel(part,'EEBO-TCP phase 2'))) + 
  geom_col(position='stack') + 
  theme_hsci_discrete() +
  scale_y_continuous(labels=scales::percent_format(accuracy=1)) +
  xlab("USTC genre") +
  ylab("Coverage by work") +
  theme(legend.position = "bottom") +
#  theme(legend.justification = c(1, 0), legend.box.just = "bottom", legend.position = c(0.98, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
  labs(fill=NULL) +
  coord_flip() 
```

Here, we can first see which categories have been excluded from EEBO-TCP: almanacs, academic dissertations, astrology and cosmography, as well as dictionaries. Apart from this, we also see how poetry and drama are heavily overemphasized in EEBO-TCP phase 1, whereas phase 2 corrects nicely for these as well as other imbalances. What remains interesting is the low coverage of dialectics and rhetoric, linguistics and philology, and classical authors.

## EEBO-TCP phase genre composition comparison

### TODO: To remove? Does this add any information? Within-dataset proportions are unintuitive to compare between datasets.

```{r,fig.width=8,fig.height=5}
eebo_ustc_genres %>%
  inner_join(eebo_tcp_core,by=c("eebo_id")) %>%
  inner_join(estc_core,by=c("estc_id")) %>%
  mutate(ustc_genre=str_trunc(ustc_genre,65)) %>%
  count(eebo_tcp_phase,ustc_genre) %>%
  group_by(eebo_tcp_phase) %>% mutate(prop=n/sum(n)) %>%
  mutate(ustc_genre=fct_reorder(ustc_genre,prop)) %>%
  ggplot(aes(x=ustc_genre,y=prop,fill=eebo_tcp_phase %>% recode("EEBO-TCP part 1"="EEBO-TCP phase 1","EEBO-TCP part 2"="EEBO-TCP phase 2"))) + 
  geom_col(position='dodge') + 
  theme_hsci_discrete() +
  scale_y_continuous(labels=scales::percent_format(accuracy=1)) +
  xlab("USTC genre") +
  ylab("Proportion of EEBO-TCP phase") +
  theme(legend.justification = c(1, 0), legend.box.just = "bottom", legend.position = c(0.98, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
  labs(fill=NULL) +
  coord_flip() 
```

## EEBO work-level language frequencies

```{r,fig.width=7,fig.height=5}
eebo_core %>%
  inner_join(estc_core,by=c("estc_id")) %>% 
  mutate(status=case_when(
    in_eebo_tcp_phase_1 ~ "EEBO-TCP phase 1",
    in_eebo_tcp_phase_2 ~ "EEBO-TCP phase 2",
    T ~ "Not in EEBO-TCP")) %>%
  group_by(eebo_tls_language,status) %>%
  summarize(n=n_distinct(work_id),.groups="drop") %>%
  group_by(eebo_tls_language) %>%
  mutate(tn=sum(n)) %>%
  ungroup() %>%
  mutate(eebo_tls_language=fct_reorder(eebo_tls_language,tn)) %>%
  ggplot(aes(x=eebo_tls_language,y=n,fill=status)) + 
  geom_col() +
  theme_hsci_discrete() +
  theme(legend.justification = c(1, 0), legend.box.just = "bottom", legend.position = c(0.98, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
  labs(fill=NULL) +  
  xlab("Language") +
  ylab("Number of works (log10)") +
  scale_y_continuous(labels=scales::number,trans="log10") +
  coord_flip()
```

## EEBO-TCP work-level language coverage vs EEBO

```{r,fig.width=7,fig.height=5}
eebo_core %>%
  filter(!is.na(eebo_tls_language)) %>%
  inner_join(estc_core,by=c("estc_id")) %>%
  group_by(work_id,eebo_tls_language) %>%
  summarize(in_eebo_tcp=any(in_eebo_tcp),in_eebo_tcp_phase_1=any(in_eebo_tcp_phase_1),in_eebo_tcp_phase_2=any(in_eebo_tcp_phase_2),.groups="drop") %>%
  group_by(eebo_tls_language) %>%
  summarize(n=n(),`EEBO-TCP`=sum(in_eebo_tcp)/n(),`EEBO-TCP phase 1`=sum(in_eebo_tcp_phase_1)/n(),`EEBO-TCP phase 2`=sum(in_eebo_tcp_phase_2)/n(),.groups="drop") %>%
  mutate(eebo_tls_language=fct_reorder(str_c(eebo_tls_language,' (',n,')'),n)) %>%
  pivot_longer(`EEBO-TCP phase 1`:`EEBO-TCP phase 2`,names_to="part", values_to = "prop") %>%
  ggplot(aes(x=eebo_tls_language,y=prop,fill=part)) + 
  geom_col(position='stack') + 
  theme_hsci_discrete() +
  scale_y_continuous(labels=scales::percent_format(accuracy=1)) +
  xlab("Language") +
  ylab("Coverage by work as compared to EEBO") +
  theme(legend.justification = c(1, 0), legend.box.just = "bottom", legend.position = c(0.98, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
  labs(fill=NULL) +
  coord_flip() 
```

Welsh and Scottish are very well covered. Of the major languages, Latin is very poorly covered overall, and especially in phase 2 (as expected from the background information at https://textcreationpartnership.org/tcp-texts/eebo-tcp-early-english-books-online/). French fares somewhat better, but its coverage is still not great.
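
As a reminder of what "coverage by work" means in the graph above, here is a minimal sketch on made-up data. The `toy` table and its values are hypothetical and only mimic the shape of the real tables: a work counts as covered as soon as any one of its records is in EEBO-TCP, and coverage is the share of covered works within a language.

```r
library(dplyr)

# Hypothetical toy data: one row per EEBO record, not the real eebo_core table.
toy <- tibble(
  work_id     = c("w1", "w1", "w2", "w3", "w4", "w5"),
  language    = c("Latin", "Latin", "Latin", "Latin", "Welsh", "Welsh"),
  in_eebo_tcp = c(FALSE, TRUE, FALSE, FALSE, TRUE, TRUE)
)

coverage <- toy %>%
  # A work is covered if any of its records is in EEBO-TCP.
  group_by(work_id, language) %>%
  summarize(in_eebo_tcp = any(in_eebo_tcp), .groups = "drop") %>%
  # Coverage is the share of covered works within each language.
  group_by(language) %>%
  summarize(n_works = n(), coverage = mean(in_eebo_tcp), .groups = "drop")

print(coverage)
```

In this toy example Latin has three works, one of them covered, so its work-level coverage is 1/3, whereas counting records instead of works would give 1/4.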

## EEBO-TCP phase language composition comparison (against EEBO, English excluded)

### TODO: To remove? Does this add any information? Within-dataset proportions are unintuitive to compare between datasets.

```{r,fig.width=7,fig.height=4}
library(ggbreak)
eebo_core %>% 
  left_join(eebo_tcp_core %>% distinct(eebo_id,eebo_tcp_phase),by=c("eebo_id")) %>%
  replace_na(list(eebo_tcp_phase="In EEBO but not in EEBO-TCP")) %>%
  mutate(language=fct_lump_n(eebo_tls_language,7)) %>%
  count(eebo_tcp_phase,language) %>%
  group_by(eebo_tcp_phase) %>%
  mutate(prop=n/sum(n)) %>%
  ungroup() %>%
  filter(language!="English") %>%
  ggplot(aes(x=language,fill=eebo_tcp_phase,y=prop)) +
  scale_y_continuous(breaks=seq(0,1,by=0.005),labels=scales::percent_format(accuracy=0.1)) +
  scale_y_break(c(0.018,0.10)) +
  ylab("Percentage") +
  xlab("Language") +
  geom_col(position='dodge') +
  theme_hsci_discrete() +
  theme(legend.position="bottom") +
  labs(fill=NULL)
```
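
To make the concern about within-dataset proportions concrete, the following sketch contrasts composition (the share of a dataset that is in a given language, which is what the graph above shows) with coverage (the share of a language that ends up in a given dataset). The counts are invented purely for illustration.

```r
library(dplyr)

# Hypothetical record counts per language and dataset; not real figures.
toy <- tibble(
  language = c("English", "Latin", "English", "Latin"),
  dataset  = c("EEBO-TCP", "EEBO-TCP", "Not in EEBO-TCP", "Not in EEBO-TCP"),
  n        = c(90, 10, 60, 40)
)

shares <- toy %>%
  group_by(dataset) %>%
  mutate(composition = n / sum(n)) %>%  # share of the dataset that is in this language
  group_by(language) %>%
  mutate(coverage = n / sum(n)) %>%     # share of this language that falls in this dataset
  ungroup()

print(shares)
```

With these numbers, Latin makes up 10% of the toy EEBO-TCP but 20% of all Latin records are in it. The two figures answer different questions, which is why composition bars are hard to compare across datasets.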

## EEBO-TCP phase edition type composition comparison

### TODO: To remove? This doesn't seem to add any information to the "Coverage of different publication types" graphs and within-dataset proportions are unintuitive to compare between datasets.

```{r}
bind_rows(
  df %>% filter(edition_type=="Singular") %>% select(work_id,publication_year,edition_type),
  df %>% 
    filter(edition_type!="Singular") %>% 
    group_by(work_id,type,first_publication_year,publication_year) %>%
    mutate(edition_type=if_else(publication_year==first_publication_year,"First year work","Later work")) %>%
    ungroup() %>%
    distinct(work_id,publication_year,edition_type)
) %>%
  mutate(edition_type=fct_relevel(edition_type,"Singular","First year work","Later work")) %>%
  inner_join(estc_core %>% select(estc_id,work_id),by=c("work_id")) %>%
  inner_join(eebo_core,by=c("estc_id")) %>%
  left_join(eebo_tcp_core %>% distinct(eebo_id,eebo_tcp_phase),by=c("eebo_id")) %>%
  replace_na(list(eebo_tcp_phase="In EEBO but not in EEBO-TCP")) %>%
  count(edition_type,eebo_tcp_phase) %>% 
  group_by(eebo_tcp_phase) %>%
  mutate(prop=n/sum(n)) %>%
  ungroup() %>%
  ggplot(aes(x=edition_type,fill=eebo_tcp_phase,y=prop)) +
  scale_y_continuous(breaks=seq(0,1,by=0.1),labels=scales::percent_format(accuracy=1)) +
  ylab("Percentage") +
  xlab("Edition type") +
  geom_col(position='dodge') +
  theme_hsci_discrete() +
  theme(legend.position="bottom") +
  labs(fill=NULL)  
```

## Genre coverage through time (EEBO-TCP against EEBO)

```{r,fig.width=7,fig.height=8}
eebo_core %>% 
  inner_join(df,by=c("estc_id")) %>%
  filter(first_publication_year>1474) %>%
  group_by(work_id,first_publication_year) %>%
  summarize(in_eebo=any(in_eebo),in_eebo_tcp_phase_1=any(in_eebo_tcp_phase_1),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
  filter(certain) %>% 
  left_join(eebo_ustc_genres %>% 
              inner_join(eebo_core,by=c("eebo_id")) %>%
              inner_join(estc_core,by=c("estc_id")) %>%
              distinct(work_id,ustc_genre) %>%
              mutate(ustc_genre=fct_lump_n(ustc_genre,10)),
            by=c("work_id")
            ) %>%
  mutate(ustc_genre=fct_explicit_na(ustc_genre,"Unknown")) %>%
  group_by(first_publication_year, ustc_genre) %>% 
  summarize(eebo_tcp_n=sum(in_eebo_tcp),eebo_tcp_prop=sum(in_eebo_tcp)/n(),eebo_tcp_phase_1_n=sum(in_eebo_tcp_phase_1),eebo_tcp_phase_1_prop=sum(in_eebo_tcp_phase_1)/n(),tn=n(),.groups="drop") %>%
  pivot_longer(eebo_tcp_n:eebo_tcp_phase_1_prop,names_to=c("part",".value"),names_pattern="(.*)_(.*)") %>%
  mutate(part=if_else(part=="eebo_tcp_phase_1","EEBO-TCP phase 1","EEBO-TCP")) %>%
  ggplot(aes(x = first_publication_year, y = prop, color = part)) +
  geom_point(color = "gray", shape = 21, aes(size = tn)) +
  geom_point(aes(size = n)) +
  geom_smooth(aes(weight = tn, fill = part), span = 0.3, method='loess',formula=y~x) +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 40)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.10)) +
  xlab("Year") +
  ylab("EEBO-TCP coverage") +
  theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
  coord_cartesian(ylim=c(0,1)) +
  scale_size(breaks = c(100, 500, 1000), range = c(0.1, 8.0)) +
  guides(fill="none") +
  facet_wrap(~ustc_genre,ncol=3)
```

In many major genre categories, such as religious, literature, and history and chronicles, phase 1 of EEBO-TCP shows clearly diminishing coverage toward the end of the 17th century. Adding phase 2, however, both improves coverage significantly overall and removes this bias.
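
The phase 1 and overall EEBO-TCP series in the graph come from the same summary table, reshaped with `pivot_longer()` and the `.value` sentinel. The sketch below reproduces that reshaping step on a hypothetical four-work table (the values are made up) to show how the column names are split.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row per work, not the real analysis table.
toy <- tibble(
  first_publication_year = c(1690, 1690, 1690, 1690),
  in_eebo_tcp_phase_1    = c(TRUE, FALSE, FALSE, FALSE),
  in_eebo_tcp            = c(TRUE, TRUE, TRUE, FALSE)
)

long <- toy %>%
  group_by(first_publication_year) %>%
  summarize(
    eebo_tcp_n            = sum(in_eebo_tcp),
    eebo_tcp_prop         = mean(in_eebo_tcp),
    eebo_tcp_phase_1_n    = sum(in_eebo_tcp_phase_1),
    eebo_tcp_phase_1_prop = mean(in_eebo_tcp_phase_1),
    .groups = "drop"
  ) %>%
  # The first (greedy) group captures everything before the last underscore,
  # so "eebo_tcp_phase_1_prop" becomes part "eebo_tcp_phase_1", column "prop".
  pivot_longer(
    eebo_tcp_n:eebo_tcp_phase_1_prop,
    names_to = c("part", ".value"),
    names_pattern = "(.*)_(.*)"
  )

print(long)
```

The result has one row per year and part, with `n` and `prop` as ordinary columns, which is the shape the plotting code needs for mapping `part` to colour.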

# Topical coverage of EEBO vs ESTC through time

Here, we project subject category information from EEBO/ECCO onto the whole of the ESTC so that their coverage can be compared by category. For the 18th century and ECCO, this seemed to work relatively well for all 8 categories. For USTC/EEBO, I was comfortable including only the Religious, History and chronicles, and Economics categories.
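
The chunks in this section combine directly known categories with projected ones in the same way: a directly known category always wins, and a projected category is used only when its `max_prop` is at least 0.7. The sketch below shows that rule on hypothetical tables. The `direct` and `projected` names and all values are made up, and reading `max_prop` as a confidence score for the projected label is my assumption.

```r
library(dplyr)

# Hypothetical toy tables; the values are invented for illustration.
direct <- tibble(
  work_id  = c("w1", "w2"),
  category = c("Religious", "Economics")
)
projected <- tibble(
  work_id  = c("w2", "w3", "w4"),
  category = c("Religious", "History and chronicles", "Religious"),
  max_prop = c(0.9, 0.8, 0.55)  # assumed to measure confidence in the projection
)

combined <- direct %>%
  bind_rows(
    projected %>%
      filter(max_prop >= 0.7) %>%        # keep only confident projections
      select(-max_prop) %>%
      anti_join(direct, by = "work_id")  # never override a directly known category
  )

print(combined)
```

Here `w2` keeps its direct category despite a confident projection, `w3` receives a projected category, and `w4` stays unlabelled because its projection falls below the threshold.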

## Using projected ECCO modules

```{r,include=F}
pak::pkg_install("COMHIS/eccor")
library(eccor)
ecco_core <- load_ecco_core()
combined_projected_ecco_modules <- ecco_core %>% 
  inner_join(estc_core,by=c("estc_id")) %>%
  distinct(work_id,projected_ecco_module=ecco_module)

combined_projected_ecco_modules <- combined_projected_ecco_modules %>%
  bind_rows(estc_projected_ecco_modules %>% 
    filter(max_prop>=0.7) %>%
      select(-max_prop) %>%
      anti_join(combined_projected_ecco_modules,by=c("work_id")))
```

```{r,fig.width=7,fig.height=8}
df %>% 
  filter(first_publication_year>1474) %>%
  group_by(work_id,first_publication_year) %>%
  summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
  filter(certain) %>% 
  left_join(combined_projected_ecco_modules,by=c("work_id")) %>%
  replace_na(list(projected_ecco_module="Other/Unknown")) %>%
  mutate(projected_ecco_module=fct_relevel(projected_ecco_module,"Other/Unknown")) %>%
  group_by(first_publication_year, projected_ecco_module, in_eebo) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo) %>%
  ggplot(aes(x = first_publication_year, y = prop, color = projected_ecco_module)) +
  geom_point(color = "gray", shape = 21, aes(size = tn)) +
  geom_point(aes(size = n)) +
  geom_smooth(aes(weight = n, fill = projected_ecco_module), span = 0.3, method='loess',formula=y~x) +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 40)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.10)) +
  xlab("Year") +
  ylab("EEBO coverage") +
  theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
  coord_cartesian(ylim=c(0.5,1)) +  
  scale_size(breaks = c(100, 500, 1000), range = c(0.1, 8.0)) + 
  guides(color="none",fill="none") +
  facet_wrap(~projected_ecco_module,ncol=3)
```
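
The `group_by() %>% tally() %>% mutate()` idiom in the chunk above relies on `tally()` dropping the last grouping variable, so that `prop` is computed within each year and category. The sketch below shows this on a hypothetical table, where `module` stands in for `projected_ecco_module` and the values are made up.

```r
library(dplyr)

# Hypothetical toy data: one row per work.
toy <- tibble(
  first_publication_year = c(1600, 1600, 1600, 1600, 1601),
  module  = c("A", "A", "A", "B", "A"),
  in_eebo = c(TRUE, TRUE, FALSE, TRUE, FALSE)
)

coverage <- toy %>%
  group_by(first_publication_year, module, in_eebo) %>%
  tally() %>%                                 # still grouped by year and module
  mutate(prop = n / sum(n), tn = sum(n)) %>%  # so sum(n) is the year/module total
  filter(in_eebo) %>%
  ungroup()

print(coverage)
```

One consequence worth keeping in mind when reading the graphs: a year and category with no works in EEBO has no `in_eebo == TRUE` row, so it disappears after the filter instead of being plotted at 0% (the 1601 row in this toy example).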

## Using projected USTC Religious/History and chronicles/Economics

```{r,include=F}
combined_projected_ustc_genres <- eebo_ustc_genres %>%
  inner_join(eebo_core,by=c("eebo_id")) %>%
  inner_join(estc_core,by=c("estc_id")) %>%
  distinct(work_id,projected_ustc_genre=ustc_genre)

combined_projected_ustc_genres <- combined_projected_ustc_genres %>%
  bind_rows(estc_projected_ustc_genres %>% 
    filter(max_prop>=0.7) %>%
      select(-max_prop) %>%
      anti_join(combined_projected_ustc_genres,by=c("work_id")))
```

```{r,fig.width=6,fig.height=3}
df %>% 
  filter(first_publication_year>1474) %>%
  group_by(work_id,first_publication_year) %>%
  summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
  filter(certain) %>% 
  left_join(combined_projected_ustc_genres %>% 
  filter(projected_ustc_genre %in% c("Religious","History and chronicles","Economics")),by=c("work_id")) %>%
  replace_na(list(projected_ustc_genre="Other/Unknown")) %>%
  mutate(projected_ustc_genre=fct_relevel(projected_ustc_genre,"Other/Unknown")) %>%
  group_by(first_publication_year, projected_ustc_genre, in_eebo) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo) %>%
  ggplot(aes(x = first_publication_year, y = prop, color = projected_ustc_genre)) +
  geom_point(color = "gray", shape = 21, aes(size = tn)) +
  geom_point(aes(size = n)) +
  geom_smooth(aes(weight = n, fill = projected_ustc_genre), span = 0.3, method='loess',formula=y~x) +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.10)) +
  xlab("Year") +
  ylab("EEBO coverage") +
  theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
  coord_cartesian(ylim=c(0,1)) +
  scale_size(breaks = c(100, 500, 1000), range = c(0.1, 8.0))
```

# Topical coverage of EEBO-TCP vs ESTC through time

## Using projected ECCO modules

```{r,fig.width=7,fig.height=8}
df %>% 
  filter(first_publication_year>1474) %>%
  group_by(work_id,first_publication_year) %>%
  summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
  filter(certain) %>% 
  left_join(combined_projected_ecco_modules,by=c("work_id")) %>%
  replace_na(list(projected_ecco_module="Other/Unknown")) %>%
  mutate(projected_ecco_module=fct_relevel(projected_ecco_module,"Other/Unknown")) %>%
  group_by(first_publication_year, projected_ecco_module, in_eebo_tcp) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo_tcp) %>%
  ggplot(aes(x = first_publication_year, y = prop, color = projected_ecco_module)) +
  geom_point(color = "gray", shape = 21, aes(size = tn)) +
  geom_point(aes(size = n)) +
  geom_smooth(aes(weight = n, fill = projected_ecco_module), span = 0.3, method='loess',formula=y~x) +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 40)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.10)) +
  xlab("Year") +
  ylab("EEBO-TCP coverage") +
  theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
  coord_cartesian(ylim=c(0,1)) +
  scale_size(breaks = c(100, 500, 1000), range = c(0.1, 8.0)) +
  guides(color="none",fill="none") +
  facet_wrap(~projected_ecco_module,ncol=3)
```

## Using projected USTC Religious/History and chronicles/Economics

```{r,fig.width=6,fig.height=3}
df %>% 
  filter(first_publication_year>1474) %>%
  group_by(work_id,first_publication_year) %>%
  summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
  filter(certain) %>% 
  left_join(combined_projected_ustc_genres %>% 
  filter(projected_ustc_genre %in% c("Religious","History and chronicles","Economics")),by=c("work_id")) %>%
  replace_na(list(projected_ustc_genre="Other/Unknown")) %>%
  mutate(projected_ustc_genre=fct_relevel(projected_ustc_genre,"Other/Unknown")) %>%
  group_by(first_publication_year, projected_ustc_genre, in_eebo_tcp) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo_tcp) %>%
  ggplot(aes(x = first_publication_year, y = prop, color = projected_ustc_genre)) +
  geom_point(color = "gray", shape = 21, aes(size = tn)) +
  geom_point(aes(size = n)) +
  geom_smooth(aes(weight = n, fill = projected_ustc_genre), span = 0.3, method='loess',formula=y~x) +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.10)) +
  xlab("Year") +
  ylab("EEBO-TCP coverage") +
  theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
  coord_cartesian(ylim=c(0,1)) +
  scale_size(breaks = c(100, 500, 1000), range = c(0.1, 8.0))
```

Compare this with the raw EEBO-TCP vs EEBO coverage graphs as well as with the ECCO module coverage graphs.