Overview

Out of a total of 132,846 EEBO records, 60,227 (45.34%) are in EEBO-TCP (but 66 EEBO records have multiple TCP ids).

Out of the 132,846 EEBO records, 6,802 (5.12%) could not be matched to an ESTC record and will be left out of the analysis. On the other hand, 7,373 EEBO records (5.55%) were matched to more than one ESTC record, possibly causing bias.

Out of the 60,327 EEBO-TCP records, 1,143 (1.89%) could not be matched to an ESTC record and will be left out of the analysis. On the other hand, 3,269 EEBO-TCP records (5.42%) were matched to more than one ESTC record, possibly causing bias.

In the analysis, only ESTC records with publication years in the range [1474,1700) have been included. This results in the exclusion of 4,862 (4.17%) ESTC records that have representation in EEBO, possibly causing bias. 2,119 (3.41%) of the ESTC records with representation in EEBO-TCP are removed due to this filtering condition.

In the end, our working dataset:

Publication type analysis

Coverage of different publication types in EEBO

library(ggbeeswarm)
bind_rows(
  df %>% mutate(group = "Editions"),
  df %>% group_by(work_id,type,first_publication_year,publication_year) %>%
    summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),.groups="drop") %>%
    mutate(group = "Works",edition_type=if_else(publication_year==first_publication_year,"First year work","Later work"))
) %>%
  mutate(edition_type=fct_relevel(edition_type,"Singular","First year work","Later work","First year edition","Later edition")) %>%
  filter(type %in% c("Book","Pamphlet")) %>%
  group_by(publication_year, edition_type, group, type, in_eebo) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo) %>%
  ggplot(aes(x = type, y = prop, group = edition_type, color = edition_type)) +
  geom_quasirandom(aes(size = tn), dodge = 1.0) +
  stat_summary(aes(group = edition_type), position = position_dodge(width = 1.0), fun = median, fun.min = median, fun.max = median, geom = "crossbar", width = 0.5, color = "red") +
  theme_hsci_discrete() +
  xlab(NULL) +
  ylab("EEBO coverage") +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
  scale_size(breaks = c(250, 500, 1500), range = c(0.1, 8.0)) +
  theme(legend.justification = c(0, 0), legend.position = c(0.02, 0.02), legend.background = element_blank(), legend.box.just = "bottom", legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = "Representation type", size = "Count") +
  guides(shape = "none")

In terms of coverage of ESTC’s pre-18th-century material, EEBO is quite good, with a median coverage of about 95% of books at both the edition and the work level, and only a slight drop in coverage for later year editions (meaning that even for later editions, EEBO often contains at least one edition from each year, but may not contain all distinct printings from that year).

For pamphlets, coverage is about 85% across the board, with an interesting increase for later year editions (this may be because reprinted pamphlets were thought of as important to capture, or due to e.g. temporal artifacts, although overall coverage does not appear to improve with time, as seen later).

Coverage of different publication types in EEBO-TCP

library(ggbeeswarm)
bind_rows(
  df %>% mutate(group = "Editions"),
  df %>% group_by(work_id,type,first_publication_year,publication_year) %>%
    summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),.groups="drop") %>%
    mutate(group = "Works",edition_type=if_else(publication_year==first_publication_year,"First year work","Later work"))
) %>%
  mutate(edition_type=fct_relevel(edition_type,"Singular","First year work","Later work","First year edition","Later edition")) %>%
  filter(type %in% c("Book","Pamphlet")) %>%
  group_by(publication_year, edition_type, group, type, in_eebo_tcp) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo_tcp) %>%
  ggplot(aes(x = type, y = prop, group = edition_type, color = edition_type)) +
  geom_quasirandom(aes(size = tn), dodge = 1.0) +
  stat_summary(aes(group = edition_type), position = position_dodge(width = 1.0), fun = median, fun.min = median, fun.max = median, geom = "crossbar", width = 0.5, color = "red") +
  theme_hsci_discrete() +
  xlab(NULL) +
  ylab("EEBO-TCP coverage") +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
  scale_size(breaks = c(250, 500, 1500), range = c(0.1, 8.0)) +
  theme(legend.justification = c(0, 0), legend.position = c(0.02, 0.02), legend.background = element_blank(), legend.box.just = "bottom", legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = "Representation type", size = "Count") +
  guides(shape = "none")

For coverage in EEBO-TCP, a clear pattern emerges where coverage of singular and first editions is much better than coverage of later editions. This has an important bearing on all following analyses, which, in the case of EEBO-TCP, should mostly evaluate coverage at the work level. As a separate observation, coverage of books and pamphlets seems quite even.

Edition-level temporal overview

df %>%  mutate(g = case_when(
  !certain ~ "Uncertain dating",
  in_eebo_tcp  ~ "In EEBO-TCP",
  in_eebo ~ "In EEBO",
  TRUE ~ "ESTC total",
)) %>%
  ggplot(aes(x = publication_year, fill = fct_relevel(g, "Uncertain dating", "ESTC total", "In EEBO","In EEBO-TCP"))) +
  geom_bar(width = 1) +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(breaks = seq(0, 10000, by = 1000)) +
  xlab("Year") +
  ylab("ESTC entries") +
  theme(legend.justification = c(0, 1), legend.position = c(0.05, 0.95), legend.background = element_blank(), legend.key = element_blank()) +
  labs(fill = NULL) +
  guides(fill = guide_legend(reverse = TRUE))

In terms of a temporal overview, it is important to note that in this absolute graph the number of entries grows significantly over time, with large variations and multiple spikes between 1640 and 1700 (the larger bump between 1640 and 1660 most likely consisting mainly of the Thomason Tracts).

df %>% filter(certain) %>% mutate(g = case_when(
  in_eebo_tcp  ~ "In EEBO-TCP",
  in_eebo ~ "In EEBO",
  TRUE ~ "Not in EEBO",
)) %>%
  ggplot(aes(x = publication_year, fill = fct_relevel(g, "Not in EEBO", "In EEBO","In EEBO-TCP"))) +
  geom_bar(width = 1,position='fill') +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(breaks = seq(0, 1, by = 0.1),labels=scales::percent_format(accuracy=1)) +
  xlab("Year") +
  ylab("Proportion of ESTC entries") +
  theme(legend.justification = c(1, 0), legend.position = c(0.94, 0.08), legend.key = element_blank()) +
  labs(fill = NULL) +
  guides(fill = guide_legend(reverse = TRUE))

In terms of edition-level proportional coverage, EEBO coverage is quite balanced throughout the period, with just a slight drop at the end of the 17th century. For EEBO-TCP, edition-level coverage varies much more, but as noted, edition-level coverage is not a very meaningful measure for it.

Work-level temporal overview

df %>% 
  filter(first_publication_year>1474) %>%
  group_by(work_id,first_publication_year) %>%
  summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>% 
  mutate(g = case_when(
    !certain ~ "Uncertain dating",
    in_eebo_tcp  ~ "In EEBO-TCP",
    in_eebo ~ "In EEBO",
    TRUE ~ "ESTC total",
  )) %>% 
  ggplot(aes(x = first_publication_year, fill = fct_relevel(g, "Uncertain dating", "ESTC total", "In EEBO","In EEBO-TCP"))) +
  geom_bar(width = 1) +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(breaks = seq(0, 10000, by = 1000)) +
  xlab("Year") +
  ylab("ESTC entries") +
  theme(legend.justification = c(0, 1), legend.position = c(0.05, 0.95), legend.background = element_blank(), legend.key = element_blank()) +
  labs(fill = NULL) +
  guides(fill = guide_legend(reverse = TRUE))

df %>% 
  filter(first_publication_year>1474) %>%
  group_by(work_id,first_publication_year) %>%
  summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),.groups="drop") %>%
  mutate(g = case_when(
    in_eebo_tcp  ~ "In EEBO-TCP",
    in_eebo ~ "In EEBO",
    TRUE ~ "Not in EEBO",
  )) %>%
  ggplot(aes(x = first_publication_year, fill = fct_relevel(g, "Not in EEBO", "In EEBO","In EEBO-TCP"))) +
  geom_bar(width = 1,position='fill') +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(breaks = seq(0, 1, by = 0.1),labels=scales::percent_format(accuracy=1)) +
  xlab("Year of first publication") +
  ylab("Proportion of ESTC works") +
  theme(legend.justification = c(1, 0), legend.position = c(0.94, 0.08), legend.key = element_blank()) +
  labs(fill = NULL) +
  guides(fill = guide_legend(reverse = TRUE))  

In terms of work-level coverage, EEBO-TCP also appears quite well balanced temporally, apart from dips between 1500 and 1530. However, the total amount of content is also very low for those early years, so larger variation is to be expected.
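
The remark about low early-year totals can be made concrete with the binomial standard error of a proportion. This is only a rough sketch: the coverage level and yearly counts below are hypothetical, and treating works as independent draws is an assumption, not something the data guarantees.

```r
# Standard error of an observed proportion p from n independent items
# (binomial approximation; independence of works is an assumption)
prop_se <- function(p, n) sqrt(p * (1 - p) / n)

p <- 0.4                      # hypothetical coverage level, not a figure from the data
se_small <- prop_se(p, 10)    # a year with only 10 works
se_large <- prop_se(p, 1000)  # a year with 1,000 works

# The ratio depends only on the counts: sqrt(1000 / 10) = 10
se_ratio <- se_small / se_large
```

Under these assumptions, a year with a hundredth of the works shows ten times the sampling noise in its coverage estimate, which is why dips in the sparse early decades should be read cautiously.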

Document type coverage through time

bind_rows(
  df %>% mutate(group = "Editions",type=recode(type,"Book"="Book (edition-level)","Pamphlet"="Pamphlet (edition-level)")),
  df %>% group_by(work_id,type,first_publication_year,publication_year) %>%
    summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
    mutate(group = "Works")
) %>%
  mutate(type=fct_relevel(type,"Pamphlet (edition-level)","Book (edition-level)","Pamphlet","Book")) %>%
  filter(certain) %>% 
  filter(!is.na(type),type!="In-between") %>% 
  group_by(publication_year, type, in_eebo) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo) %>%
  ggplot(aes(x = publication_year, y = prop, color = type)) +
  geom_smooth(aes(weight = n, fill = type), span = 0.3, method='loess',formula=y~x) +
  geom_point(color = "gray", shape = 21, aes(size = tn)) +
  geom_point(aes(size = n)) +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
  xlab("Year") +
  ylab("EEBO coverage") +
  theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
  scale_size(breaks = c(500, 2000, 3500), range = c(0.1, 8.0))

Drilling in and separating books and pamphlets from each other, we can see that EEBO coverage of both is very good, apart from a noticeable drop in pamphlet coverage in the late 17th century.

bind_rows(
  df %>% mutate(group = "Editions",type=recode(type,"Book"="Book (edition-level)","Pamphlet"="Pamphlet (edition-level)")),
  df %>% group_by(work_id,type,first_publication_year,publication_year) %>%
    summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
    mutate(group = "Works")
) %>%
  mutate(type=fct_relevel(type,"Pamphlet (edition-level)","Book (edition-level)","Pamphlet","Book")) %>%
  filter(certain) %>% 
  filter(!is.na(type),type!="In-between") %>% 
  group_by(publication_year, type, in_eebo_tcp) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo_tcp) %>%
  ggplot(aes(x = publication_year, y = prop, color = type)) +
  geom_smooth(aes(weight = n, fill = type), span = 0.3, method='loess',formula=y~x) +
  geom_point(color = "gray", shape = 21, aes(size = tn)) +
  geom_point(aes(size = n)) +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
  xlab("Year") +
  ylab("EEBO-TCP coverage") +
  theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
  scale_size(breaks = c(500, 2000, 3500), range = c(0.1, 8.0))

For EEBO-TCP, on the work level, the same drop in coverage for pamphlets at the end of the 17th century can be seen, but otherwise coverage is relatively stable through time for both books and pamphlets, except for a marked uptick between 1640 and 1660 (most likely caused by more judicious inclusion of the Thomason Tracts). On the work level, pamphlets are just slightly better covered than books, but on the edition level, coverage of books is much lower. This is a natural consequence of EEBO-TCP favouring the inclusion of first editions only. Books typically have more editions than pamphlets, so excluding later editions affects edition-level coverage for books much more than it does for pamphlets.
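
The arithmetic behind this argument can be sketched as follows. The edition counts are made up for illustration, and the rule that exactly one edition of every work is transcribed is a simplifying assumption rather than the actual EEBO-TCP selection policy.

```r
# Hypothetical edition counts per work; not taken from the ESTC data
book_editions <- c(1, 4, 6, 9)       # books are reprinted more often
pamphlet_editions <- c(1, 1, 1, 2)   # pamphlets are mostly printed once

# Assumption: exactly one edition (the first) of every work is in EEBO-TCP,
# so covered editions = number of works, total editions = sum of counts
edition_coverage <- function(editions) length(editions) / sum(editions)

book_cov <- edition_coverage(book_editions)          # 4 / 20
pamphlet_cov <- edition_coverage(pamphlet_editions)  # 4 / 5
```

Work-level coverage is 100% for both types by construction here, yet edition-level coverage differs by a factor of four purely because of how often each type is reprinted.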

Topical coverage EEBO-TCP vs EEBO

EEBO work-level genre use frequencies

(subset that is in ESTC to get the work information)

eebo_ustc_genres %>% 
  inner_join(eebo_core,by=c("eebo_id")) %>%
  inner_join(estc_core,by=c("estc_id")) %>% 
  group_by(ustc_genre,in_eebo_tcp) %>%
  summarize(n=n_distinct(work_id),.groups="drop") %>%
  group_by(ustc_genre) %>%
  mutate(tn=sum(n)) %>%
  ungroup() %>%
  mutate(ustc_genre=fct_reorder(ustc_genre,tn)) %>%
  ggplot(aes(x=ustc_genre,y=n,fill=in_eebo_tcp)) + 
  geom_col(show.legend=F) + 
  xlab("USTC genre") +
  ylab("Number of works") +
  scale_y_continuous(labels=scales::number) +
  theme_hsci_discrete() +
  coord_flip()

Open question: are the USTC categories usable? Is this a believable genre distribution? If it is, the graphs below show interesting differences and temporal shifts in the coverage of the various categories, the interpretation of which I leave up to you.

EEBO-TCP work-level genre coverage

eebo_ustc_genres %>% 
    inner_join(eebo_core,by=c("eebo_id")) %>%
    inner_join(estc_core,by=c("estc_id")) %>%
  group_by(work_id,ustc_genre) %>%
  summarize(in_eebo_tcp=any(!is.na(eebo_tcp_id)),.groups="drop") %>%
  count(ustc_genre,in_eebo_tcp) %>% 
  group_by(ustc_genre) %>%
  mutate(prop=n/sum(n)) %>%
  ungroup() %>%
  filter(in_eebo_tcp) %>%
  mutate(ustc_genre=fct_reorder(ustc_genre,prop)) %>%
  ggplot(aes(x=ustc_genre,y=prop)) + 
  geom_col() + 
  theme_hsci_discrete() +
  scale_y_continuous(labels=scales::percent_format(accuracy=1)) +
  xlab("USTC genre") +
  ylab("Coverage in EEBO-TCP by work") +
  coord_flip() 

Genre coverage through time

eebo_core %>% 
  inner_join(df,by=c("estc_id")) %>%
  filter(first_publication_year>1474) %>%
  group_by(work_id,first_publication_year) %>%
  summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
  filter(certain) %>% 
  left_join(eebo_ustc_genres %>% 
              inner_join(eebo_core,by=c("eebo_id")) %>%
              inner_join(estc_core,by=c("estc_id")) %>%
              distinct(work_id,ustc_genre) %>%
              mutate(ustc_genre=fct_lump_n(ustc_genre,10)),
            by=c("work_id")
            ) %>%
  mutate(ustc_genre=fct_explicit_na(ustc_genre,"Unknown")) %>%
  group_by(first_publication_year, ustc_genre, in_eebo_tcp) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo_tcp) %>%
  ggplot(aes(x = first_publication_year, y = prop, color = ustc_genre)) +
  geom_point(color = "gray", shape = 21, aes(size = tn)) +
  geom_point(aes(size = n)) +
  geom_smooth(aes(weight = n, fill = ustc_genre), span = 0.3, method='loess',formula=y~x) +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
  xlab("Year") +
  ylab("EEBO-TCP coverage") +
  theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
  coord_cartesian(ylim=c(0,1)) +
  scale_size(breaks = c(100, 500, 1000), range = c(0.1, 8.0)) +
  guides(color="none",fill="none") +
  facet_wrap(~ustc_genre)

Topical coverage of EEBO vs ESTC through time

Here, we are projecting subject category information from EEBO/ECCO throughout the whole of the ESTC in order to compare their coverage. For the 18th century and ECCO, this seemed to work relatively well for all 8 categories. For USTC/EEBO, I was comfortable including only the Religious, History and chronicles, and Economics categories.

Using projected ECCO modules

df %>% 
  filter(first_publication_year>1474) %>%
  group_by(work_id,first_publication_year) %>%
  summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
  filter(certain) %>% 
  left_join(combined_projected_ecco_modules,by=c("work_id")) %>%
  replace_na(list(projected_ecco_module="Other/Unknown")) %>%
  mutate(projected_ecco_module=fct_relevel(projected_ecco_module,"Other/Unknown")) %>%
  group_by(first_publication_year, projected_ecco_module, in_eebo) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo) %>%
  ggplot(aes(x = first_publication_year, y = prop, color = projected_ecco_module)) +
  geom_point(color = "gray", shape = 21, aes(size = tn)) +
  geom_point(aes(size = n)) +
  geom_smooth(aes(weight = n, fill = projected_ecco_module), span = 0.3, method='loess',formula=y~x) +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
  xlab("Year") +
  ylab("EEBO coverage") +
  theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
  coord_cartesian(ylim=c(0.5,1)) +  
  scale_size(breaks = c(100, 500, 1000), range = c(0.1, 8.0)) + 
  guides(color="none",fill="none") +
  facet_wrap(~projected_ecco_module)

Using projected USTC Religious/History and chronicles/Economics

df %>% 
  filter(first_publication_year>1474) %>%
  group_by(work_id,first_publication_year) %>%
  summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
  filter(certain) %>% 
  left_join(estc_projected_ustc_genres %>% 
  filter(max_prop>=0.7,projected_ustc_genre %in% c("Religious","History and chronicles","Economics")),by=c("work_id")) %>%
  replace_na(list(projected_ustc_genre="Other/Unknown")) %>%
  mutate(projected_ustc_genre=fct_relevel(projected_ustc_genre,"Other/Unknown")) %>%
  group_by(first_publication_year, projected_ustc_genre, in_eebo) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo) %>%
  ggplot(aes(x = first_publication_year, y = prop, color = projected_ustc_genre)) +
  geom_point(color = "gray", shape = 21, aes(size = tn)) +
  geom_point(aes(size = n)) +
  geom_smooth(aes(weight = n, fill = projected_ustc_genre), span = 0.3, method='loess',formula=y~x) +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
  xlab("Year") +
  ylab("EEBO coverage") +
  theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
  coord_cartesian(ylim=c(0,1)) +
  scale_size(breaks = c(100, 500, 1000), range = c(0.1, 8.0))

Topical coverage of EEBO-TCP vs ESTC through time

Using projected ECCO modules

df %>% 
  filter(first_publication_year>1474) %>%
  group_by(work_id,first_publication_year) %>%
  summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
  filter(certain) %>% 
  left_join(combined_projected_ecco_modules,by=c("work_id")) %>%
  replace_na(list(projected_ecco_module="Other/Unknown")) %>%
  mutate(projected_ecco_module=fct_relevel(projected_ecco_module,"Other/Unknown")) %>%
  group_by(first_publication_year, projected_ecco_module, in_eebo_tcp) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo_tcp) %>%
  ggplot(aes(x = first_publication_year, y = prop, color = projected_ecco_module)) +
  geom_point(color = "gray", shape = 21, aes(size = tn)) +
  geom_point(aes(size = n)) +
  geom_smooth(aes(weight = n, fill = projected_ecco_module), span = 0.3, method='loess',formula=y~x) +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
  xlab("Year") +
  ylab("EEBO-TCP coverage") +
  theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
  coord_cartesian(ylim=c(0,1)) +
  scale_size(breaks = c(100, 500, 1000), range = c(0.1, 8.0)) +
  guides(color="none",fill="none") +
  facet_wrap(~projected_ecco_module)

Using projected USTC Religious/History and chronicles/Economics

df %>% 
  filter(first_publication_year>1474) %>%
  group_by(work_id,first_publication_year) %>%
  summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
  filter(certain) %>% 
  left_join(combined_projected_ustc_genres %>% 
  filter(projected_ustc_genre %in% c("Religious","History and chronicles","Economics")),by=c("work_id")) %>%
  replace_na(list(projected_ustc_genre="Other/Unknown")) %>%
  mutate(projected_ustc_genre=fct_relevel(projected_ustc_genre,"Other/Unknown")) %>%
  group_by(first_publication_year, projected_ustc_genre, in_eebo_tcp) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo_tcp) %>%
  ggplot(aes(x = first_publication_year, y = prop, color = projected_ustc_genre)) +
  geom_point(color = "gray", shape = 21, aes(size = tn)) +
  geom_point(aes(size = n)) +
  geom_smooth(aes(weight = n, fill = projected_ustc_genre), span = 0.3, method='loess',formula=y~x) +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
  xlab("Year") +
  ylab("EEBO-TCP coverage") +
  theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
  coord_cartesian(ylim=c(0,1)) +
  scale_size(breaks = c(100, 500, 1000), range = c(0.1, 8.0))

(compare this with the raw EEBO-TCP vs EEBO coverage as well as the ECCO module coverage graphs)

---
title: "EEBO/ESTC analysis"
output: 
  html_notebook: 
    code_folding: hide
    toc: yes
---

```{r setup,echo=F}
knitr::opts_knit$set(root.dir = here::here())
```

```{r,include=F}
library(tidyverse)
library(here)
pak::pkg_install("hsci-r/gghsci")
library(gghsci)
```

```{r,include=F}
p <- function(number) {
  return(format(number, scientific = FALSE, big.mark = ","))
}
pp <- function(percentage,accuracy=0.01) {
  return(scales::percent(percentage, accuracy = accuracy))
}
```

```{r,include=F}
source(here("code/load_and_prepare_data.R"), local = knitr::knit_global())
```


```{r,include=F}
library(assertthat)

n_eebo_ids <- eebo_core %>%
  distinct(eebo_id) %>%
  nrow()
n_eebo_tcp_ids <- eebo_tcp_core %>%
  distinct(eebo_tcp_id) %>%
  nrow()
n_eebo_ids_in_eebo_tcp <- eebo_core %>%
  filter(!is.na(eebo_tcp_id)) %>%
  distinct(eebo_id) %>%
  nrow()

assert_that(eebo_core %>% filter(!is.na(eebo_tcp_id)) %>% distinct(eebo_id, eebo_tcp_id) %>% count(eebo_tcp_id) %>% filter(n > 1) %>% nrow() == 0)

n_eebo_ids_multimapped_to_eebo_tcp <- eebo_core %>%
  filter(!is.na(eebo_tcp_id)) %>%
  distinct(eebo_id, eebo_tcp_id) %>%
  count(eebo_id) %>%
  filter(n > 1) %>%
  nrow()

n_eebo_ids_not_in_estc <- eebo_core %>%
  filter(is.na(estc_id)) %>%
  distinct(eebo_id) %>%
  nrow()
n_eebo_tcp_ids_not_in_estc <- eebo_tcp_core %>%
  filter(is.na(estc_id)) %>%
  distinct(eebo_tcp_id) %>%
  nrow()

n_eebo_ids_multimapped_to_estc <- eebo_core %>%
  filter(!is.na(estc_id)) %>%
  distinct(eebo_id, estc_id) %>%
  count(eebo_id) %>%
  filter(n > 1) %>%
  nrow()
n_eebo_tcp_ids_multimapped_to_estc <- eebo_tcp_core %>%
  filter(!is.na(estc_id)) %>%
  distinct(eebo_tcp_id, estc_id) %>%
  count(eebo_tcp_id) %>%
  filter(n > 1) %>%
  nrow()

n_estc_ids_with_eebo_ids <- estc_core %>%
  filter(in_eebo) %>%
  nrow()
n_estc_ids_in_df_with_eebo_ids <- df %>%
  filter(in_eebo) %>%
  nrow()
n_estc_ids_with_eebo_tcp_ids <- estc_core %>%
  filter(in_eebo_tcp) %>%
  nrow()
n_estc_ids_in_df_with_eebo_tcp_ids <- df %>%
  filter(in_eebo_tcp) %>%
  nrow()

n_estc_ids_in_df <- df %>% nrow()
n_eebo_ids_in_df <- df %>%
  inner_join(eebo_core, by = c("estc_id")) %>%
  distinct(eebo_id) %>%
  nrow()
n_eebo_tcp_ids_in_df <- df %>%
  inner_join(eebo_tcp_core, by = c("estc_id")) %>%
  distinct(eebo_tcp_id) %>%
  nrow()
```

# Overview

Out of a total of `r p(n_eebo_ids)` EEBO records, `r p(n_eebo_ids_in_eebo_tcp)` (`r pp(n_eebo_ids_in_eebo_tcp/n_eebo_ids)`) are in EEBO-TCP (but `r p(n_eebo_ids_multimapped_to_eebo_tcp)` EEBO records have multiple TCP ids).

Out of the `r p(n_eebo_ids)` EEBO records, `r p(n_eebo_ids_not_in_estc)`  (`r pp(n_eebo_ids_not_in_estc/n_eebo_ids)`) could not be matched to an ESTC record and will be left out of the analysis. On the other hand, `r p(n_eebo_ids_multimapped_to_estc)` EEBO records (`r pp(n_eebo_ids_multimapped_to_estc/n_eebo_ids)`) were matched to more than one ESTC record, possibly causing bias.

Out of the `r p(n_eebo_tcp_ids)` EEBO-TCP records, `r p(n_eebo_tcp_ids_not_in_estc)` (`r pp(n_eebo_tcp_ids_not_in_estc/n_eebo_tcp_ids)`) could not be matched to an ESTC record and will be left out of the analysis. On the other hand, `r p(n_eebo_tcp_ids_multimapped_to_estc)` EEBO-TCP records (`r pp(n_eebo_tcp_ids_multimapped_to_estc/n_eebo_tcp_ids)`) were matched to more than one ESTC record, possibly causing bias.

In the analysis, only ESTC records with publication years in the range [1474,1700) have been included. This results in the exclusion of `r p(n_estc_ids_with_eebo_ids-n_estc_ids_in_df_with_eebo_ids)` (`r pp((n_estc_ids_with_eebo_ids-n_estc_ids_in_df_with_eebo_ids)/n_estc_ids_with_eebo_ids)`) ESTC records that have representation in EEBO, possibly causing bias. `r p(n_estc_ids_with_eebo_tcp_ids-n_estc_ids_in_df_with_eebo_tcp_ids)` (`r pp((n_estc_ids_with_eebo_tcp_ids-n_estc_ids_in_df_with_eebo_tcp_ids)/n_estc_ids_with_eebo_tcp_ids)`) of the ESTC records with representation in EEBO-TCP are removed due to this filtering condition.
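
As a minimal sketch of what the half-open range means in practice: 1474 is kept and 1700 is dropped. The toy records are hypothetical; only the column names `publication_year` and `in_eebo` follow the notebook, and the real filtering happens in `load_and_prepare_data.R`, which may differ in detail.

```r
library(dplyr)

# Hypothetical toy records, for illustration only
toy <- tibble(
  estc_id = c("a", "b", "c", "d", "e"),
  publication_year = c(1473, 1474, 1650, 1699, 1700),
  in_eebo = c(TRUE, TRUE, TRUE, FALSE, TRUE)
)

# Half-open interval [1474, 1700): lower bound inclusive, upper bound exclusive
kept <- toy %>% filter(publication_year >= 1474, publication_year < 1700)

# Share of EEBO-represented records lost to the year filter
n_with_eebo <- sum(toy$in_eebo)
n_kept_with_eebo <- sum(kept$in_eebo)
excluded_share <- (n_with_eebo - n_kept_with_eebo) / n_with_eebo
```

The last three lines mirror how the exclusion percentages in the paragraph above are derived: records with representation before filtering, minus those remaining afterwards, divided by the former.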

In the end, our working dataset:

* From the viewpoint of EEBO, contains `r p(n_eebo_ids_in_df)` (`r pp(n_eebo_ids_in_df/n_eebo_ids)`) out of the original `r p(n_eebo_ids)` EEBO ids. 
* From the viewpoint of EEBO-TCP, contains `r p(n_eebo_tcp_ids_in_df)` (`r pp(n_eebo_tcp_ids_in_df/n_eebo_tcp_ids)`) out of the original `r p(n_eebo_tcp_ids)` EEBO-TCP ids.
* Consists of `r p(n_estc_ids_in_df)` ESTC records, of which `r p(n_estc_ids_in_df_with_eebo_ids)` (`r pp(n_estc_ids_in_df_with_eebo_ids/n_estc_ids_in_df)`) we estimate to have representation in EEBO, and `r p(n_estc_ids_in_df_with_eebo_tcp_ids)` (`r pp(n_estc_ids_in_df_with_eebo_tcp_ids/n_estc_ids_in_df)`) to have representation in EEBO-TCP.

# Publication type analysis

## Coverage of different publication types in EEBO

```{r}
library(ggbeeswarm)
bind_rows(
  df %>% mutate(group = "Editions"),
  df %>% group_by(work_id,type,first_publication_year,publication_year) %>%
    summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),.groups="drop") %>%
    mutate(group = "Works",edition_type=if_else(publication_year==first_publication_year,"First year work","Later work"))
) %>%
  mutate(edition_type=fct_relevel(edition_type,"Singular","First year work","Later work","First year edition","Later edition")) %>%
  filter(type %in% c("Book","Pamphlet")) %>%
  group_by(publication_year, edition_type, group, type, in_eebo) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo) %>%
  ggplot(aes(x = type, y = prop, group = edition_type, color = edition_type)) +
  geom_quasirandom(aes(size = tn), dodge = 1.0) +
  stat_summary(aes(group = edition_type), position = position_dodge(width = 1.0), fun = median, fun.min = median, fun.max = median, geom = "crossbar", width = 0.5, color = "red") +
  theme_hsci_discrete() +
  xlab(NULL) +
  ylab("EEBO coverage") +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
  scale_size(breaks = c(250, 500, 1500), range = c(0.1, 8.0)) +
  theme(legend.justification = c(0, 0), legend.position = c(0.02, 0.02), legend.background = element_blank(), legend.box.just = "bottom", legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = "Representation type", size = "Count") +
  guides(shape = "none")
```

In terms of coverage of ESTC's pre-18th-century material, EEBO is quite good, with a median coverage of about 95% of books at both the edition and the work level, and only a slight drop in coverage for later year editions (meaning that even for later editions, EEBO often contains at least one edition from each year, but may not contain all distinct printings from that year).

For pamphlets, coverage is about 85% across the board, with an interesting increase for later year editions (this may be because reprinted pamphlets were thought of as important to capture, or due to e.g. temporal artifacts, although overall coverage does not appear to improve with time, as seen later).
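
Each point in the plot is one per-year coverage proportion. The sketch below shows, on hypothetical toy data, how the `group_by()` / `tally()` / `mutate()` idiom used in the chunk above arrives at those proportions; the values are made up and only the column names follow the notebook.

```r
library(dplyr)

# Hypothetical toy editions; values are made up for illustration
toy <- tibble(
  publication_year = c(1600, 1600, 1600, 1600, 1601, 1601),
  type = c("Book", "Book", "Book", "Pamphlet", "Book", "Book"),
  in_eebo = c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE)
)

coverage <- toy %>%
  group_by(publication_year, type, in_eebo) %>%
  tally() %>%                                 # drops the last grouping level (in_eebo)
  mutate(prop = n / sum(n), tn = sum(n)) %>%  # so sums run within each year and type
  filter(in_eebo) %>%                         # keep only the covered share
  ungroup()
```

One consequence of this design is that a year and type with no EEBO representation at all has no `in_eebo == TRUE` row, so it disappears from the plot instead of showing up as 0%.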

## Coverage of different publication types in EEBO-TCP

```{r}
library(ggbeeswarm)
bind_rows(
  df %>% mutate(group = "Editions"),
  df %>% group_by(work_id,type,first_publication_year,publication_year) %>%
    summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),.groups="drop") %>%
    mutate(group = "Works",edition_type=if_else(publication_year==first_publication_year,"First year work","Later work"))
) %>%
  mutate(edition_type=fct_relevel(edition_type,"Singular","First year work","Later work","First year edition","Later edition")) %>%
  filter(type %in% c("Book","Pamphlet")) %>%
  group_by(publication_year, edition_type, group, type, in_eebo_tcp) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo_tcp) %>%
  ggplot(aes(x = type, y = prop, group = edition_type, color = edition_type)) +
  geom_quasirandom(aes(size = tn), dodge = 1.0) +
  stat_summary(aes(group = edition_type), position = position_dodge(width = 1.0), fun = median, fun.min = median, fun.max = median, geom = "crossbar", width = 0.5, color = "red") +
  theme_hsci_discrete() +
  xlab(NULL) +
  ylab("EEBO-TCP coverage") +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
  scale_size(breaks = c(250, 500, 1500), range = c(0.1, 8.0)) +
  theme(legend.justification = c(0, 0), legend.position = c(0.02, 0.02), legend.background = element_blank(), legend.box.just = "bottom", legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = "Representation type", size = "Count") +
  guides(shape = "none")
```

For coverage in EEBO-TCP, a clear pattern emerges where coverage of singular and first editions is much better than coverage of later editions. This has an important bearing on all following analyses, which in the case of EEBO-TCP should mostly evaluate coverage on the work level. As a separate observation, coverage of books and pamphlets also seems quite even.

# Edition-level temporal overview

```{r,fig.width = 6, fig.height = 3}
df %>%  mutate(g = case_when(
  !certain ~ "Uncertain dating",
  in_eebo_tcp  ~ "In EEBO-TCP",
  in_eebo ~ "In EEBO",
  TRUE ~ "ESTC total",
)) %>%
  ggplot(aes(x = publication_year, fill = fct_relevel(g, "Uncertain dating", "ESTC total", "In EEBO","In EEBO-TCP"))) +
  geom_bar(width = 1) +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(breaks = seq(0, 10000, by = 1000)) +
  xlab("Year") +
  ylab("ESTC entries") +
  theme(legend.justification = c(0, 1), legend.position = c(0.05, 0.95), legend.background = element_blank(), legend.key = element_blank()) +
  labs(fill = NULL) +
  guides(fill = guide_legend(reverse = TRUE))
```
In terms of a temporal overview, it is important to note that in this absolute graph the number of entries grows significantly over time, with large variations and multiple spikes between 1640 and 1700 (the larger bump between 1640 and 1660 most likely consists mainly of the Thomason Tracts).

```{r,fig.width = 6, fig.height = 3}
df %>% filter(certain) %>% mutate(g = case_when(
  in_eebo_tcp  ~ "In EEBO-TCP",
  in_eebo ~ "In EEBO",
  TRUE ~ "Not in EEBO",
)) %>%
  ggplot(aes(x = publication_year, fill = fct_relevel(g, "Not in EEBO", "In EEBO","In EEBO-TCP"))) +
  geom_bar(width = 1,position='fill') +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(breaks = seq(0, 1, by = 0.1),labels=scales::percent_format(accuracy=1)) +
  xlab("Year") +
  ylab("Proportion of ESTC entries") +
  theme(legend.justification = c(1, 0), legend.position = c(0.94, 0.08), legend.key = element_blank()) +
  labs(fill = NULL) +
  guides(fill = guide_legend(reverse = TRUE))
```
In terms of edition-level proportional coverage, EEBO coverage is quite balanced throughout the period, with just a slight drop at the end of the 17th century. For EEBO-TCP, edition-level coverage is much more varied, but as noted, it does not make much sense to evaluate EEBO-TCP at the edition level.

# Work-level temporal overview

```{r}
df %>% 
  filter(first_publication_year>1474) %>%
  group_by(work_id,first_publication_year) %>%
  summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>% 
  mutate(g = case_when(
    !certain ~ "Uncertain dating",
    in_eebo_tcp  ~ "In EEBO-TCP",
    in_eebo ~ "In EEBO",
    TRUE ~ "ESTC total",
  )) %>% 
  ggplot(aes(x = first_publication_year, fill = fct_relevel(g, "Uncertain dating", "ESTC total", "In EEBO","In EEBO-TCP"))) +
  geom_bar(width = 1) +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(breaks = seq(0, 10000, by = 1000)) +
  xlab("Year of first publication") +
  ylab("ESTC works") +
  theme(legend.justification = c(0, 1), legend.position = c(0.05, 0.95), legend.background = element_blank(), legend.key = element_blank()) +
  labs(fill = NULL) +
  guides(fill = guide_legend(reverse = TRUE))
```


```{r}
df %>% 
  filter(first_publication_year>1474) %>%
  group_by(work_id,first_publication_year) %>%
  summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),.groups="drop") %>%
  mutate(g = case_when(
    in_eebo_tcp  ~ "In EEBO-TCP",
    in_eebo ~ "In EEBO",
    TRUE ~ "Not in EEBO",
  )) %>%
  ggplot(aes(x = first_publication_year, fill = fct_relevel(g, "Not in EEBO", "In EEBO","In EEBO-TCP"))) +
  geom_bar(width = 1,position='fill') +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(breaks = seq(0, 1, by = 0.1),labels=scales::percent_format(accuracy=1)) +
  xlab("Year of first publication") +
  ylab("Proportion of ESTC works") +
  theme(legend.justification = c(1, 0), legend.position = c(0.94, 0.08), legend.key = element_blank()) +
  labs(fill = NULL) +
  guides(fill = guide_legend(reverse = TRUE))  
```
In terms of work-level coverage, EEBO-TCP also appears quite nicely balanced temporally, apart from dips between 1500 and 1530. However, it must be noted that the total amount of content is also very low for those early years, so larger variation is to be expected.
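
To illustrate why low yearly totals make the proportions noisier, here is a small sketch using base R's `binom.test()`. The counts in `toy_years` are hypothetical and chosen only so that both years have the same observed coverage. The interval shown is the exact (Clopper-Pearson) interval, which is one possible choice and not something computed elsewhere in this analysis.

```r
# Hypothetical yearly counts (invented for illustration): number of works first
# published in a year, and how many of those are in EEBO-TCP.
toy_years <- data.frame(
  year    = c(1510, 1690),
  n_works = c(20, 2000),
  n_tcp   = c(10, 1000)
)

# Exact (Clopper-Pearson) 95% interval for the covered proportion in each year.
ci <- t(mapply(
  function(x, n) binom.test(x, n)$conf.int,
  toy_years$n_tcp, toy_years$n_works
))
toy_years$lower <- ci[, 1]
toy_years$upper <- ci[, 2]
toy_years$width <- toy_years$upper - toy_years$lower

toy_years
```

With the same observed coverage of 50%, the interval for the year with 20 works is far wider than for the year with 2,000 works, so dips in the early decades should be read with that in mind.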

# Document type coverage through time

```{r}
bind_rows(
  df %>% mutate(group = "Editions",type=recode(type,"Book"="Book (edition-level)","Pamphlet"="Pamphlet (edition-level)")),
  df %>% group_by(work_id,type,first_publication_year,publication_year) %>%
    summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
    mutate(group = "Works")
) %>%
  mutate(type=fct_relevel(type,"Pamphlet (edition-level)","Book (edition-level)","Pamphlet","Book")) %>%
  filter(certain) %>% 
  filter(!is.na(type),type!="In-between") %>% 
  group_by(publication_year, type, in_eebo) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo) %>%
  ggplot(aes(x = publication_year, y = prop, color = type)) +
  geom_smooth(aes(weight = n, fill = type), span = 0.3, method='loess',formula=y~x) +
  geom_point(color = "gray", shape = 21, aes(size = tn)) +
  geom_point(aes(size = n)) +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
  xlab("Year") +
  ylab("EEBO coverage") +
  theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
  scale_size(breaks = c(500, 2000, 3500), range = c(0.1, 8.0))
```
Drilling in and separating books and pamphlets from each other, we can see that EEBO coverage of both is very good, apart from a noticeable drop in pamphlet coverage in the late 17th century.

```{r}
bind_rows(
  df %>% mutate(group = "Editions",type=recode(type,"Book"="Book (edition-level)","Pamphlet"="Pamphlet (edition-level)")),
  df %>% group_by(work_id,type,first_publication_year,publication_year) %>%
    summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
    mutate(group = "Works")
) %>%
  mutate(type=fct_relevel(type,"Pamphlet (edition-level)","Book (edition-level)","Pamphlet","Book")) %>%
  filter(certain) %>% 
  filter(!is.na(type),type!="In-between") %>% 
  group_by(publication_year, type, in_eebo_tcp) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo_tcp) %>%
  ggplot(aes(x = publication_year, y = prop, color = type)) +
  geom_smooth(aes(weight = n, fill = type), span = 0.3, method='loess',formula=y~x) +
  geom_point(color = "gray", shape = 21, aes(size = tn)) +
  geom_point(aes(size = n)) +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
  xlab("Year") +
  ylab("EEBO-TCP coverage") +
  theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
  scale_size(breaks = c(500, 2000, 3500), range = c(0.1, 8.0))
```
For EEBO-TCP, on the work level, the same drop in coverage for pamphlets at the end of the 17th century can be seen, but otherwise coverage is relatively stable through time for both books and pamphlets, except for a marked uptick between 1640 and 1660 (most likely caused by more judicious inclusion of the Thomason Tracts). On the work level, pamphlets are just slightly better covered than books, but on the edition level, coverage of books is much lower than that of pamphlets. This can be seen as the natural consequence of EEBO-TCP favouring the inclusion of first editions only. Books typically have more editions than pamphlets, so excluding later editions affects edition-level coverage for books much more than it does for pamphlets.
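
The effect of the number of editions on edition-level coverage can be sketched with a toy example. The table below is hypothetical: one book with four editions and two single-edition pamphlets, under the assumption that only the first edition of each work has been transcribed.

```r
library(dplyr)

# Hypothetical toy data: a book with four editions and two single-edition
# pamphlets; only the first edition of each work is assumed to be in EEBO-TCP.
toy <- tibble(
  work_id     = c("book1", "book1", "book1", "book1", "pamphlet1", "pamphlet2"),
  type        = c("Book", "Book", "Book", "Book", "Pamphlet", "Pamphlet"),
  in_eebo_tcp = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE)
)

# Edition level: share of editions that are in EEBO-TCP.
edition_level <- toy %>%
  group_by(type) %>%
  summarize(coverage = mean(in_eebo_tcp), .groups = "drop")

# Work level: a work counts as covered if any of its editions is covered.
work_level <- toy %>%
  group_by(type, work_id) %>%
  summarize(in_eebo_tcp = any(in_eebo_tcp), .groups = "drop") %>%
  group_by(type) %>%
  summarize(coverage = mean(in_eebo_tcp), .groups = "drop")
```

In this toy case every work is covered on the work level, while edition-level coverage for books falls to 25% purely because of the untranscribed later editions.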

# Topical coverage EEBO-TCP vs EEBO

## EEBO work-level genre use frequencies

(This uses the subset of EEBO that can be matched to the ESTC, as the work information comes from there.)

```{r}
eebo_ustc_genres %>% 
  inner_join(eebo_core,by=c("eebo_id")) %>%
  inner_join(estc_core,by=c("estc_id")) %>% 
  group_by(ustc_genre,in_eebo_tcp) %>%
  summarize(n=n_distinct(work_id),.groups="drop") %>%
  group_by(ustc_genre) %>%
  mutate(tn=sum(n)) %>%
  ungroup() %>%
  mutate(ustc_genre=fct_reorder(ustc_genre,tn)) %>%
  ggplot(aes(x=ustc_genre,y=n,fill=in_eebo_tcp)) + 
  geom_col(show.legend=F) + 
  xlab("USTC genre") +
  ylab("Number of works") +
  scale_y_continuous(labels=scales::number) +
  theme_hsci_discrete() +
  coord_flip()
```

Open question: are the USTC categories usable? Is this a believable genre distribution? If it is, the graphs below show interesting differences and temporal shifts in the coverage of the various categories, the interpretation of which I leave up to you.

## EEBO-TCP work-level genre coverage

```{r}
eebo_ustc_genres %>% 
    inner_join(eebo_core,by=c("eebo_id")) %>%
    inner_join(estc_core,by=c("estc_id")) %>%
  group_by(work_id,ustc_genre) %>%
  summarize(in_eebo_tcp=any(!is.na(eebo_tcp_id)),.groups="drop") %>%
  count(ustc_genre,in_eebo_tcp) %>% 
  group_by(ustc_genre) %>%
  mutate(prop=n/sum(n)) %>%
  ungroup() %>%
  filter(in_eebo_tcp) %>%
  mutate(ustc_genre=fct_reorder(ustc_genre,prop)) %>%
  ggplot(aes(x=ustc_genre,y=prop)) + 
  geom_col() + 
  theme_hsci_discrete() +
  scale_y_continuous(labels=scales::percent_format(accuracy=1)) +
  xlab("USTC genre") +
  ylab("Coverage in EEBO-TCP by work") +
  coord_flip() 
```

## Genre coverage through time

```{r,fig.width=12}
eebo_core %>% 
  inner_join(df,by=c("estc_id")) %>%
  filter(first_publication_year>1474) %>%
  group_by(work_id,first_publication_year) %>%
  summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
  filter(certain) %>% 
  left_join(eebo_ustc_genres %>% 
              inner_join(eebo_core,by=c("eebo_id")) %>%
              inner_join(estc_core,by=c("estc_id")) %>%
              distinct(work_id,ustc_genre) %>%
              mutate(ustc_genre=fct_lump_n(ustc_genre,10)),
            by=c("work_id")
            ) %>%
  mutate(ustc_genre=fct_explicit_na(ustc_genre,"Unknown")) %>%
  group_by(first_publication_year, ustc_genre, in_eebo_tcp) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo_tcp) %>%
  ggplot(aes(x = first_publication_year, y = prop, color = ustc_genre)) +
  geom_point(color = "gray", shape = 21, aes(size = tn)) +
  geom_point(aes(size = n)) +
  geom_smooth(aes(weight = n, fill = ustc_genre), span = 0.3, method='loess',formula=y~x) +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
  xlab("Year") +
  ylab("EEBO-TCP coverage") +
  theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
  coord_cartesian(ylim=c(0,1)) +
  scale_size(breaks = c(100, 500, 1000), range = c(0.1, 8.0)) +
  guides(color="none",fill="none") +
  facet_wrap(~ustc_genre)
```

# Topical coverage of EEBO vs ESTC through time

Here, we are projecting subject category information from EEBO/ECCO throughout the whole of the ESTC in order to compare their coverage. For the 18th century and ECCO, this seemed to work relatively well for all 8 categories. For USTC/EEBO, I was comfortable including only the Religious, History and chronicles, and Economics categories.
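
As a rough sketch of how directly known and projected categories are combined in the chunks of this section, consider the following toy example. The tables `direct` and `projected` and their values are hypothetical. The reading of `max_prop` as a confidence-like score for the projected category is an assumption made for illustration, as it is not defined in this section.

```r
library(dplyr)

# Hypothetical toy tables; column names follow the chunks in this section.
direct <- tibble(
  work_id               = c("w1", "w2"),
  projected_ecco_module = c("History and Geography", "Law")
)
projected <- tibble(
  work_id               = c("w2", "w3", "w4"),
  projected_ecco_module = c("Religion and Philosophy", "Religion and Philosophy", "Law"),
  max_prop              = c(0.9, 0.8, 0.5)  # assumed: confidence of the projection
)

combined <- direct %>%
  bind_rows(
    projected %>%
      filter(max_prop >= 0.7) %>%          # keep only confident projections
      select(-max_prop) %>%
      anti_join(direct, by = "work_id")    # never override a directly known label
  )
```

Here `w2` keeps its directly known category, `w3` receives a projected one, and `w4` stays unlabelled because its projection falls below the 0.7 threshold. Unlabelled works end up in the "Other/Unknown" group in the plots.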

## Using projected ECCO modules

```{r,include=F}
pak::pkg_install("COMHIS/eccor")
library(eccor)
ecco_core <- load_ecco_core()
combined_projected_ecco_modules <- ecco_core %>% 
  inner_join(estc_core,by=c("estc_id")) %>%
  distinct(work_id,projected_ecco_module=ecco_module)

combined_projected_ecco_modules <- combined_projected_ecco_modules %>%
  bind_rows(estc_projected_ecco_modules %>% 
    filter(max_prop>=0.7) %>%
      select(-max_prop) %>%
      anti_join(combined_projected_ecco_modules,by=c("work_id")))
```

```{r,fig.width=12}
df %>% 
  filter(first_publication_year>1474) %>%
  group_by(work_id,first_publication_year) %>%
  summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
  filter(certain) %>% 
  left_join(combined_projected_ecco_modules,by=c("work_id")) %>%
  replace_na(list(projected_ecco_module="Other/Unknown")) %>%
  mutate(projected_ecco_module=fct_relevel(projected_ecco_module,"Other/Unknown")) %>%
  group_by(first_publication_year, projected_ecco_module, in_eebo) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo) %>%
  ggplot(aes(x = first_publication_year, y = prop, color = projected_ecco_module)) +
  geom_point(color = "gray", shape = 21, aes(size = tn)) +
  geom_point(aes(size = n)) +
  geom_smooth(aes(weight = n, fill = projected_ecco_module), span = 0.3, method='loess',formula=y~x) +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
  xlab("Year") +
  ylab("EEBO coverage") +
  theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
  coord_cartesian(ylim=c(0.5,1)) +  
  scale_size(breaks = c(100, 500, 1000), range = c(0.1, 8.0)) + 
  guides(color="none",fill="none") +
  facet_wrap(~projected_ecco_module)
```

## Using projected USTC Religious/History and chronicles/Economics

```{r,include=F}
pak::pkg_install("COMHIS/eebor")
library(eebor)
eebo_core <- load_eebo_core()
eebo_ustc_genres <- load_eebo_ustc_genres()
combined_projected_ustc_genres <- eebo_ustc_genres %>%
  inner_join(eebo_core,by=c("eebo_id")) %>%
  inner_join(estc_core,by=c("estc_id")) %>%
  distinct(work_id,projected_ustc_genre=ustc_genre)

combined_projected_ustc_genres <- combined_projected_ustc_genres %>%
  bind_rows(estc_projected_ustc_genres %>% 
    filter(max_prop>=0.7) %>%
      select(-max_prop) %>%
      anti_join(combined_projected_ustc_genres,by=c("work_id")))
```

```{r}
df %>% 
  filter(first_publication_year>1474) %>%
  group_by(work_id,first_publication_year) %>%
  summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
  filter(certain) %>% 
  left_join(estc_projected_ustc_genres %>% 
  filter(max_prop>=0.7,projected_ustc_genre %in% c("Religious","History and chronicles","Economics")),by=c("work_id")) %>%
  replace_na(list(projected_ustc_genre="Other/Unknown")) %>%
  mutate(projected_ustc_genre=fct_relevel(projected_ustc_genre,"Other/Unknown")) %>%
  group_by(first_publication_year, projected_ustc_genre, in_eebo) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo) %>%
  ggplot(aes(x = first_publication_year, y = prop, color = projected_ustc_genre)) +
  geom_point(color = "gray", shape = 21, aes(size = tn)) +
  geom_point(aes(size = n)) +
  geom_smooth(aes(weight = n, fill = projected_ustc_genre), span = 0.3, method='loess',formula=y~x) +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
  xlab("Year") +
  ylab("EEBO coverage") +
  theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
  coord_cartesian(ylim=c(0,1)) +
  scale_size(breaks = c(100, 500, 1000), range = c(0.1, 8.0))
```

# Topical coverage of EEBO-TCP vs ESTC through time

## Using projected ECCO modules

```{r,fig.width=12}
df %>% 
  filter(first_publication_year>1474) %>%
  group_by(work_id,first_publication_year) %>%
  summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
  filter(certain) %>% 
  left_join(combined_projected_ecco_modules,by=c("work_id")) %>%
  replace_na(list(projected_ecco_module="Other/Unknown")) %>%
  mutate(projected_ecco_module=fct_relevel(projected_ecco_module,"Other/Unknown")) %>%
  group_by(first_publication_year, projected_ecco_module, in_eebo_tcp) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo_tcp) %>%
  ggplot(aes(x = first_publication_year, y = prop, color = projected_ecco_module)) +
  geom_point(color = "gray", shape = 21, aes(size = tn)) +
  geom_point(aes(size = n)) +
  geom_smooth(aes(weight = n, fill = projected_ecco_module), span = 0.3, method='loess',formula=y~x) +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
  xlab("Year") +
  ylab("EEBO-TCP coverage") +
  theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
  coord_cartesian(ylim=c(0,1)) +
  scale_size(breaks = c(100, 500, 1000), range = c(0.1, 8.0)) +
  guides(color="none",fill="none") +
  facet_wrap(~projected_ecco_module)
```

## Using projected USTC Religious/History and chronicles/Economics

```{r}
df %>% 
  filter(first_publication_year>1474) %>%
  group_by(work_id,first_publication_year) %>%
  summarize(in_eebo=any(in_eebo),in_eebo_tcp=any(in_eebo_tcp),certain=any(first_year_publication & certain),.groups="drop") %>%
  filter(certain) %>% 
  left_join(combined_projected_ustc_genres %>% 
  filter(projected_ustc_genre %in% c("Religious","History and chronicles","Economics")),by=c("work_id")) %>%
  replace_na(list(projected_ustc_genre="Other/Unknown")) %>%
  mutate(projected_ustc_genre=fct_relevel(projected_ustc_genre,"Other/Unknown")) %>%
  group_by(first_publication_year, projected_ustc_genre, in_eebo_tcp) %>% 
  tally() %>% 
  mutate(prop = n / sum(n), tn = sum(n)) %>% 
  filter(in_eebo_tcp) %>%
  ggplot(aes(x = first_publication_year, y = prop, color = projected_ustc_genre)) +
  geom_point(color = "gray", shape = 21, aes(size = tn)) +
  geom_point(aes(size = n)) +
  geom_smooth(aes(weight = n, fill = projected_ustc_genre), span = 0.3, method='loess',formula=y~x) +
  theme_hsci_discrete() +
  scale_x_continuous(breaks = seq(1000, 2000, by = 20)) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), breaks = seq(0, 1, by = 0.05)) +
  xlab("Year") +
  ylab("EEBO-TCP coverage") +
  theme(legend.justification = c(0, 0), legend.box.just = "bottom", legend.position = c(0.05, 0.02), legend.background = element_blank(), legend.key = element_blank(), legend.box = "horizontal") +
  labs(color = NULL, size = NULL, shape = NULL, fill = NULL) +
  coord_cartesian(ylim=c(0,1)) +
  scale_size(breaks = c(100, 500, 1000), range = c(0.1, 8.0))
```
(Compare this with the raw EEBO-TCP vs EEBO coverage graphs as well as with the ECCO module coverage graphs.)