Aim

Illustrate the number of human genomes sequenced in major genomics projects since 2003. This plot was part of my Data Science in Context presentation.

Data

Data were obtained from https://www.yourgenome.org/theme/timeline-history-of-genomics/ and allofus.nih.gov/about/program-overview/what-makes-all-us-different.

Since there were only a few data points, I created a dataframe manually.

projects <- data.frame(
  year = c(2003, 2015, 2018, 2028),
  n_genomes = c(0, 3, 5, 6)
)

Scatterplot

ggplot(projects, aes(x = year, y = n_genomes)) +
  geom_point(color = "darkblue", size = 3) +
  annotate("text", x = 2008, y = 0, label = "Human Genome Project") +
  annotate("text", x = 2020, y = 3, label = "1000 Genomes Project") + 
  annotate("text", x = 2012, y = 5, label = "100K Genomes Project (UK)") +
  annotate("text", x = 2028, y = 6.5, label = "All of Us", fontface = 2) +  
  geom_smooth(method = 'lm', se = FALSE, color = "lightgray", linetype = "dashed") +
  xlim(2000, 2030) + 
  scale_y_continuous(breaks = seq(0, 6)) +
  xlab("Completion date (year)") + 
  ylab(bquote(bold(log[10](genomes)))) +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold"),
    axis.text = element_text(size = 12),
    axis.title = element_text(face = "bold"),
    axis.line = element_line(color = "black"),
    panel.background = element_blank()
  )