Abstract

The participants of the 1000 Genomes Project, totalling 1,074 in number, had their whole genomes sequenced. Observations of their mtDNA were compared to a database created by scraping the most recent build of mtDNA phylogenetic tree from PhyloTree. While the haplogroup N was the most populous of the mtDNA haplogroups for the participants, no single haplogroup made the 50% threshold. When all haplogroups and subclades deriving from N were observed, nearly 60% of participants were accounted for.

Objective

To determine whether half or more of participants in the 1000 Genomes Project have a single mtDNA haplogroup (L0, L1, L2, L3, L4, L5, L6, M, C, Z, D, E, G, Q, N, O, A, S, R, I, W, X, Y, B, F, P, U, HV, K, H, V, J, T).

Terms

mtDNA: Mitochondrial DNA. This is the DNA that is passed down from mother to child, going all the way back through time.
Haplogroup: a group of individuals who share common patrilineal or matrilineal ancestor.
Subclade: sub-haplogroups to further differentiate groups of people.
SNP: Single Nucleotide Polymorphism. A position on a chromosome that has a particular allele in that position.
Allele: any of the alternative forms of a gene that may occur at a given locus.^[1] In the case of human DNA, typically either an A, C, G, or T; each of these is a nucleotide base
- Adenine
- Cytosine
- Guanine
- Thymine

Methods

mtDNA

Every human born (as of writing this) has an mtDNA haplogroup they belong to, passed to them from their biological mother, and passed to their biological mother from her biological mother, passing down through the ages to thousands of years ago. The basic layout of mtDNA haplogroups is explained in the chart below, with each of the haplogroups being displayed.

What is not displayed here is the subclades, which are basically sub-haplogroups. For example, H3 is a subclade of the haplogroup H.

Each haplogroup and, by extension, each subclade, has SNPs that set it apart from each other. Subclades inherit their SNPs from the subclades and haplogroups they derive from; so, too, is true for haplogroups.

To further explore this, information was scraped from the full build of the phylogenetic tree of human mitochondrial DNA. For a full explanation of how this was done, please visit this site. Due to run times, the data was loaded directly from that project into this one.

load(url("https://github.com/gabartomeo/data607-cunysps/blob/master/Final%20Project/Data/mtdna.RData?raw=true"))

This data, brought in, needed only a couple of touch ups to be discussed. First, all of the alleles to be capitalized.

mtdna_df[] <- lapply(mtdna_df, gsub, pattern='^([acgt])$', replacement='\\U\\1', perl=T)

Second, all instances of “CRS” and “rRS” that have gotten into the data frame should be removed.

mtdna_df <- anti_join(mtdna_df, filter(mtdna_df, mutation=="CRS"))

Third and lastly, the data is ungrouped.

mtdna_df <- ungroup(mtdna_df)

Theoretically, every SNP should have the same allele in a subclade that it does in the haplogroup or subclade it derives from. To test this, an exploration of the mtDNA haplogroup E can be performed.

e_df <- data.frame(
  haplogroup = rep("0", length(unique(grep("^E", mtdna_haplogroups, value=TRUE)))),
  E = as.double(rep(rep(0, length(unique(grep("^E", mtdna_haplogroups, value=TRUE))))))
)

e_df$haplogroup <- unique(grep("^E", mtdna_haplogroups, value=TRUE))
e_df$E <- lapply(e_df$haplogroup, function(x) nrow(intersect(select(filter(mtdna_df, haplogroup==x), POS, mutation), select(filter(mtdna_df, haplogroup=="E"), POS, mutation)))/nrow(filter(mtdna_df, haplogroup=="E")))
e_df$E <- unlist(e_df$E)*100
ggplot(e_df, aes(x=haplogroup, y=E, fill=haplogroup)) + geom_bar(stat="identity") + theme(text = element_text(size=20), axis.text.x = element_text(angle = 90, hjust = 1)) + labs(y="% Common with Haplogroup E", x="Haplogroups", fill="Haplogroups")

knitr::kable(e_df, align="c", col.names=c("Haplogroup", "Compared to Haplogroup E"), digits=2)

Haplogroup	Compared to Haplogroup E
E	100.00
E1	100.00
E1a	100.00
E1a1	100.00
E1a1a	100.00
E1a1a1	100.00
E1a1a1a	100.00
E1a1a1b	100.00
E1a1a1b1	100.00
E1a1a1b2	100.00
E1a1a1c	100.00
E1a1b	100.00
E1a1b1	100.00
E1a1b2	100.00
E1a1b3	100.00
E1a1b4	100.00
E1a1c	100.00
E1a2	97.78
E1a2a	97.78
E1a2a1	97.78
E1a2a2	97.78
E1a2a3	97.78
E1a2a4	97.78
E2	97.78
E2a	97.78
E2a1	97.78
E2a1a	97.78
E2a2	97.78
E2b	97.78
E2b1	97.78
E2b2	97.78

As can be observed, there is no guarantee of there being a one-to-one inheritance of SNPs in a subclade of a haplogroup. This can in turn be extended to the top-level haplogroups, as can be seen with the haplogroup N and the haplogroups that derive from it.

n_df <- data.frame(
  haplogroup = rep("0", length(unique(grep("^[NOASRIWXYBFHVJTPUK]$", mtdna_haplogroups, value=TRUE)))),
  N = as.double(rep(rep(0, length(unique(grep("^[NOASRIWXYBFHVJTPUK]$", mtdna_haplogroups, value=TRUE))))))
)

n_df$haplogroup <- unique(grep("^[NOASRIWXYBFHVJTPUK]$", mtdna_haplogroups, value=TRUE))
n_df$N <- lapply(n_df$haplogroup, function(x) nrow(intersect(select(filter(mtdna_df, haplogroup==x), POS, mutation), select(filter(mtdna_df, haplogroup=="N"), POS, mutation)))/nrow(filter(mtdna_df, haplogroup=="N")))
n_df$N <- unlist(n_df$N)*100
ggplot(n_df, aes(x=haplogroup, y=N, fill=haplogroup)) + geom_bar(stat="identity") + theme(text = element_text(size=20)) + labs(y="% Common with Haplogroup N", x="Haplogroups", fill="Haplogroups")

knitr::kable(n_df, align="c", col.names=c("Haplogroup", "Compared to Haplogroup N"), digits=2)

Haplogroup	Compared to Haplogroup N
N	100.00
I	94.74
W	97.37
Y	97.37
A	100.00
O	100.00
S	100.00
X	94.74
R	100.00
V	100.00
H	100.00
J	97.37
T	100.00
F	100.00
P	100.00
U	100.00
K	97.37

There is a great variety in haplogroups and subclades, and the inheritance of SNPs for each is not guaranteed.

Participants

The 1,074 individuals tested by the 1000 Genomes Project had their data brought into the workspace via the vcfR package.

vcf <- read.vcfR(dna_location, verbose = FALSE )
vcf <- vcfR2tidy(vcf)

Three sets of data were within the vcf file, but only one was required for analysis, the one called gt. This data was assigned its own variable with only the columns needed taken - Indiv, POS, and gt_GT_alleles - with the last being renamed to “mutation”. This refers to how SNPs may be of the reference human model, or maybe a different allele at a given position.

indiv_snps <- select(vcf$gt, Indiv, POS, gt_GT_alleles)
names(indiv_snps) <- c(names(indiv_snps)[1:2], "mutation")
indiv_snps$POS <- as.character(indiv_snps$POS)

After that, another data frame was created using the ids of the individuals and columns for each and every single mtDNA haplogroup.

indiv_haplos <- distinct(indiv_snps, Indiv)
indiv_haplos[, mtdna_haplogroups] <- NA

load(url("https://github.com/gabartomeo/data607-cunysps/blob/master/Final%20Project/Data/indivhaplos.RData?raw=true"))

Next comes loading in the RData version of this data frame because the code below, which will produce the same dataset, takes four hours to run on a multicore processor with sixteen gigabytes of memory. It compares each participant’s SNPs to the SNPs required for each haplogroup, and then gives as a percentage how many SNPs the individual has for a given haplogroup.

# Not suggested to run this unless you have four hours to spare
for (each_person in indiv_haplos$Indiv){
  person_df <- filter(indiv_snps, Indiv==each_person)
  person_df <- select(person_df, POS, mutation)
  person_num <- grep(paste("^", each_person, "$", sep=""), indiv_haplos$Indiv)
  for (each_haplo in mtdna_haplogroups){
    haplo_df <- filter(mtdna_df, haplogroup==each_haplo)
    haplo_df <- select(haplo_df, POS, mutation)
    haplo_prob <- nrow(intersect(person_df, haplo_df))/nrow(haplo_df)
    indiv_haplos[[each_haplo]][[person_num]] <- haplo_prob
  }
}

From there, their likely haplogroup was chosen by picking out the column with the max value out of all of the haplogroup columns.

indiv_haplos$haplogroup <- colnames(indiv_haplos[,2:5185])[max.col(indiv_haplos[,2:5185],ties.method="first")]

select(indiv_haplos, Indiv, haplogroup)

Results

sum_haplos <- select(indiv_haplos, haplogroup)
sum_haplos$haplogroup <- gsub("^((L[0-6])|(HV)|[MCZDEGQNOASRIWXYBFPUKHVJT]).*", "\\1", sum_haplos$haplogroup, perl=T)
ggplot(arrange(sum_haplos, haplogroup), aes(haplogroup, fill=haplogroup)) + geom_histogram(stat="count") + theme(text = element_text(size=20), legend.position = "none") + labs(y="Frequency", x="Haplogroups", fill="Haplogroups")

It may be surprising to see so many represented by N, but consider how many haplogroups derive from N - O, A, S, R, I, W, X, Y, B, F, P, U, K, HV, H, V, J, T - and it becomes less surprising.

If instead the numbers of those who are from N or a subclade or haplogroup derived from N are combined, a clearer picture begins to form about the participants.

sum_haplos$haplogroup <- gsub("^((HV)|[NOASRIWXYBFPUHVJTK]).*", "N Derived", sum_haplos$haplogroup)
ggplot(arrange(sum_haplos, haplogroup), aes(haplogroup, fill=haplogroup)) + geom_histogram(stat="count") + theme(text = element_text(size=20), legend.position = "none") + labs(y="Frequency", x="Haplogroups", fill="Haplogroups")

A significant amount of the participants are in the N derived haplogroups. If this is switched instead to percentages it becomes obvious that more than half of the participants are from an N derived haplogroup.

sum_haplos <- as.data.frame(table(select(sum_haplos, haplogroup)))
names(sum_haplos) <- c("haplogroup", "Percent")
sum_haplos$Percent <- (sum_haplos$Percent/1074)*100
ggplot(sum_haplos, aes(x=haplogroup, y=Percent, fill=haplogroup)) + geom_bar(stat="identity") + theme(text = element_text(size=20), legend.position = "none") + labs(y="Percent", x="Haplogroups")

Conclusion

Based on the analysis of the data, the haplogroup N and its derivatives, when combined, passed the 50% threshold and nearly hit 60%. However, none of the 33 individual mtDNA haplogroups met the hypothesized threshold of 50%.

Sources and References

“ALL.chrMT.phase1_samtools_si.20101123.snps.low_coverage.genotypes.vcf.gz”. 1000 Genomes Project. Web. 29 Apr 2018.
“mtDNA tree Build 17”. PhyloTree. Web. 29 Ap 2018.

“Allele.” Merriam-Webster.com. Merriam-Webster, n.d. Web. 29 Apr 2018.
Phylogenetic tree of human mitochondrial DNA (mtDNA) haplogroups: Human mitochondrial DNA haplogroup on Wikipedia. Accessed 29 Apr 2018.

Final Project DATA 607

Gabrielle Bartomeo

April 29, 2018