The Project
Access to the 1000 Genomes Project provides a host of information on various types of DNA. With Mother’s Day coming up and my interest in DNA, I can’t help but be curious which mtDNA haplogroups are most common among the people tested in the 1000 Genomes Project. My hypothesis is that more than half of all testers will share a single mtDNA haplogroup from among the following: A, B, C, D, E, F, G, H, I, J, K, L1, L2, L4, L5, L6, M, N, O, P, Q, R, S, T, U, V, W, X, Y, Z. DNA has been an active and important part of my life for several years now, even though I’m not a professional in the field; my first DNA test was an mtDNA test, which makes this all the more interesting to me.
Workflow
Packages: tidyr, dplyr, vcfR, ggplot2
- Web-scrape mtDNA tree Build 17 from PhyloTree
- Get the positions and alleles required for each mtDNA haplogroup and subclade
- Create a data frame with all of the mtDNA haplogroups and subclades and the positions and alleles required to qualify for those haplogroups and subclades
- Use the file from 1000 Genomes Project named “ALL.chrMT.phase1_samtools_si.20101123.snps.low_coverage.genotypes.vcf.gz” to gather mtDNA from around 1000 different individuals
- Read the *.VCF file into R using vcfR
- Convert the VCF data into a tibble
- Create a function that runs the input data frame or tibble against the haplogroup data frame and outputs the best-matching haplogroup for each individual, along with any mismatches
- Plot the resulting data frame (for example, as histograms) to show how the individuals’ haplogroups compare
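The matching step in the workflow above can be sketched roughly as follows. This is only a minimal illustration in base R, not the final implementation: the tables `haplo_defs` and `calls`, the function name `assign_haplogroups`, and the samples S1 to S3 are all hypothetical, and the handful of defining positions is toy data standing in for the full PhyloTree Build 17 table. It also assumes the VCF has already been reshaped into a long table with one row per sample and position, and it scores haplogroups as flat groups rather than walking the tree.

```r
# Hypothetical toy definitions: one row per haplogroup-defining position.
# The real table would be built from PhyloTree Build 17.
haplo_defs <- data.frame(
  haplogroup = c("H", "H", "U", "U", "U"),
  pos        = c(2706, 7028, 11467, 12308, 12372),
  allele     = c("A", "C", "G", "G", "A"),
  stringsAsFactors = FALSE
)

# Hypothetical toy calls: one row per sample and position (long format).
calls <- data.frame(
  sample = c("S1", "S1", "S2", "S2", "S2", "S3", "S3", "S3"),
  pos    = c(2706, 7028, 11467, 12308, 12372, 2706, 7028, 11467),
  allele = c("A", "C", "G", "G", "A", "A", "T", "G"),
  stringsAsFactors = FALSE
)

# For each sample, pick the haplogroup with the highest share of
# matching defining alleles, and report how many positions mismatched.
assign_haplogroups <- function(calls, haplo_defs) {
  per_sample <- lapply(split(calls, calls$sample), function(s) {
    # Left join: keep every defining position, even if the sample has no call there
    m <- merge(haplo_defs, s[, c("pos", "allele")], by = "pos",
               all.x = TRUE, suffixes = c("_req", "_obs"))
    # A missing call counts as a mismatch in this sketch
    m$hit <- !is.na(m$allele_obs) & m$allele_obs == m$allele_req
    hits_by_group <- split(m$hit, m$haplogroup)
    score    <- sapply(hits_by_group, mean)    # share of positions matched
    matched  <- sapply(hits_by_group, sum)
    required <- sapply(hits_by_group, length)
    best <- names(score)[which.max(score)]     # ties go to the first group
    data.frame(sample = s$sample[1],
               haplogroup = best,
               matched = matched[[best]],
               mismatched = required[[best]] - matched[[best]],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, per_sample)
}

result <- assign_haplogroups(calls, haplo_defs)
print(result)

# Check the hypothesis on the toy data: does one haplogroup cover more than half?
shares    <- prop.table(table(result$haplogroup))
top_share <- max(shares)
print(top_share > 0.5)
```

The `result` data frame has one row per individual, so it could be passed to ggplot2 (for example with `geom_bar()`) for the plotting step. Treating a missing call as a mismatch is a simplifying choice here; with low-coverage data it may be better to track missing positions separately.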