As an undergraduate, I studied Molecular and Cell Biology with an emphasis in Genetics. I realized that although genetics students are taught how to use search tools such as NCBI BLAST, learning programming languages to deeply engage with analyzing genomic data was not a part of our curriculum. This may also tie in with the larger problem: a gap between those who understand biological theory and those who have the computational/statistical skills and tools to analyze biological data.
Also, as commerical sequencing technologies are becoming more popular and available (with companies such as 23andMe and AncestryDNA), it is important for biologists and everyday consumers to understand how conclusions are drawn from these statistical analyses, as well as how these statistical analyses are performed.
R is a powerful tool for managing, visualizing and analyzing scientific data, and is the first programming language I have been exposed to. The purpose of this project is to use the introductory R knowledge that I have gained from this class to understand how we can use R to analyze genomic data from commercial sequencing services, made available through AncestryDNA.
I used AncestryDNA earlier this year and obtained the raw data for my genome, in the form of a text file. The text file contains data for which nucleotide (A, T, C or G) I have for certain single nucleotide polymorphisms (SNPs) in the human genome. A secondary goal in this project is to utilize this raw data.
Another periphery goal of this project is to further use RMarkdown to present my thoughts and analyses through a combination of text, R code, and the output of R code (tables, visuals, etc.).
Bioconductor is an open-source resource that contains many packages for analyzing biological genomic data. However, as I explored its packages, I realized that the packages are catered towards analyzing data obtained from lab settings - such as data obtained from Affymetrix Array and Agilent RNA. Alternatively, the packages require a level of breaking down genomic data that I currently do not have yet.
On GitHub, I found a package called Genomology, made by the user “castedo”. I explored the 4 simple functions in the package, which is summarized as follows:
Functions | Descriptions |
---|---|
read.dna.web | Read AncestryDNA or 23andMe file |
set.genomology.loci | Set simple (non-overlapping) genome locus (genome) data to be used by functions in genomology package |
idescent.prob | Probability of identical-by-descent |
recombination.prob | Probability of recombination out of identical-by-descent |
I used the “read.dna.web” function and successfully imported my own AncestryDNA file. However, using the “idescent.prob” and “recombination.prob” functions require the raw AncestryDNA genomic data of my family members (e.g. mother, father, siblings) which I do not have. Therefore, I will bookmark this package for when my family obtains their sequencing data - but until then, I searched for another package/ tutorial.
Back to searching on GitHub, I found an AncestryDNA R Workshop from the Computational, Evolutionary and Human Genomics (CEHG) Symposium at Stanford University on March 2, 2016, from the user “kermany”. I think completing this workshop by myself serves well as the basis for Project 2 - as a novice in R, this workshop shows me how to use the packages lattice, latticeExtra, Hmisc, ggplot2, rworldmap,ggmap and mapplots, as well as exploring the free software called ADMIXTURE which allows for fast ancestry estimation.
The introduction of this workshop describes the value of the workshop succinctly:
To date, over 1.4 million people have taken the Ancestry DNA test. This amounts to a data set of 1.1 trillion genotypes across 700,000 SNPs. What are effective strategies and techniques for analyzing and understanding genomic data at this scale? In this short workshop, two computational biologists at Ancestry, Amir Kermany and Peter Carbonetto, demonstrate some simple approaches to investigating and visualizing large-scale genetic data sets, with an emphasis on practical skills that can be applied to research in human genetics.
To prepare for this 2-part workshop, I downloaded the software ADMIXTURE, git, PLINK and installed the lattice, latticeExtra, Hmisc, ggplot2, rworldmap,ggmap and mapplots packages using the install.packages function. I also downloaded the respoitory and extra data specific to this workshop.
The primary aim of this exercise is to learn how to visualize and interpret the results of running ADMIXTURE on a collection of genotypes. Here we will use R entirely for this aim. A secondary objective is to learn about some flexible R packages---lattice and ggplot---for visualizing data.
A caveat to using the software ADMIXTURE is that ADMIXTURE assumes a simple model of genetic transmission, and then adjusts the model paramters to best fit the data – this is for the purpose of making the results easier to interpret. In contrast, principal component analysis (PCA) which is not used here, allows for more flexible statistical analyses but are therefore prone to misinterpretation of data.
Here, for computational efficiency’s sake (not requiring 6 hour run time) the writers of the scripts have set k = 7 and k = 11 during their ADMIXTURE simulations to capture the population structure at two different resolutions.
After running the first script (please see “R Script #1” under Appendix), I obtain the following plots as my output. I explored using both k = 7 and k = 11 ADMIXTURE results.
This script has 4 parameters. Firslty, “k” is the number of ancestral populations, “fam.file” is the file giving sample information (which in this case is 2756 genotypes collected from a project focused on human genetic diversity), “admix.file” is the output from the ADMIXTURE software which gives estimated admixture proportions, and “int” is the quantiles of ADMIXTURE statistics to plot as error bars.
Common ancestry between individuals occur when divergence of ethnicity arises from a single ancestral population. For this plot, the y-axis showcases different ancestral populations from prior knowledge about the ethnic or geographic origins of the samples. The x-axis has the admixture proportions. Generally, this first plot shows the distribution of admixture proportions across a data set for a single ancestral population.
There error bars are not as extensive for k = 11 compared to k = 7, and there is a clear difference in resolution between the two plots. In the k = 7 plot, the
I then run the second R script (please see Appendix) provided by the workshop, and get these box plots as my output. Again, I used both the k = 7 and k = 11 data sets. The box plots show the admixture proportions of the 7 or 11 ancestral populations, with the total of their proportions adding up to 1.0. I changed the scale to make it easier to read the approximate quantities.
Here, the plot summarizes the distribution of admixture proportions across a collection of samples that were assigned the same labels (MXL, Maya, Pima). I could also have easily picked other labels.
Comparing these plots, it is clear that deciding which “k” to use is very controversial and importantly in the field, as it can significantly impact the way we interpret the admixture proportions distribution across a collection of samples.
## Loading required package: survival
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
I downloaded the software PLINK, and converted genomic data files (provided by the workshop) that were very large and inefficient into “.bed” files.
I then ran the output files from PLINK on ADMIXTURE. The run time took several minutes and the result were output files in the form of “.P” and “.Q”.
Here, I ran the third script (please see appendix) called “Compare Admixture”. What I basically did was to use the ADMIXTURE results we generated from the previous exercise to make ADMIXTURE predictions for new genotype samples.
## Loading sample info from PLINK .fam files.
## Loading admixture estimates.
## Retaining admixture estimates for 100 samples.
## Loading group labels assigned to samples.
## Distribution of largest differences in admixture estimates:
print(quantile(apply(abs(admix1 - admix2),1,max),seq(0,1,0.1)),digits = 2)
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 0.020 0.027 0.036 0.040 0.042 0.044 0.049 0.055 0.072 0.080 0.085
i <- apply(abs(admix1 - admix2),1,which.max)
print(ggplot(data.frame(x = admix1[cbind(1:n,i)],
y = admix2[cbind(1:n,i)],
label = panel$label),
aes(x,y,col = label,shape = label)) +
geom_abline(intercept = 0,slope = 1,size = 0.5,color = "gray",
linetype = "dashed") +
geom_point() + theme_minimal() +
scale_color_manual(values = clrs) +
scale_shape_manual(values = shapes) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(size = 9),
axis.title = element_text(size = 9),
axis.text = element_text(size = 9),
legend.text = element_text(size = 8),
legend.key.height = unit(9,"points")) +
labs(title = "admixture prop. with largest difference in each sample",
x = "admixture proportion from admix.file1",
y = "admixture proportion from admix.file2"))
I have learned immensely about R, RStudio and RMarkdown from just compiling this project. I learned more in depth about RMarkdown basics, how to use git, how to apply the knowledge on R graphics from Professor Samuel’s lectures, the importance of echo=FALSE and so on.
More importantly, I learned about the workflow behind doing a certain “DIY” analysis. Firstly, I start with using a software to run a simulation or statistical analysis (in this case, using ADMIXTURE). Then, I visualize the results from ADMIXTURE using R scripts. Finally, I use RMarkdown to present my project.
This was a great introductory class to R and I hope to do more projects using R as a powerful tool.
R Script #1
# This R script summarizes the distribution of admixture proportions
# across a data set for a single ancestral population. Here we assume
# that we have a collection of labeled samples, and we compile summary
# statistics for each set of samples with the same label.
#
# Before running this script in R, you need to specify several script
# parameters:
#
# k Index of ancestral population.
# fam.file PLINK .fam file giving sample info.
# labels.file Text file containing group/region labels for samples.
# admix.file ADMIXTURE output giving estimated admixture proportions.
# int Quantiles of admixture statistics to plot as error bars.
#
library(grid)
library(ggplot2)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:Hmisc':
##
## subplot
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# SCRIPT PARAMETERS
# -----------------
k <- 1
fam.file <- "~/Desktop/FALL 2017/CLASSES/PH 251D/Project 2/cehg16-workshop-master/data/panel/1kg_hgdp.fam"
labels.file <- "~/Desktop/FALL 2017/CLASSES/PH 251D/Project 2/cehg16-workshop-master/data/panel/1kg_hgdp.lab"
admix.file <- "~/Desktop/FALL 2017/CLASSES/PH 251D/Project 2/cehg16-workshop-master/data/admixture/1kg_hgdp.7.Q"
int <- c(0.05,0.95)
# LOAD SAMPLE DATA FROM .fam FILE
# -------------------------------
cat("Loading sample info from PLINK .fam file.\n")
## Loading sample info from PLINK .fam file.
panel <- read.table(fam.file,stringsAsFactors = FALSE)
names(panel) <- c("fid","iid","pid","mid","sex","pheno")
# LOADING GROUP LABELS
# --------------------
cat("Loading group labels assigned to samples.\n")
## Loading group labels assigned to samples.
labels <- read.table(labels.file,header = TRUE,stringsAsFactors = FALSE)
rownames(labels) <- labels$id
panel <- cbind(panel,data.frame(label = factor(labels[panel$iid,"label"])))
n <- nlevels(panel$label)
rm(labels)
# LOAD ADMIXTURE ESTIMATES
# ------------------------
# Load the n x k admixture proportions matrix, where n is the number
# of samples and k is the number of ancestral populations.
cat("Loading admixture estimates.\n")
## Loading admixture estimates.
admix <- as.matrix(read.table(admix.file))
rownames(admix) <- panel$iid
colnames(admix) <- paste0("K",1:ncol(admix))
# COMPILE ADMIXTURE SUMMARY STATISTICS
# ------------------------------------
# For each group label, compile statistics for the selected ancestral
# population: (1) the number of samples assigned that label; (2) the
# mean admixture proportion, (3) the lower bound on the interval, (4)
# the upper bound on the interval.
cat("Compiling group-level admixture summary statistics.\n")
## Compiling group-level admixture summary statistics.
admix.stats <-
data.frame(n = as.vector(table(panel$label)),
x = tapply(admix[,k],panel$label,mean),
a = tapply(admix[,k],panel$label,function(x) quantile(x,int[1])),
b = tapply(admix[,k],panel$label,function(x) quantile(x,int[2])))
# Order the groups by mean admixture proportion.
admix.stats <- admix.stats[order(admix.stats$x),]
# SUMMARIZE CONTRIBUTIONS FROM SELECTED ANCESTRAL POPULATION
# ----------------------------------------------------------
# Visualize these summary statistics using the error bar plot in
# ggplot2.
dev.new(height = 7,width = 4.5)
yooo = print(ggplot(cbind(admix.stats,y = 1:n),aes(x = x,y = y)) +
geom_point(col = "dodgerblue",cex = 1.4) +
geom_errorbarh(aes(xmin = a,xmax = b),col = "dodgerblue",height = 0) +
theme_minimal() +
scale_x_continuous(limits = c(-0.025,1.025),breaks = c(0,0.5,1)) +
scale_y_continuous(breaks = 1:n,
labels = paste(rownames(admix.stats),"-",
admix.stats$n)) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
axis.text.x = element_text(size = 9),
axis.text.y = element_text(size = 7),
axis.title = element_text(size = 9),
plot.title = element_text(size = 9)) +
labs(title = paste("ancestral population k =",k),
x = "admixture proportion",y = ""))
R Script #2
# This R script summarizes the distribution of admixture proportions
# across a collection of samples selected by their group or region
# label. This script is useful for looking closely at the genetic
# composition of a collection of genotypes that are believed to have
# similar ancestral origins.
#
# To run this script in R, you need to specify several script
# parameters:
#
# groups Select all samples that are assigned these labels.
# fam.file PLINK .fam file giving sample info.
# labels.file Text file containing group/region labels for samples.
# admix.file ADMIXTURE output giving estimated admixture proportions.
#
# By default, this script generates box-and-whisker plots. To create
# box-percentile plots instead, which can sometimes be more
# informative, or even more visually appealing, uncomment the lines of
# code toward the bottom of this script, as indicated below.
#
library(lattice)
library(Hmisc)
# SCRIPT PARAMETERS
# -----------------
groups <- c("MXL","Maya","Pima")
fam.file <- "~/Desktop/FALL 2017/CLASSES/PH 251D/Project 2/cehg16-workshop-master/data/panel/1kg_hgdp.fam"
labels.file <- "~/Desktop/FALL 2017/CLASSES/PH 251D/Project 2/cehg16-workshop-master/data/panel/1kg_hgdp.lab"
admix.file <- "~/Desktop/FALL 2017/CLASSES/PH 251D/Project 2/cehg16-workshop-master/data/admixture/1kg_hgdp.7.Q"
# LOAD SAMPLE DATA FROM .fam FILE
# -------------------------------
cat("Loading sample info from PLINK .fam file.\n")
## Loading sample info from PLINK .fam file.
panel <- read.table(fam.file,stringsAsFactors = FALSE)
names(panel) <- c("fid","iid","pid","mid","sex","pheno")
# LOADING GROUP LABELS
# --------------------
cat("Loading group labels assigned to samples.\n")
## Loading group labels assigned to samples.
labels <- read.table(labels.file,header = TRUE,stringsAsFactors = FALSE)
rownames(labels) <- labels$id
panel <- cbind(panel,data.frame(label = factor(labels[panel$iid,"label"])))
n <- nlevels(panel$label)
rm(labels)
# LOAD ADMIXTURE ESTIMATES
# ------------------------
# Load the n x k admixture proportions matrix, where n is the number
# of samples and k is the number of ancestral populations.
cat("Loading admixture estimates.\n")
## Loading admixture estimates.
admix <- as.matrix(read.table(admix.file))
K <- ncol(admix)
rownames(admix) <- panel$iid
colnames(admix) <- paste0("K",1:K)
# SUMMARIZE ADMIXTURE PROPORTIONS AMONG SELECTED SAMPLES
# ------------------------------------------------------
# Compile and visualize admixture proportions for the selected samples.
cat("Summarizing admixture among selected samples.\n")
## Summarizing admixture among selected samples.
trellis.par.set(par.main.text = list(cex = 0.75,font = 1),
par.xlab.text = list(cex = 0.75),
par.ylab.text = list(cex = 0.75),
axis.text = list(cex = 0.65))
rows <- which(is.element(panel$label,groups))
n <- length(rows)
print(bwplot(y ~ x,
data.frame(y = factor(as.vector(matrix(1:K,n,K,byrow=TRUE)),K:1),
x = as.vector(admix[rows,])),
scales = list(x = list(limits = c(-0.05,1.05),
at = seq(0,1,0.25))),
#
# Uncomment these lines to transform a box-and-whisker plot
# into a box-percentile plot:
#
# panel = panel.bpplot,
# probs = c(0.01,0.02,0.05,0.125,0.25,0.375),
# nloc = "none",
# cex.means = 0.75,
#
xlab = "admixture proportion",
ylab = "ancestral population",
main = paste(groups,collapse = ", ")))
R Script #3
# This R script summarizes the differences in two sets of admixture
# estimates. These are the parameters that need to be set before
# running this script:
#
# labels.file Text file containing group/region labels for samples.
# admix.file1 ADMIXTURE output containing first admixture estimates.
# admix.file2 ADMIXTURE output containing second admixture estimates.
# fam.file1 PLINK .fam file giving sample info for admix.file1.
# fam.file2 PLINK .fam file giving sample info for admix.file2.
#
library(grid)
library(ggplot2)
# SCRIPT PARAMETERS
# -----------------
labels.file <- "~/Desktop/FALL 2017/CLASSES/PH 251D/Project 2/cehg16-workshop-master/data/panel/1kg_hgdp.lab"
fam.file1 <- "~/Desktop/FALL 2017/CLASSES/PH 251D/Project 2/cehg16-workshop-master/data/panel/1kg_hgdp.fam"
fam.file2 <- "~/Desktop/FALL 2017/CLASSES/PH 251D/Project 2/cehg16-workshop-master/data/panel/1kg_hgdp_test.fam"
admix.file1 <- "~/Desktop/FALL 2017/CLASSES/PH 251D/Project 2/cehg16-workshop-master/data/admixture/1kg_hgdp.7.Q"
admix.file2 <- "~/Desktop/FALL 2017/CLASSES/PH 251D/Project 2/cehg16-workshop-master/data/panel/1kg_hgdp_test.7.Q"
# These are the shapes and colours used to plot the results.
clrs <- rep(c("darkorange","dodgerblue","forestgreen","red","navyblue","gold",
"yellowgreen","black","darkviolet","magenta","cyan"),each = 4)
shapes <- rep(c(20,1,2,4),times = 11)
# LOAD SAMPLE DATA FROM .fam FILES
# --------------------------------
cat("Loading sample info from PLINK .fam files.\n")
## Loading sample info from PLINK .fam files.
panel1 <- read.table(fam.file1,stringsAsFactors = FALSE)
panel2 <- read.table(fam.file2,stringsAsFactors = FALSE)
names(panel1) <- c("fid","iid","pid","mid","sex","pheno")
names(panel2) <- c("fid","iid","pid","mid","sex","pheno")
rownames(panel1) <- panel1$iid
rownames(panel2) <- panel2$iid
# LOAD ADMIXTURE ESTIMATES
# ------------------------
cat("Loading admixture estimates.\n")
## Loading admixture estimates.
admix1 <- as.matrix(read.table(admix.file1))
admix2 <- as.matrix(read.table(admix.file2))
K <- ncol(admix1)
rownames(admix1) <- panel1$iid
rownames(admix2) <- panel2$iid
colnames(admix1) <- paste0("K",1:K)
colnames(admix2) <- paste0("K",1:K)
# Retain only the admixture estimates for samples that are found in
# both sets of results.
ids <- intersect(panel1$iid,panel2$iid)
n <- length(ids)
panel <- panel1[ids,]
admix1 <- admix1[ids,]
admix2 <- admix2[ids,]
cat("Retaining admixture estimates for",n,"samples.\n")
## Retaining admixture estimates for 100 samples.
rm(panel1,panel2,ids)
# LOADING GROUP LABELS
# --------------------
cat("Loading group labels assigned to samples.\n")
## Loading group labels assigned to samples.
labels <- read.table(labels.file,header = TRUE,stringsAsFactors = FALSE)
rownames(labels) <- labels$id
panel <- cbind(panel,data.frame(label = factor(labels[panel$iid,"label"])))
rm(labels)
# SUMMARIZE DIFFERENCES IN ADMIXTURE PROPORTIONS
# ----------------------------------------------
# To summarize the differences in the admixture estimates, rather than
# look at all admixture proportions, I only show the admixture
# proportion for the ancestral population that shows the largest
# difference between the two estimates. Therefore, this plot gives a
# summary of the maximum admixture differences.
cat("Distribution of largest differences in admixture estimates:\n")
## Distribution of largest differences in admixture estimates:
dev.new(height = 5,width = 7.75)
print(quantile(apply(abs(admix1 - admix2),1,max),seq(0,1,0.1)),digits = 2)
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 0.020 0.027 0.036 0.040 0.042 0.044 0.049 0.055 0.072 0.080 0.085
i <- apply(abs(admix1 - admix2),1,which.max)
print(ggplot(data.frame(x = admix1[cbind(1:n,i)],
y = admix2[cbind(1:n,i)],
label = panel$label),
aes(x,y,col = label,shape = label)) +
geom_abline(intercept = 0,slope = 1,size = 0.5,color = "gray",
linetype = "dashed") +
geom_point() + theme_minimal() +
scale_color_manual(values = clrs) +
scale_shape_manual(values = shapes) +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
plot.title = element_text(size = 9),
axis.title = element_text(size = 9),
axis.text = element_text(size = 9),
legend.text = element_text(size = 8),
legend.key.height = unit(9,"points")) +
labs(title = "admixture prop. with largest difference in each sample",
x = "admixture proportion from admix.file1",
y = "admixture proportion from admix.file2"))