{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE)
This is an introduction to population genetic concepts of genotype frequencies, allele frequencies, Hardy-Weinberg Equilibrium and linkage disequilibrium.
The data are from the 1000 Genomes Project ESN (Esan from Nigeria) and FIN (Finnish) populations. The VCFs consist of a 2 Mb region of chromosome 7 between 6 and 8 Mb. Other data are provided in various databases that you will explore.
This assignment is provided as an .Rmd file which can be open in RStudio and code chunks run interactively.
If possible, please return your updated .Rmd file (with your answers) and the knitted .html file with your answers and figures embedded after the appropriate questions. You can use this .Rmd document as a template.
If for some reason you are unable to return a knitted report, please include your report in a single pdf or word document.
On the NYU Brightspace website, you will find an entry for this assignment in the Assignment section. Please upload your completed assignment at the provided link.
The preferred way to complete the assignment is to first run the code interactively via RStudio (by clicking the green arrow in each code block beginning at the top of the .Rmd), fill in your answers (you can help the TA by putting your answers in bold), and then finally knitting your document into an .html file (See Recitation Week1 and the .Rmd presented in that session) to create a final report.
Please knit your report with your answers highlighted in bold. To do this, type your answer after the question in the .Rmd file, then put your answer between pairs of asterisks, one pair before and one pair after your answer.
Please include your name in the filenames and inside the file of any uploads to Brightspace.
For all tasks in this assignment you may wish to review materials from Weeks 1-4.
Download files required for this homework here: https://drive.google.com/drive/folders/1XbIw6qfEAvoKCmx-cDejEfPpZZLOy-SU?usp=sharing
Place this .Rmd file and the VCF file in the same directory on your personal computer.
Before beginning, you need to install the tidyverse package
(available at CRAN repository). This can be installed with the familiar
install.packages("package name here") from your RStudio
console.
The VariantAnnotation package is also required by following instructions at Bioconductor: https://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html
Here you will extract genotype data from the ESN population in the 1000 genomes project and answer basic questions about genotype and allele frequencies and determine if the locus is in HWE.
VCF is the most common file format for population genomic data The VariantAnnotation package is an excellent package for parsing VCF data.
Execute the following code block to read the VCF into memory and parse the data into a ‘tibble’, as table-like object in R.
```{r} library(tidyverse) vcf_file <- “chr7ESN_6000000_8000000.vcf”
vcf.CollapsedVCF <- VariantAnnotation::readVcf(vcf_file) isSNV.biallelic_only.logical <- VariantAnnotation::isSNV(vcf.CollapsedVCF,singleAltOnly=TRUE)
vcf.snps_only.CollapsedVCF <- vcf.CollapsedVCF[isSNV.biallelic_only.logical, ] # select biallelic SNPs only vcf.snps_only.SimpleList <- VariantAnnotation::geno(vcf.snps_only.CollapsedVCF)
vcf.snps_only.tbl_df <- vcf.snps_only.SimpleList[[‘GT’]] %>% as.data.frame(stringsAsFactors = F) %>% rownames_to_column(var = “snp_id”) %>% as_tibble()
Here, we will extract genotypes and count their frequencies from a SNP rs10951973.
```{r}
vcf.snps_only.tbl_df %>%
filter(snp_id == 'rs10951973') %>%
pivot_longer(cols = c(-snp_id),names_to = 'sample_id', values_to = 'genotype') %>%
pull(genotype) %>%
fct_count() %>%
rename('genotype' = 'f')
Q1.1 What are the genotype frequencies of the 3 genotype classes (homozygous reference, homozygous alternate, heterozygous)? (3 points)
Q1.2 What are the frequencies of the 0 and 1 alleles at this locus (2 point)?
Q1.3 What is the expected heterozygosity at this locus (1 point)?
Q1.4 What is the observed heterozygosity (1 point)?
Q1.5 Is this locus at Hardy-Weinberg equilibrium (HWE)? (1 point)?
Perform a Chi-Square Goodness of Fit test to determine your answer.
Go to this calculator online: https://statulator.com/stat/chisq-gof.html. Fill out the fields in the calculator.
Then For your answer, state whether the locus is at HWE, report the Chi-Square value and the p-value.
Q1.6. Which of the following could explain why a locus IS NOT IN HWE? Select all that apply (3 points). (a) inbreeding (b) selection (c) population structure (genotypes from two or more non-random mating populations have been combined in the analysis) (d) random mating
The site frequency spectrum (SFS) is an important property of populations than provides insight into population history, natural selection, and other population phenomena. Here you will generate an SFS for the ESN and FIN based on minor allele frequencies in the region of chromosome 7 from Task 1.
```{r} maf.tbl_df <- vcf.snps_only.tbl_df %>% pivot_longer(cols = -snp_id, names_to = “sample”, values_to = “genotype”) %>% mutate(alt_count = case_when(genotype == “0/0” ~ 0, genotype == “0|0” ~ 0, genotype == “0/1” ~ 1, genotype == “0|1” ~ 1, genotype == “1/0” ~ 1, genotype == “1|0” ~ 1, genotype == “1/1” ~ 2, genotype == “1|1” ~ 2, TRUE ~ NA_real_)) %>% mutate(ref_count = 2 - alt_count) %>% group_by(snp_id) %>% summarise(tot_alt_count = sum(alt_count), tot_ref_count = sum(ref_count)) %>% mutate(tot_minor = case_when(tot_alt_count <= tot_ref_count ~ tot_alt_count, tot_ref_count < tot_alt_count ~ tot_ref_count, TRUE ~ NA_real_)) %>% mutate(maf = tot_minor / (tot_ref_count + tot_alt_count)) %>% filter(maf > 0)
vcf2.CollapsedVCF <- VariantAnnotation::readVcf(“chr7FIN_6000000_8000000.vcf”) vcf2.snps_only.CollapsedVCF <- vcf2.CollapsedVCF[VariantAnnotation::isSNV(vcf2.CollapsedVCF,singleAltOnly=TRUE),] vcf2.snps_only.SimpleList <- VariantAnnotation::geno(vcf2.snps_only.CollapsedVCF)
vcf2.snps_only.tbl_df <- vcf2.snps_only.SimpleList[[‘GT’]] %>% as.data.frame(stringsAsFactors = F) %>% rownames_to_column(var = “snp_id”) %>% as_tibble()
maf2.tbl_df <- vcf2.snps_only.tbl_df %>% pivot_longer(cols = -snp_id, names_to = “sample”, values_to = “genotype”) %>% mutate(alt_count = case_when(genotype == “0/0” ~ 0, genotype == “0|0” ~ 0, genotype == “0/1” ~ 1, genotype == “0|1” ~ 1, genotype == “1/0” ~ 1, genotype == “1|0” ~ 1, genotype == “1/1” ~ 2, genotype == “1|1” ~ 2, TRUE ~ NA_real_)) %>% mutate(ref_count = 2 - alt_count) %>% group_by(snp_id) %>% summarise(tot_alt_count = sum(alt_count), tot_ref_count = sum(ref_count)) %>% mutate(tot_minor = case_when(tot_alt_count <= tot_ref_count ~ tot_alt_count, tot_ref_count < tot_alt_count ~ tot_ref_count, TRUE ~ NA_real_)) %>% mutate(maf = tot_minor / (tot_ref_count + tot_alt_count)) %>% filter(maf > 0)
Now we will plot both minor allele frequency spectra
```{r}
maf.tbl_df <- maf.tbl_df %>%
mutate(pop = 'ESN')
maf2.tbl_df <- maf2.tbl_df %>%
mutate(pop = 'FIN')
bind_rows(maf.tbl_df,maf2.tbl_df) %>%
ggplot() + geom_histogram(aes(x = maf, y = after_stat(count/sum(count)), fill = pop),position = "dodge") +
ylab('Proportion of SNPs') +
xlab('Minor allele frequency (MAF)')
Q2.1 Are the SFS for ESN and FIN populations in the figure above ‘folded’ or ‘unfolded’ SFS (1 point)
Q2.2 Which population has the minor allele SFS shifted towards lower frequency alleles in this section of chromosome 7 (1 point)?
Q2.3 Now lets count the number of SNPs in this region of chromosome 7 and estimate diversity using Watterson’s estimator of the population mutation parameter, theta (=4Neu).
``{r} tribble(~population,~SNPs,~N`,
‘ESN’,nrow(maf.tbl_df),ncol(vcf.snps_only.tbl_df) - 1,
‘FIN’,nrow(maf2.tbl_df),ncol(vcf2.snps_only.tbl_df) - 1)
```
Q2.3a How many chromosomes were sequenced in each population (1 point)?
Q2.3b What is per site Watterson’s Theta (i.e., ‘Theta W’) for each the ESN and FIN populations (2 points)?
The per site Watterson’s theta for ESN is: 3,913.32.
The per site Watterson’s theta for FIN is: 2,135.16 (2,135.157).
Q2.3c Which of the two populations has a smaller effective population size as measured at this locus (1 point)?
Here you will evaluate linkage disequilibrium between a pair of SNPs.
Step 1. Go to the LDPair tool at the National Institutes of Health website https://ldlink.nih.gov/?tab=ldpair Step 2. Select the current release of the human genome GRCh38. Step 3. Enter the RefSNP identifiers for SNPs on chromosome7, rs10951973 and rs10282509 in the two search windows Step 4. Select the ESN population from the dropdown menu
Q3.1 Now answer the following questions.
Q3.1a What two nucleotide alleles are observed at each of these two SNPs (1 point)
Q3.1b What are the haplotypes and their frequencies for these two SNPs in the ESN (1 point)?
The T_T haplotype has a frequency of: 0.813.
The A_T haplotype has a frequency of: 0.146.
The T_C haplotype has a frequency of: 0.03.
The A_C haplotype has a frequency of: 0.01.
Q3.1c What is the D’ value for this pair of SNPs (1 point)?
Q3.1d What is the r2 value (1 point)?
Q3.1e Are these SNPs in linkage equilibrium or disequilibrium based on Chi-Square test of independence (1 point)?
Q3.2 Now repeat these steps for this pair of SNPs in the FIN population. Are the SNPs in linkage equilibrium or disequilibrium in the FIN, based on the Chi-Square test (1 point)?
Q3.3 Now repeat these steps for this pair of SNPs but selecting both FIN and ESN. Are the SNPs in linkage equilibrium or disequilibrium based on the Chi-Square test (1 point)?
Q3.4 What population processes do you think might explain your observations in Q3.1 - Q.3.3 (1 point). Select the single best answer (1 point) (a) a population bottleneck (b) inbreeding (c) natural selection (d) population structure (non-random mating between FIN and ESN)
One of the great challenges in population and medical genomics is to identify genes and individual mutations that cause a disease. For the most part, population geneticists and clinicians have very limited knowledge of which variants in the genome actually cause disease. Genome-wide association studies are a critical tool used in population and medical genomics to identify both genes and possible causal mutations for a disease. Once a gene is identified as associated with a disease, it can suggest the cellular functions and gene networks (e.g., metabolic pathways or gene regulatory networks) involved in development of the disease. While associations in a gene can provide evidence that a gene is involved in a disease, it is more challenging to determine which mutations are causal of the disease.
A 2010 Genome-wide Association Study published in Nature (https://www.nature.com/articles/nature09270 doi: 10.1038/nature09270) by Teslovich et al. surveyed blood lipid levels in 100,000 individuals and performed Genome-wide association study. The goal was to identify genes that control blood lipid levels.
Q4.1 Which of the following are absolutely essential to a GWAS study such as Teslovich et al, without which the study cannot be conducted and disease traits cannot be mapped. Select all that apply (3 points)
Table 1 of Teslovich et al. lists “lead SNPs” detected in the study. A lead SNP is the SNP that showed the strongest statistical association with the trait in the in a given region (or in the entire genome). Table 1 shows that for the TG (Triacylglycerides) and (HDL) High Density Lipoprotein) traits, a lead SNP rs12678919 was found in the Lipoprotein Lipase gene (LPL).
This makes biological sense, but what exactly does this mean? Is this SNP causal for differences in levels of blood TG or HDL? To answer these questions, we need to investigate further.
Visit each of the following websites and perform a query for
rs12678919:
dbSNP https://www.ncbi.nlm.nih.gov/snp/ ClinVar https://www.ncbi.nlm.nih.gov/clinvar/ GWAS catalog at
EMBL-EBI https://www.ebi.ac.uk/gwas/ SNPedia https://www.snpedia.com/index.php/SNPedia
*note: SNPpedia lists a comprehensive list of other databases that may
have information on rs12678919) websites. answer the following questions
about this SNP.
Q4.2 What best describes the type of effect that SNP
has at the molecular level. Select the single best answer (1 point). (a)
It is an intergenic SNP
(b) It is a missense SNP that changes the amino acid
(c) It is a synonymous SNP in a codon, but does not change amino
acid
(d) It introduces a premature termination codon (i.e., a stop gain or
nonsense mutation)
(e) It is an intronic SNP
Q4.3 What is the clinical significance of this SNP as reported in the ClinVar databsae (1 points)? Report the exact terminology used at ClinVar to describe the effect of SNP rs12678919. If on the other hand, the SNP is not in Clinvar, then state so, and what this might mean about the clinical significance of the SNP. (1 point)
Q4.4 How can SNP rs12678919 both be strongly
associated with elevated TG and HDL, and not causal for differences in
levels of HDL or TG? Select the single best answer (1 point). (a) The
effect size of the SNP on TG and HDL in GWAS is too small to be recorded
in ClinVar
(b) This SNP is in linkage disequilibrium with a causal mutation for TG
and HDL traits
(c) This SNP is not in linkage disequilibrium with any causal mutation
for TG and HDL traits
(d) The SNP is a suppressor of the phenotype and therefore incompletely
penetrant
A candidate mutation in the LPL gene that could cause changes to HDL and TG is rs328. If you review this SNP at dbSNP https://www.ncbi.nlm.nih.gov/snp/rs328, you will see that the G allele introduces a premature termination codon (‘Stop Gain’) in the LPL protein.
Q4.5 Is the candidate SNP rs328 in linkage disequilibrium with rs12678919, the lead SNP for TG and HDL traits in the Teslovich study (1 point)?
Go to LDlink/LDPair tool https://ldlink.nih.gov/?tab=ldpair. Select GRCh38, enter the two SNPs, and select the CEU population from Europe. Report the r^2, D’ and significance of the association in the Chi-Square Test of independence.
Q4.6 Based on your answer to Q4.5, could rs328 be the causal SNP responsible for the statistical association between SNP rs12678919 and TG and HDL traits (1 point)?
Q4.7 Now look up SNP rs328 in ClinVar. What is the
clinical significance/interpretation of rs328 (1 point)?
(a) Benign
(b) Likely pathogenic
(c) Pathogenic
Q4.8 Based on your analysis, how confident are you that either SNP rs12678919 (the lead SNP) or SNP rs328 (the linked SNP) are causal for TG and HDL levels in the blood? Do you need to keep looking to find the causal mutation (1 point)?