{r setup, include=FALSE} knitr::opts_chunk$set(echo = TRUE)

Genomics of Human Populations Assignment 1

This is an introduction to population genetic concepts of genotype frequencies, allele frequencies, Hardy-Weinberg Equilibrium and linkage disequilibrium.

About the data

The data are from the 1000 Genomes Project ESN (Esan from Nigeria) and FIN (Finnish) populations. The VCFs consist of a 2 Mb region of chromosome 7 between 6 and 8 Mb. Other data are provided in various databases that you will explore.

Completing your assignment

This assignment is provided as an .Rmd file which can be open in RStudio and code chunks run interactively.

If possible, please return your updated .Rmd file (with your answers) and the knitted .html file with your answers and figures embedded after the appropriate questions. You can use this .Rmd document as a template.

If for some reason you are unable to return a knitted report, please include your report in a single pdf or word document.

On the NYU Brightspace website, you will find an entry for this assignment in the Assignment section. Please upload your completed assignment at the provided link.

The preferred way to complete the assignment is to first run the code interactively via RStudio (by clicking the green arrow in each code block beginning at the top of the .Rmd), fill in your answers (you can help the TA by putting your answers in bold), and then finally knitting your document into an .html file (See Recitation Week1 and the .Rmd presented in that session) to create a final report.

Please knit your report with your answers highlighted in bold. To do this, type your answer after the question in the .Rmd file, then put your answer between pairs of asterisks, one pair before and one pair after your answer.

Please include your name in the filenames and inside the file of any uploads to Brightspace.

General instructions

For all tasks in this assignment you may wish to review materials from Weeks 1-4.

Download files required for this homework here: https://drive.google.com/drive/folders/1XbIw6qfEAvoKCmx-cDejEfPpZZLOy-SU?usp=sharing

Place this .Rmd file and the VCF file in the same directory on your personal computer.

Before beginning, you need to install the tidyverse package (available at CRAN repository). This can be installed with the familiar install.packages("package name here") from your RStudio console.

The VariantAnnotation package is also required by following instructions at Bioconductor: https://bioconductor.org/packages/release/bioc/html/VariantAnnotation.html

Task 1: Genotype frequencies, allele frequencies and Hardy-Weinberg Equilibrium (HWE)

Here you will extract genotype data from the ESN population in the 1000 genomes project and answer basic questions about genotype and allele frequencies and determine if the locus is in HWE.

VCF is the most common file format for population genomic data The VariantAnnotation package is an excellent package for parsing VCF data.

Execute the following code block to read the VCF into memory and parse the data into a ‘tibble’, as table-like object in R.

```{r} library(tidyverse) vcf_file <- “chr7ESN_6000000_8000000.vcf”

vcf.CollapsedVCF <- VariantAnnotation::readVcf(vcf_file) isSNV.biallelic_only.logical <- VariantAnnotation::isSNV(vcf.CollapsedVCF,singleAltOnly=TRUE)

vcf.snps_only.CollapsedVCF <- vcf.CollapsedVCF[isSNV.biallelic_only.logical, ] # select biallelic SNPs only vcf.snps_only.SimpleList <- VariantAnnotation::geno(vcf.snps_only.CollapsedVCF)

vcf.snps_only.tbl_df <- vcf.snps_only.SimpleList[[‘GT’]] %>% as.data.frame(stringsAsFactors = F) %>% rownames_to_column(var = “snp_id”) %>% as_tibble()


Here, we will extract genotypes and count their frequencies from a SNP rs10951973.

```{r}
vcf.snps_only.tbl_df %>%
  filter(snp_id == 'rs10951973') %>%
  pivot_longer(cols = c(-snp_id),names_to = 'sample_id', values_to = 'genotype') %>%
  pull(genotype) %>%
  fct_count() %>%
  rename('genotype' = 'f')

Q1.1 What are the genotype frequencies of the 3 genotype classes (homozygous reference, homozygous alternate, heterozygous)? (3 points)

The heterozygous frequency is: 20/86 or 0.23
The homozygous reference frequency is: 65/86 or 0.76 (0.75581)
The homozgyous alternate frequency is: 1/86 or 0.01

Q1.2 What are the frequencies of the 0 and 1 alleles at this locus (2 point)?

The allele frequency of 0 at this locus is: 0.88
The allele frequency of 1 at this locus is: 0.12 (0.11627)

Q1.3 What is the expected heterozygosity at this locus (1 point)?

The expected heterozygosity is: 0.22

Q1.4 What is the observed heterozygosity (1 point)?

The observed heterozygosity is: 0.23

Q1.5 Is this locus at Hardy-Weinberg equilibrium (HWE)? (1 point)?

Yes, this locus is at HWE.

Perform a Chi-Square Goodness of Fit test to determine your answer.

Go to this calculator online: https://statulator.com/stat/chisq-gof.html. Fill out the fields in the calculator.

Then For your answer, state whether the locus is at HWE, report the Chi-Square value and the p-value.

Q1.6. Which of the following could explain why a locus IS NOT IN HWE? Select all that apply (3 points). (a) inbreeding (b) selection (c) population structure (genotypes from two or more non-random mating populations have been combined in the analysis) (d) random mating

Choice A, B, C, D (inbreeding; selection; population structure; and random mating) are all factors that can explain how a locus is not in HWE because each causes a significant deviation of expected allele frequencies or simply violate HWE principles.

Task 2: Generate a minor allele frequency spectrum for ESN and FIN populations

The site frequency spectrum (SFS) is an important property of populations than provides insight into population history, natural selection, and other population phenomena. Here you will generate an SFS for the ESN and FIN based on minor allele frequencies in the region of chromosome 7 from Task 1.

```{r} maf.tbl_df <- vcf.snps_only.tbl_df %>% pivot_longer(cols = -snp_id, names_to = “sample”, values_to = “genotype”) %>% mutate(alt_count = case_when(genotype == “0/0” ~ 0, genotype == “0|0” ~ 0, genotype == “0/1” ~ 1, genotype == “0|1” ~ 1, genotype == “1/0” ~ 1, genotype == “1|0” ~ 1, genotype == “1/1” ~ 2, genotype == “1|1” ~ 2, TRUE ~ NA_real_)) %>% mutate(ref_count = 2 - alt_count) %>% group_by(snp_id) %>% summarise(tot_alt_count = sum(alt_count), tot_ref_count = sum(ref_count)) %>% mutate(tot_minor = case_when(tot_alt_count <= tot_ref_count ~ tot_alt_count, tot_ref_count < tot_alt_count ~ tot_ref_count, TRUE ~ NA_real_)) %>% mutate(maf = tot_minor / (tot_ref_count + tot_alt_count)) %>% filter(maf > 0)

vcf2.CollapsedVCF <- VariantAnnotation::readVcf(“chr7FIN_6000000_8000000.vcf”) vcf2.snps_only.CollapsedVCF <- vcf2.CollapsedVCF[VariantAnnotation::isSNV(vcf2.CollapsedVCF,singleAltOnly=TRUE),] vcf2.snps_only.SimpleList <- VariantAnnotation::geno(vcf2.snps_only.CollapsedVCF)

Convert matrix to a tibble with new column called “snp_id”

vcf2.snps_only.tbl_df <- vcf2.snps_only.SimpleList[[‘GT’]] %>% as.data.frame(stringsAsFactors = F) %>% rownames_to_column(var = “snp_id”) %>% as_tibble()

maf2.tbl_df <- vcf2.snps_only.tbl_df %>% pivot_longer(cols = -snp_id, names_to = “sample”, values_to = “genotype”) %>% mutate(alt_count = case_when(genotype == “0/0” ~ 0, genotype == “0|0” ~ 0, genotype == “0/1” ~ 1, genotype == “0|1” ~ 1, genotype == “1/0” ~ 1, genotype == “1|0” ~ 1, genotype == “1/1” ~ 2, genotype == “1|1” ~ 2, TRUE ~ NA_real_)) %>% mutate(ref_count = 2 - alt_count) %>% group_by(snp_id) %>% summarise(tot_alt_count = sum(alt_count), tot_ref_count = sum(ref_count)) %>% mutate(tot_minor = case_when(tot_alt_count <= tot_ref_count ~ tot_alt_count, tot_ref_count < tot_alt_count ~ tot_ref_count, TRUE ~ NA_real_)) %>% mutate(maf = tot_minor / (tot_ref_count + tot_alt_count)) %>% filter(maf > 0)


Now we will plot both minor allele frequency spectra
```{r}
maf.tbl_df <- maf.tbl_df %>%
  mutate(pop = 'ESN')

maf2.tbl_df <- maf2.tbl_df %>%
  mutate(pop = 'FIN')

bind_rows(maf.tbl_df,maf2.tbl_df) %>%
  ggplot() + geom_histogram(aes(x = maf, y = after_stat(count/sum(count)), fill = pop),position = "dodge") +
  ylab('Proportion of SNPs') +
  xlab('Minor allele frequency (MAF)')

Q2.1 Are the SFS for ESN and FIN populations in the figure above ‘folded’ or ‘unfolded’ SFS (1 point)

This is a folded site frequency spectrum graph for ESN and FIN populations.

Q2.2 Which population has the minor allele SFS shifted towards lower frequency alleles in this section of chromosome 7 (1 point)?

Then FIN population since their frequency is not as high as the ESN throughout the figure.

Q2.3 Now lets count the number of SNPs in this region of chromosome 7 and estimate diversity using Watterson’s estimator of the population mutation parameter, theta (=4Neu).

``{r} tribble(~population,~SNPs,~N`, ‘ESN’,nrow(maf.tbl_df),ncol(vcf.snps_only.tbl_df) - 1, ‘FIN’,nrow(maf2.tbl_df),ncol(vcf2.snps_only.tbl_df) - 1)

```

Q2.3a How many chromosomes were sequenced in each population (1 point)?

Only one. This is chromosome 7.

Q2.3b What is per site Watterson’s Theta (i.e., ‘Theta W’) for each the ESN and FIN populations (2 points)?

The per site Watterson’s theta for ESN is: 3,913.32.
The per site Watterson’s theta for FIN is: 2,135.16 (2,135.157).

Q2.3c Which of the two populations has a smaller effective population size as measured at this locus (1 point)?

The FIN population has a smaller effective population size compared to the ESN population.

Task 3: Linkage disequilibrium in ESN and FIN populations

Here you will evaluate linkage disequilibrium between a pair of SNPs.

Step 1. Go to the LDPair tool at the National Institutes of Health website https://ldlink.nih.gov/?tab=ldpair Step 2. Select the current release of the human genome GRCh38. Step 3. Enter the RefSNP identifiers for SNPs on chromosome7, rs10951973 and rs10282509 in the two search windows Step 4. Select the ESN population from the dropdown menu

Q3.1 Now answer the following questions.

Q3.1a What two nucleotide alleles are observed at each of these two SNPs (1 point)

The two nucleotide alleles observed at each of these two SNPs: are AT and CT.

Q3.1b What are the haplotypes and their frequencies for these two SNPs in the ESN (1 point)?

The T_T haplotype has a frequency of: 0.813.
The A_T haplotype has a frequency of: 0.146.
The T_C haplotype has a frequency of: 0.03.
The A_C haplotype has a frequency of: 0.01.

Q3.1c What is the D’ value for this pair of SNPs (1 point)?

D’ value for this pair of SNPs is: 0.1108

Q3.1d What is the r2 value (1 point)?

The R2 value is: 0.0028.

Q3.1e Are these SNPs in linkage equilibrium or disequilibrium based on Chi-Square test of independence (1 point)?

These SNPs are in linkage equilibrum based on the Chi-Square test of independence.

Q3.2 Now repeat these steps for this pair of SNPs in the FIN population. Are the SNPs in linkage equilibrium or disequilibrium in the FIN, based on the Chi-Square test (1 point)?

This pair of SNPs in the FIN population are in linkage equilbrium basedn on the Chi-Square test.

Q3.3 Now repeat these steps for this pair of SNPs but selecting both FIN and ESN. Are the SNPs in linkage equilibrium or disequilibrium based on the Chi-Square test (1 point)?

This pair of SNPs (for both populations, FIN and ESN) are in linkage equilibrium based on the Chi-Square test.

Q3.4 What population processes do you think might explain your observations in Q3.1 - Q.3.3 (1 point). Select the single best answer (1 point) (a) a population bottleneck (b) inbreeding (c) natural selection (d) population structure (non-random mating between FIN and ESN)

(D): The population process of population structure explains the observation in previous observations.

Task 4: Linkage disequilibrium in the study of human disease

One of the great challenges in population and medical genomics is to identify genes and individual mutations that cause a disease. For the most part, population geneticists and clinicians have very limited knowledge of which variants in the genome actually cause disease. Genome-wide association studies are a critical tool used in population and medical genomics to identify both genes and possible causal mutations for a disease. Once a gene is identified as associated with a disease, it can suggest the cellular functions and gene networks (e.g., metabolic pathways or gene regulatory networks) involved in development of the disease. While associations in a gene can provide evidence that a gene is involved in a disease, it is more challenging to determine which mutations are causal of the disease.

A 2010 Genome-wide Association Study published in Nature (https://www.nature.com/articles/nature09270 doi: 10.1038/nature09270) by Teslovich et al. surveyed blood lipid levels in 100,000 individuals and performed Genome-wide association study. The goal was to identify genes that control blood lipid levels.

Q4.1 Which of the following are absolutely essential to a GWAS study such as Teslovich et al, without which the study cannot be conducted and disease traits cannot be mapped. Select all that apply (3 points)

Table 1 of Teslovich et al. lists “lead SNPs” detected in the study. A lead SNP is the SNP that showed the strongest statistical association with the trait in the in a given region (or in the entire genome). Table 1 shows that for the TG (Triacylglycerides) and (HDL) High Density Lipoprotein) traits, a lead SNP rs12678919 was found in the Lipoprotein Lipase gene (LPL).

This makes biological sense, but what exactly does this mean? Is this SNP causal for differences in levels of blood TG or HDL? To answer these questions, we need to investigate further.

Visit each of the following websites and perform a query for rs12678919:
dbSNP https://www.ncbi.nlm.nih.gov/snp/ ClinVar https://www.ncbi.nlm.nih.gov/clinvar/ GWAS catalog at EMBL-EBI https://www.ebi.ac.uk/gwas/ SNPedia https://www.snpedia.com/index.php/SNPedia
*note: SNPpedia lists a comprehensive list of other databases that may have information on rs12678919) websites. answer the following questions about this SNP.

Q4.2 What best describes the type of effect that SNP has at the molecular level. Select the single best answer (1 point). (a) It is an intergenic SNP
(b) It is a missense SNP that changes the amino acid
(c) It is a synonymous SNP in a codon, but does not change amino acid
(d) It introduces a premature termination codon (i.e., a stop gain or nonsense mutation)
(e) It is an intronic SNP

(A): It is an intergenic SNP based on the search results of the GWAS catalog.

Q4.3 What is the clinical significance of this SNP as reported in the ClinVar databsae (1 points)? Report the exact terminology used at ClinVar to describe the effect of SNP rs12678919. If on the other hand, the SNP is not in Clinvar, then state so, and what this might mean about the clinical significance of the SNP. (1 point)

There is no reported clinical significance in Clinvar, indicating that this is a SNP that is not significant with disease association currently. Another explanation is that this variant did not pass validation or it was sent back to the submitter for correction or deletion in the databank.

Q4.4 How can SNP rs12678919 both be strongly associated with elevated TG and HDL, and not causal for differences in levels of HDL or TG? Select the single best answer (1 point). (a) The effect size of the SNP on TG and HDL in GWAS is too small to be recorded in ClinVar
(b) This SNP is in linkage disequilibrium with a causal mutation for TG and HDL traits
(c) This SNP is not in linkage disequilibrium with any causal mutation for TG and HDL traits
(d) The SNP is a suppressor of the phenotype and therefore incompletely penetrant

(B): This SNP is in linkage disequilbrium with a causal mutation for TG and HDL traits.

A candidate mutation in the LPL gene that could cause changes to HDL and TG is rs328. If you review this SNP at dbSNP https://www.ncbi.nlm.nih.gov/snp/rs328, you will see that the G allele introduces a premature termination codon (‘Stop Gain’) in the LPL protein.

Q4.5 Is the candidate SNP rs328 in linkage disequilibrium with rs12678919, the lead SNP for TG and HDL traits in the Teslovich study (1 point)?

The candidate SNP rs328 is in linkage disequilibrium with rs1267819 from the Teslovich study.

Go to LDlink/LDPair tool https://ldlink.nih.gov/?tab=ldpair. Select GRCh38, enter the two SNPs, and select the CEU population from Europe. Report the r^2, D’ and significance of the association in the Chi-Square Test of independence.

The r2 value is: 1.0. The D’ value is: 1.0. The Chi-Square Test of independence suggests that there is high correlation between the two SNPs(198.00 value) and then the p-value is less than 0.05, where this suggests this association to be significant. Because it is <0.0001 (p-value), the statistical association becomes even higher, supported by the two values of r2 and D prime.

Q4.6 Based on your answer to Q4.5, could rs328 be the causal SNP responsible for the statistical association between SNP rs12678919 and TG and HDL traits (1 point)?

Yes, the rs328 can be the casual SNP that causes statistical association between SNP rs12678919 and TG and HDL traits.

Q4.7 Now look up SNP rs328 in ClinVar. What is the clinical significance/interpretation of rs328 (1 point)?
(a) Benign
(b) Likely pathogenic
(c) Pathogenic

(A): Benign.

Q4.8 Based on your analysis, how confident are you that either SNP rs12678919 (the lead SNP) or SNP rs328 (the linked SNP) are causal for TG and HDL levels in the blood? Do you need to keep looking to find the causal mutation (1 point)?

From the values derived above, it provides a strong case of causality however further tests need to fully elucidate the role and how it plays out in the body. We just provided the groundwork to undrestnad the biological mechanism, sturctural changes in application both within and outside of the biological system, and how it interacts outside of a sampled population. Both structural and information from other populations (outside from sample) need to be included to answer this question with more confidence.

Assignment 1

Edward Friedman

10/1/24