Part 1 - Basic Questions

What is the GWAS data set about? How is it useful?

Answer: A GWAS data set is a large data set, here containing more than 100k samples, that describes the relationship between individuals' genotypes and phenotypes. By finding associations between people's genotypes and phenotypes, we can estimate the effect of each genetic variant and help predict disease risk and its possible causes.

Explain the relationships between genome, chromosome, gene, and nucleotide.

Answer: The genome is the complete set of DNA in an organism. A chromosome is one of the long DNA molecules that make up the genome. A gene is a segment of DNA on a chromosome that encodes a protein; each chromosome contains many genes along with non-coding DNA. Nucleotides are the building blocks of DNA, and the order in which they are arranged along a gene carries the information that our body needs.

Explain what an allele and SNP are.

Answer: An allele is a specific version of a gene or SNP. A human carries two alleles at each position, one on each chromosome of a homologous pair. A SNP (single-nucleotide polymorphism) is a variation in a single nucleotide at a specific location in the genome; the different nucleotides observed at that location are its alleles.

Explain how to perform a hypothesis test to see whether an SNP is associated with a trait.

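Answer: Estimate the SNP's effect size β on the trait, then compute the p-value: the probability of observing this effect (or a stronger effect) under the null hypothesis (H0: β = 0, no real association, no effect). If the p-value is below the chosen significance threshold, reject H0 and conclude that the SNP is associated with the trait.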
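
As a minimal sketch of this test, assuming the effect estimate is approximately normal (a Wald test), the p-value can be computed from an effect size and its standard error. The values of `beta_hat` and `se` are hypothetical and do not come from the kidney data set.

```r
# Hypothetical effect estimate and standard error for one SNP
beta_hat <- 0.5
se <- 0.1

# Wald z statistic under H0: beta = 0
z <- beta_hat / se

# Two-sided p-value: probability of an effect at least this extreme under H0
p_value <- 2 * pnorm(-abs(z))

# Reject H0 if the p-value is below the chosen threshold
alpha <- 5e-8  # a commonly used genome-wide significance threshold
reject_h0 <- p_value < alpha

print(c(z = z, p_value = p_value))
print(reject_h0)
```

A published study may use a different test (for example a meta-analysis across cohorts), so its reported p-values need not equal this simple calculation.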

Explain why p = 0.05 is not appropriate in analyzing GWAS data.

Answer: A GWAS tests a very large number of SNPs, so the number of hypothesis tests is very large. At p = 0.05, about 5% of the SNPs with no real effect would still be declared significant by chance, so we would get a very large number of false positive results.
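
To make the scale concrete, here is a rough calculation using the number of SNPs in the summary statistics file loaded in Part 3 (6,621,611 rows). It assumes, for illustration only, that every SNP is truly null.

```r
# Number of SNPs tested (row count of the summary statistics file in Part 3)
n_tests <- 6621611
alpha <- 0.05

# Expected number of false positives if every SNP were truly null
expected_false_positives <- n_tests * alpha

# Per-test threshold that keeps the family-wise error rate at alpha (Bonferroni)
bonferroni_threshold <- alpha / n_tests

print(expected_false_positives)
print(bonferroni_threshold)
```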

Explain why FWER is more conservative than the FDR method.

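Answer: FWER controls the probability of making even one false positive across all tests, while FDR controls only the expected proportion of false positives among the rejected hypotheses. Because FDR allows some false positives, it rejects more hypotheses, so FWER is the more conservative method.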
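
The difference can be illustrated with a small simulation. This is a sketch on simulated data, not the kidney data set: the number of tests, the effect size of 4, and the seed are arbitrary choices.

```r
set.seed(1)

# Simulate 900 null tests and 100 tests with a real effect (z shifted by 4)
n_null <- 900
n_alt <- 100
z <- c(rnorm(n_null, mean = 0), rnorm(n_alt, mean = 4))
p <- 2 * pnorm(-abs(z))

# Adjust the same p-values with both methods
p_fwer <- p.adjust(p, method = "bonferroni")  # controls FWER
p_fdr <- p.adjust(p, method = "BH")           # controls FDR (Benjamini-Hochberg)

# Count discoveries at the 0.05 level
n_fwer <- sum(p_fwer < 0.05)
n_fdr <- sum(p_fdr < 0.05)

print(c(FWER = n_fwer, FDR = n_fdr))
```

The Bonferroni-adjusted p-value is never smaller than the Benjamini-Hochberg one, so the FWER count can never exceed the FDR count.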

Part 2 - Finding a Summary Statistics Data Set

Find a summary statistics data set that studies the association between SNPs and some given traits.

Read and understand the study and answer the following about the data set in your own words:

Provide the source (shall be from a published study)

Answer: https://www.ebi.ac.uk/gwas/studies/GCST007344.

Describe how the original genotype and phenotype data look.

Answer: The original genotype data is a massive matrix that contains each individual's genetic information. These data are usually stored in specialized compressed bioinformatics formats like VCF (Variant Call Format), BGEN, or PLINK binary files. The phenotype data are usually stored in a standard table (txt or csv). Each row contains the information for one individual, and the columns hold measurable variables such as the trait and covariates (age, sex, etc.).

Describe the trait(s) under study

Answer: The trait under study is the estimated glomerular filtration rate (eGFR), a measure of how well a person's kidneys filter waste from the blood.

Describe the purpose and motivation of the study.

Answer: The purpose of the study is to find out which SNPs are strongly associated with kidney function and how large their effects are.

Part 3 - Implementing FWER and FDR in Code

Write an R markdown or Python notebook to:

Load the summary statistics data set into a data frame.

library(dplyr)    # glimpse(), filter()
library(ggplot2)  # ggplot(), geom_histogram()

kidney_summary <- read.table("/Users/yuhe/Downloads/COGENT_Kidney_eGFR_trans_ethnic.txt.gz", header = TRUE)

glimpse(kidney_summary)
## Rows: 6,621,611
## Columns: 10
## $ SNV                     <chr> "rs3094315", "rs3115860", "rs117086422", "rs28…
## $ Chromosome              <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Position                <int> 752566, 753405, 845635, 846078, 846808, 846864…
## $ Effect_allele           <chr> "a", "a", "t", "t", "t", "c", "t", "a", "a", "…
## $ Other_allele            <chr> "g", "c", "c", "c", "c", "g", "c", "g", "g", "…
## $ Effect_allele_frequency <dbl> 0.8147, 0.8302, 0.1526, 0.1541, 0.1588, 0.1564…
## $ Sample_size             <dbl> 183941, 251086, 175014, 175014, 175014, 175014…
## $ P_value                 <dbl> 0.8758, 0.7181, 0.5213, 0.5026, 0.4363, 0.4771…
## $ Effect                  <dbl> -0.0656, -0.0156, -0.0285, -0.0089, 0.0191, 0.…
## $ Standard_error          <dbl> 0.1412, 0.1483, 0.1377, 0.1371, 0.1293, 0.1327…

Plot a histogram of p-values. Does the distribution match your expectation?

# fill colours the bars; colour only sets their outline
ggplot(kidney_summary) +
  geom_histogram(aes(P_value), bins = 40, fill = "lightblue", colour = "black")

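Answer: Yes, the distribution matches my expectation. The majority of SNPs do not affect kidney function, and under the null hypothesis p-values are uniformly distributed between 0 and 1, so most of the histogram should be flat. SNPs that are truly associated with the trait add an excess of p-values near 0.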
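
The flat shape expected under the null hypothesis can be checked with a short simulation. This is a sketch on simulated z statistics, not the kidney data; `n_snps` and the seed are arbitrary.

```r
set.seed(42)

# Simulate test statistics for SNPs with no real effect
n_snps <- 100000
z_null <- rnorm(n_snps)
p_null <- 2 * pnorm(-abs(z_null))

# Count p-values in 10 equal-width bins; under H0 each bin holds about 10%
bin_counts <- hist(p_null, breaks = seq(0, 1, by = 0.1), plot = FALSE)$counts
print(bin_counts)

# Fraction of null p-values below 0.05 should be close to 0.05
frac_below_005 <- mean(p_null < 0.05)
print(frac_below_005)
```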

Apply FWER and FDR to the data frame

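# Bonferroni correction controls the FWER
kidney_summary$P_value_FWER <- p.adjust(kidney_summary$P_value, method = "bonferroni")

# method = "fdr" is the Benjamini-Hochberg procedure, which controls the FDR
kidney_summary$P_value_FDR <- p.adjust(kidney_summary$P_value, method = "fdr")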

head(kidney_summary[, c("SNV", "P_value", "P_value_FWER", "P_value_FDR")])
##           SNV P_value P_value_FWER P_value_FDR
## 1   rs3094315  0.8758            1   0.9996768
## 2   rs3115860  0.7181            1   0.9994170
## 3 rs117086422  0.5213            1   0.9994170
## 4  rs28612348  0.5026            1   0.9994170
## 5   rs4475691  0.4363            1   0.9983586
## 6    rs950122  0.4771            1   0.9994170
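
To show what `p.adjust` computes for these two methods, the adjustments can be reproduced by hand. This is a minimal sketch on a small vector of hypothetical p-values; the variable names are illustrative only.

```r
# Hypothetical p-values, deliberately unsorted
p <- c(0.03, 0.0001, 0.6, 0.004, 0.2, 0.019)
n <- length(p)

# Bonferroni: multiply each p-value by the number of tests, cap at 1
bonf_manual <- pmin(1, p * n)

# Benjamini-Hochberg: scale the i-th smallest p-value by n / i, then
# enforce monotonicity by taking running minima from the largest p-value down
o <- order(p, decreasing = TRUE)   # positions from largest to smallest p
ro <- order(o)                     # permutation that restores the input order
bh_manual <- pmin(1, cummin(n / (n:1) * p[o]))[ro]

# Compare against the built-in implementation
print(all.equal(bonf_manual, p.adjust(p, method = "bonferroni")))
print(all.equal(bh_manual, p.adjust(p, method = "fdr")))
```

With millions of SNPs, multiplying a p-value by `n` exceeds 1 unless the p-value is extremely small, which is why the first rows of the table show a Bonferroni-adjusted value of 1.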

Find SNPs significantly associated with the trait.

significant_SNPs <- filter(kidney_summary, P_value_FDR < 0.05)

nrow(significant_SNPs)
## [1] 29943

Save the significant SNPs into a CSV result table.

write.csv(significant_SNPs, file = "Significant_SNPs_kidney.csv", row.names = FALSE)