Answer: A GWAS data set is a large data set, here containing more than 100k samples. It records each individual’s genotype and phenotype. By relating genotype to phenotype across individuals, we can estimate the effect of each genetic variant on the trait, which in turn helps predict disease risk and identify possible genetic causes.
Answer: The genome is the complete set of DNA in an organism. A chromosome is one of the long DNA molecules that make up the genome; each chromosome carries many genes along with non-coding DNA. A gene is a segment of DNA that typically encodes a protein. Nucleotides (A, C, G, T) are the building blocks of DNA, and their order along the DNA carries the information our body needs.
Answer: An allele is a specific version of a gene or SNP. A person carries two alleles at each position, one on each copy of the chromosome (one inherited from each parent). A SNP (single-nucleotide polymorphism) is a position in the genome where the nucleotide differs between people; this difference is what makes two alleles distinct from each other.
Answer: For each SNP, the test asks how likely it would be to observe the estimated effect size (or a stronger effect) under the null hypothesis (H0: β = 0, no real association, no effect).
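As a rough illustration of the test described above, a per-SNP p-value can be computed from an effect estimate and its standard error with a two-sided Wald z-test. This is only a sketch: `beta_hat` and `se` are made-up values, and the study itself may have used a different test statistic.

```r
# Hypothetical effect estimate and standard error for one SNP (not from the data).
beta_hat <- 0.5
se <- 0.25

# Wald z-statistic under H0: beta = 0.
z <- beta_hat / se

# Two-sided p-value: probability of a |z| at least this large if H0 is true.
p_value <- 2 * pnorm(-abs(z))
p_value
```

A small p-value means the observed effect would be unusual if the SNP truly had no effect.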
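Answer: The data set contains a very large number of SNPs, so a very large number of hypothesis tests are performed. Even if no SNP had a real effect, about 5% of the tests would fall below a 0.05 threshold by chance alone, so we would get a very large number of false positives.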
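The scale of the problem can be sketched with a little arithmetic. The example below assumes every SNP is truly null and uses the number of SNPs in the kidney summary data (6,621,611 rows) together with an unadjusted 0.05 threshold; the simulation size `n_sim` is an arbitrary choice for illustration.

```r
m <- 6621611   # number of SNPs tested (rows in the kidney summary data)
alpha <- 0.05  # unadjusted significance threshold

# Expected number of false positives if every SNP is truly null.
expected_fp <- m * alpha
expected_fp

# Small simulation: under H0, p-values are uniform on [0, 1],
# so roughly a fraction alpha of them fall below alpha by chance.
set.seed(1)
n_sim <- 1e5
p_null <- runif(n_sim)
mean(p_null < alpha)
```

With these numbers the expected count is over 330,000 false positives, which is why the raw p-values cannot be used directly.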
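Answer: FWER control is strict: it limits the probability of making even one false positive across all tests. FDR control is more lenient: it allows some false positives but limits their expected proportion among the SNPs declared significant.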
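The difference can be seen on a toy example. The p-values below are invented for illustration; `p.adjust()` is the base R function, with `"bonferroni"` giving an FWER-controlling adjustment and `"fdr"` (an alias for Benjamini-Hochberg) giving an FDR-controlling one.

```r
# Five hypothetical p-values (not from the kidney data).
p <- c(0.001, 0.008, 0.02, 0.03, 0.6)

# FWER control: Bonferroni multiplies each p-value by the number of tests (capped at 1).
p_fwer <- p.adjust(p, method = "bonferroni")

# FDR control: Benjamini-Hochberg adjustment.
p_fdr <- p.adjust(p, method = "fdr")

# Number of tests declared significant at 0.05 under each adjustment.
sum(p_fwer < 0.05)
sum(p_fdr < 0.05)
```

On this toy vector, the Bonferroni adjustment keeps two of the five tests and the FDR adjustment keeps four, reflecting the stricter control of FWER.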
Answer: The original genotype data is a massive matrix containing each individual’s genetic information. These data are usually stored in specialized compressed bioinformatics formats such as VCF (Variant Call Format), BGEN, or PLINK binary files. The phenotype data are usually stored in a standard tabular file (txt or csv). Each row contains the information for one individual, and the columns hold measured variables such as the trait and covariates (age, sex, etc.).
Answer: The trait under study is the estimated glomerular filtration rate (eGFR), a measure of how well a person’s kidneys filter waste from the blood.
Answer: The purpose of the study is to find out which SNPs are strongly associated with kidney function and how large their effects are.
library(dplyr)    # glimpse(), filter()
library(ggplot2)  # ggplot(), geom_histogram()
kidney_summary <- read.table("/Users/yuhe/Downloads/COGENT_Kidney_eGFR_trans_ethnic.txt.gz", header = TRUE)
glimpse(kidney_summary)
## Rows: 6,621,611
## Columns: 10
## $ SNV <chr> "rs3094315", "rs3115860", "rs117086422", "rs28…
## $ Chromosome <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ Position <int> 752566, 753405, 845635, 846078, 846808, 846864…
## $ Effect_allele <chr> "a", "a", "t", "t", "t", "c", "t", "a", "a", "…
## $ Other_allele <chr> "g", "c", "c", "c", "c", "g", "c", "g", "g", "…
## $ Effect_allele_frequency <dbl> 0.8147, 0.8302, 0.1526, 0.1541, 0.1588, 0.1564…
## $ Sample_size <dbl> 183941, 251086, 175014, 175014, 175014, 175014…
## $ P_value <dbl> 0.8758, 0.7181, 0.5213, 0.5026, 0.4363, 0.4771…
## $ Effect <dbl> -0.0656, -0.0156, -0.0285, -0.0089, 0.0191, 0.…
## $ Standard_error <dbl> 0.1412, 0.1483, 0.1377, 0.1371, 0.1293, 0.1327…
ggplot(kidney_summary) +
  geom_histogram(aes(P_value), bins = 40, colour = "lightblue")
Answer: The distribution matches my expectation: the majority of SNPs do not affect kidney function, and p-values for such SNPs are expected to be roughly uniformly distributed between 0 and 1.
kidney_summary$P_value_FWER <- p.adjust(kidney_summary$P_value, method = "bonferroni")
kidney_summary$P_value_FDR <- p.adjust(kidney_summary$P_value, method = "fdr")
head(kidney_summary[, c("SNV", "P_value", "P_value_FWER", "P_value_FDR")])
##           SNV P_value P_value_FWER P_value_FDR
## 1   rs3094315  0.8758            1   0.9996768
## 2   rs3115860  0.7181            1   0.9994170
## 3 rs117086422  0.5213            1   0.9994170
## 4  rs28612348  0.5026            1   0.9994170
## 5   rs4475691  0.4363            1   0.9983586
## 6    rs950122  0.4771            1   0.9994170
significant_SNPs <- filter(kidney_summary, P_value_FDR < 0.05)
nrow(significant_SNPs)
## [1] 29943
write.csv(significant_SNPs, file = "Significant_SNPs_kidney.csv", row.names = FALSE)