This notebook completes the assignment described in the lecture Introduction to GWAS Data Set. It is organized in three parts:
A Genome-Wide Association Study (GWAS) data set contains genetic information collected from a large group of people together with one or more measured traits (for example height, BMI, or a disease status such as type-2 diabetes). For every individual the data set records the genotype — typically the alleles carried at hundreds of thousands to millions of Single Nucleotide Polymorphism (SNP) positions across the whole genome — and the phenotype — the trait value being studied.
A summary statistics version of a GWAS data set does not contain individual-level genotypes (which are usually protected for privacy reasons). Instead, for every SNP that was tested it reports the result of the association test: the chromosome and position, the effect allele, the estimated effect size (\(\beta\)), the standard error, and the p-value.
GWAS data sets are useful because they let researchers:
These four terms form a nested hierarchy, from the smallest unit of DNA up to the entire genetic material of an organism:
Nucleotide — the smallest building block. Each nucleotide consists of a sugar (deoxyribose), a phosphate group, and one of four chemical bases: Adenine (A), Thymine (T), Cytosine (C), or Guanine (G). Nucleotides are joined in long chains; the two complementary chains together form the DNA double helix, with A always pairing with T and C with G.
Gene — a contiguous stretch of nucleotides (typically thousands of bases long) that codes for a functional product, almost always a protein (or a regulatory RNA). The human genome has roughly 20,000–25,000 protein-coding genes. A gene is therefore a functional sub-sequence of the DNA chain.
Chromosome — a single, very long, tightly packaged DNA molecule that contains many genes plus a great deal of non-coding sequence. Humans have 23 pairs of chromosomes (22 autosomal pairs plus the sex chromosomes X/Y), so 46 chromosomes in total per cell. Each chromosome may carry hundreds to a few thousand genes.
Genome — the complete set of DNA in an organism, i.e. the sum of all the chromosomes. The human genome contains approximately 3.2 billion base pairs.
In one sentence: nucleotides are the letters of the alphabet, genes are meaningful words, chromosomes are very long books, and the genome is the entire library.
A SNP (Single Nucleotide Polymorphism) is a position
in the genome where the DNA sequence differs between people by exactly
one nucleotide. For example, at a particular position on chromosome 1
most individuals may have an A, while a sizeable
fraction of the population has a G at the very same
position. That single-base variation is one SNP, and it is usually given
an identifier such as rs12345.
An allele is one of the alternative versions that can occur at a variable position. A typical SNP has two alleles — the more common one is called the major (or reference) allele, and the less common one the minor allele. The frequency of the minor allele in a population is the Minor Allele Frequency (MAF).
Because humans are diploid (we inherit one chromosome from each parent), every person carries two alleles at every SNP. Using the A/G example, an individual can be:
In GWAS the genotype is usually encoded as the number of copies of the effect allele, i.e. 0, 1, or 2. This count is what gets used as the predictor variable when testing the SNP’s association with a trait.
For each SNP we test whether the genotype at that locus is statistically associated with the trait. The most common setup uses linear regression (for a continuous trait such as height or BMI) or logistic regression (for a binary trait such as case/control status).
Model (for a continuous trait):
\[Y_i = \beta_0 + \beta\, X_i + \varepsilon_i\]
where for individual \(i\):
Hypotheses:
Procedure:
lm() or glm()).In practice the regression also includes covariates (age, sex, the first few genetic principal components for ancestry, etc.) to avoid confounding. For binary traits the same idea applies but with logistic regression, in which \(\beta\) is interpreted as a log odds ratio.
A p-value threshold of \(0.05\) means we are willing to accept a 5% false-positive rate for a single test. In a GWAS, however, we run the test not once but once per SNP, and there are typically hundreds of thousands to millions of SNPs.
If we used \(\alpha = 0.05\) and tested, say, 1,000,000 independent SNPs that all truly had no effect (the global null), the expected number of false positives would be:
\[0.05 \times 1{,}000{,}000 = 50{,}000\]
That is, we would declare 50,000 SNPs “significantly” associated with the trait purely by chance — overwhelming any real signal. This is the classical multiple-testing problem.
To control the family-wise error rate (FWER) — the probability of making any false discovery across all tests — at the conventional level of 5%, the Bonferroni correction divides the per-test threshold by the number of tests. For \(\approx\) 1 million independent SNPs this gives the well-known genome-wide significance threshold:
\[\alpha_{\text{GWAS}} = \frac{0.05}{1{,}000{,}000} = 5 \times 10^{-8}\]
Only SNPs with \(p < 5 \times 10^{-8}\) are typically declared genome-wide significant in published GWAS. Using the naive \(p < 0.05\) would flood the result list with false positives and make every published GWAS unreliable.
The two corrections control different error quantities:
FWER is more conservative because it sets a much higher bar:
In short: FWER asks “are we 95% sure that none of these SNPs is a false positive?” while FDR asks “are we 95% sure that the majority (\(\geq 95\%\)) of our hits are real?“. The first question is harder to answer”yes” to, so fewer SNPs pass — that is what we mean by more conservative. The flip side is that FDR has higher statistical power to detect true positives, which is why it is often preferred in exploratory analyses where missing real signals is more costly than allowing a controlled fraction of false ones.
We use the GWAS summary statistics from the GIANT consortium meta-analysis of body mass index (BMI) published in:
Locke, A. E., Kahali, B., Berndt, S. I., et al. (2015). Genetic studies of body mass index yield new insights for obesity biology. Nature, 518(7538), 197–206. https://doi.org/10.1038/nature14177
The summary statistics are released publicly by the GIANT consortium and can be downloaded from:
http://portals.broadinstitute.org/collaboration/giant/images/1/15/SNP_gwas_mc_merge_nogc.tbl.uniq.gz
(The file is roughly 60 MB compressed and contains \(\approx\) 2.5 million SNPs.)
Reproducibility note. The environment used to author this notebook cannot reach the broadinstitute.org host. The code below first attempts to download the real Locke et al. file when the notebook is knitted in an environment with internet access; if that download fails (offline, firewalled, or in this sandbox) it falls back to a realistic simulated summary-statistics table calibrated to the published Locke et al. study (same column structure, \(\approx\) 1 million SNPs distributed across the 22 autosomes, \(\approx\) 99 truly associated loci embedded in a sea of null SNPs). The downstream analysis — histogram, FWER, FDR, significance calling, CSV export — is identical regardless of which path is taken, so the methodology and the conclusions about FWER vs FDR carry over to the real data.
In the original studies that fed into the GIANT meta-analysis, each contributing cohort held two kinds of individual-level data:
.bed/.bim/.fam, VCF, or BGEN. Rows = individuals, columns =
SNPs.Each cohort ran a linear-regression GWAS of
BMI ~ SNP + age + sex + PCs for every SNP, producing
per-cohort summary statistics. The GIANT consortium then
meta-analyzed those per-cohort summary statistics
across 339,224 individuals of European ancestry to
obtain the final, publicly released file used here.
Body Mass Index (BMI) — a continuous anthropometric trait computed as
\[\text{BMI} = \frac{\text{weight (kg)}}{\text{height (m)}^2}\]
BMI is a widely used proxy for adiposity and is strongly related to obesity and the downstream risk of type-2 diabetes, cardiovascular disease, and several cancers. The phenotype was inverse-normal transformed within each cohort to make it approximately normally distributed before regression.
The motivation for the Locke et al. study was to discover common genetic variants (SNPs with minor allele frequency \(\geq\) 1%) that influence BMI, with the ultimate goals of:
By the end of the analysis the authors identified 97 independent loci reaching genome-wide significance (\(p < 5\times 10^{-8}\)). The CSV table we produce in Part 3 reproduces the same kind of result on our data: the SNPs that survive Bonferroni and FDR correction.
set.seed(20260513)
We use only base R — no extra packages are required.
p.adjust() provides both the Bonferroni (FWER) and
Benjamini–Hochberg (FDR) corrections.
We first attempt to download the actual Locke et al. (2015) BMI summary statistics. If that fails (no internet, firewall, or sandbox) we fall back to a realistic simulation calibrated to that study.
GIANT_URL <- paste0("http://portals.broadinstitute.org/collaboration/giant/",
"images/1/15/SNP_gwas_mc_merge_nogc.tbl.uniq.gz")
LOCAL_FILE <- "bmi_gwas_locke2015.tsv.gz"
try_download_real_data <- function(url, local) {
if (file.exists(local)) return(TRUE)
tryCatch({
message("Attempting to download real GIANT BMI summary statistics ...")
download.file(url, local, mode = "wb", quiet = TRUE)
message("Downloaded to ", local)
TRUE
}, error = function(e) {
message("Download failed: ", conditionMessage(e))
FALSE
}, warning = function(w) {
message("Download failed: ", conditionMessage(w))
FALSE
})
}
simulate_bmi_gwas <- function(n_snps = 1e6, n_true_assoc = 99,
seed = 20260513) {
set.seed(seed)
chrom_lengths <- c(249, 243, 198, 191, 181, 171, 159, 146, 141, 136,
135, 134, 115, 107, 102, 90, 81, 78, 59, 63, 48, 51)
chrom_weights <- chrom_lengths / sum(chrom_lengths)
chrom_assignment <- sample(1:22, n_snps, replace = TRUE, prob = chrom_weights)
positions <- as.integer(runif(n_snps) *
chrom_lengths[chrom_assignment] * 1e6)
bases <- c("A", "C", "G", "T")
effect_allele <- sample(bases, n_snps, replace = TRUE)
complement <- c(A = "G", C = "T", G = "A", T = "C")
other_allele <- unname(complement[effect_allele])
maf <- rbeta(n_snps, 0.5, 2.0) * 0.5
se <- 0.005 / sqrt(pmax(maf * (1 - maf), 0.01) / 0.25)
beta <- rnorm(n_snps) * se
true_idx <- sample.int(n_snps, n_true_assoc)
z_scores <- runif(n_true_assoc, 5.5, 30.0)
signs <- sample(c(-1, 1), n_true_assoc, replace = TRUE)
beta[true_idx] <- signs * z_scores * se[true_idx]
z <- beta / se
pvals <- 2 * pnorm(-abs(z))
rsids <- paste0("rs", 1000000 + seq_len(n_snps) - 1)
df <- data.frame(
SNP = rsids,
CHR = as.integer(chrom_assignment),
BP = positions,
A1 = effect_allele,
A2 = other_allele,
EAF = round(maf, 4),
BETA = round(beta, 6),
SE = round(se, 6),
P = pvals,
N = 339224L,
stringsAsFactors = FALSE
)
df <- df[sample.int(n_snps), ]
rownames(df) <- NULL
df
}
standardize_columns <- function(df) {
names(df) <- toupper(names(df))
alias_map <- c(
"B" = "BETA",
"BETA_HAT" = "BETA",
"EFFECT" = "BETA",
"FREQ1.HAPMAP" = "EAF",
"FREQ1" = "EAF",
"FRQ" = "EAF",
"FREQ" = "EAF",
"MAF" = "EAF",
"POS" = "BP",
"POSITION" = "BP",
"CHROM" = "CHR",
"CHROMOSOME" = "CHR",
"PVAL" = "P",
"P_VALUE" = "P",
"P.VALUE" = "P"
)
for (old in names(alias_map)) {
if (old %in% names(df) && !(alias_map[[old]] %in% names(df))) {
names(df)[names(df) == old] <- alias_map[[old]]
}
}
df
}
got_real <- try_download_real_data(GIANT_URL, LOCAL_FILE)
if (got_real) {
message("Loading the real GIANT BMI summary statistics ...")
df <- read.table(gzfile(LOCAL_FILE), header = TRUE, sep = "",
stringsAsFactors = FALSE, check.names = FALSE)
df <- standardize_columns(df)
message("Loaded ", format(nrow(df), big.mark = ","), " SNPs from the real GIANT file.")
} else {
message("Falling back to a realistic simulated summary statistics table ",
"calibrated to Locke et al. (2015).")
df <- simulate_bmi_gwas()
df <- standardize_columns(df)
message("Generated ", format(nrow(df), big.mark = ","), " simulated SNPs.")
}
cat("Columns available:", paste(names(df), collapse = ", "), "\n\n")
## Columns available: SNP, A1, A2, EAF, BETA, SE, P, N
cat("First 5 rows:\n")
## First 5 rows:
head(df, 5)
## SNP A1 A2 EAF BETA SE P N
## 1 rs1000000 G A 0.6333 0.0001 0.0044 0.98190 231410
## 2 rs10000010 T C 0.5750 -0.0029 0.0030 0.33740 322079
## 3 rs10000012 G C 0.1917 -0.0095 0.0054 0.07853 233933
## 4 rs10000013 A C 0.8333 -0.0095 0.0044 0.03084 233886
## 5 rs10000017 C T 0.7667 -0.0034 0.0046 0.45980 233146
cat("Shape :", nrow(df), "rows x", ncol(df), "columns\n")
## Shape : 2554637 rows x 8 columns
cat("Columns :", paste(names(df), collapse = ", "), "\n\n")
## Columns : SNP, A1, A2, EAF, BETA, SE, P, N
cat("Summary statistics for the p-value column:\n")
## Summary statistics for the p-value column:
summary(df$P)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.232 0.487 0.488 0.745 1.000
Expectation. Under the global null (no SNP associated with the trait) p-values should be uniformly distributed on \([0,1]\), giving a flat histogram. In a real GWAS most SNPs are null, but a small fraction are truly associated and produce very small p-values. The expected shape is therefore:
A clearly U-shaped, sloped, or bimodal histogram would suggest problems (uncorrected population structure, genomic inflation, miscalibrated test statistics, …).
par(mfrow = c(1, 2), mar = c(4.5, 4.5, 2.5, 1.5))
hist(df$P, breaks = 50, col = "steelblue", border = "white",
xlab = "p-value", ylab = "Number of SNPs",
main = "Distribution of p-values")
abline(h = nrow(df) / 50, col = "red", lty = 2, lwd = 1.2)
legend("topright",
legend = sprintf("expected under uniform null = %s",
format(nrow(df) / 50, big.mark = ",")),
lty = 2, col = "red", bty = "n", cex = 0.85)
neglog10p <- -log10(pmax(df$P, 1e-300))
h <- hist(neglog10p, breaks = 80, plot = FALSE)
plot(h$mids, pmax(h$counts, 0.5), type = "h", lwd = 4,
log = "y", col = "darkorange",
xlab = expression(-log[10](p)),
ylab = "Number of SNPs (log scale)",
main = expression(-log[10](p) ~ "distribution (log y-axis)"))
abline(v = -log10(5e-8), col = "red", lty = 2, lwd = 1.2)
legend("topright",
legend = expression("genome-wide significance " ~ 5 %*% 10^-8),
lty = 2, col = "red", bty = "n", cex = 0.85)
P-value distribution. Left: uniform-null bulk on the natural scale. Right: -log10(p) on a log y-axis showing the significance tail.
cat("\nNumber of SNPs with p < 5e-8 :", sum(df$P < 5e-8), "\n")
##
## Number of SNPs with p < 5e-8 : 1860
cat("Number of SNPs with p < 1e-5 :", sum(df$P < 1e-5), "\n")
## Number of SNPs with p < 1e-5 : 4444
cat("Number of SNPs with p < 0.05:", format(sum(df$P < 0.05), big.mark = ","), "\n")
## Number of SNPs with p < 0.05: 181,872
Interpretation of the histogram.
The bulk of the distribution on the left panel is essentially flat, matching the uniform-null expectation (red dashed line shows the expected count per bin if every SNP were null). On top of that flat bulk there is a small excess in the first bin — the truly associated SNPs.
The right panel zooms into the tail by plotting \(-\log_{10}(p)\) on a log y-axis. The genome-wide significance threshold \(p = 5 \times 10^{-8}\) corresponds to \(-\log_{10}(p) \approx 7.30\) (red dashed line). The SNPs to the right of this line are the genome-wide significant ones we want the FWER/FDR procedures to recover.
R’s base function p.adjust() implements both methods
directly.
ALPHA <- 0.05
m <- nrow(df)
df$P_BONF <- p.adjust(df$P, method = "bonferroni")
df$SIG_FWER <- df$P_BONF < ALPHA
fwer_threshold <- ALPHA / m
df$P_FDR_BH <- p.adjust(df$P, method = "BH")
df$SIG_FDR <- df$P_FDR_BH < ALPHA
cat(sprintf("Number of SNPs tested (m) : %s\n", format(m, big.mark = ",")))
## Number of SNPs tested (m) : 2,554,637
cat(sprintf("Bonferroni per-test threshold (alpha/m) : %.3e\n", fwer_threshold))
## Bonferroni per-test threshold (alpha/m) : 1.957e-08
cat(sprintf("Conventional GWAS threshold (5e-8) : 5.000e-08\n\n"))
## Conventional GWAS threshold (5e-8) : 5.000e-08
cat(sprintf("SNPs passing FWER (Bonferroni, alpha=0.05): %s\n",
format(sum(df$SIG_FWER), big.mark = ",")))
## SNPs passing FWER (Bonferroni, alpha=0.05): 1,688
cat(sprintf("SNPs passing FDR (BH, alpha=0.05): %s\n",
format(sum(df$SIG_FDR), big.mark = ",")))
## SNPs passing FDR (BH, alpha=0.05): 10,595
Note that FDR always recovers at least as many SNPs as FWER (every Bonferroni-significant SNP is also BH-significant). The difference between the two counts directly illustrates the conservatism of FWER discussed in Q6.
We define significant in the strict, publication-ready sense — passing the Bonferroni / FWER correction — and report them sorted by p-value.
significant <- df[df$SIG_FWER, ]
significant <- significant[order(significant$P), ]
rownames(significant) <- NULL
cat("Number of genome-wide significant SNPs (FWER):", nrow(significant), "\n\n")
## Number of genome-wide significant SNPs (FWER): 1688
head(significant, 15)
## SNP A1 A2 EAF BETA SE P N P_BONF SIG_FWER
## 1 rs1558902 A T 0.4500 0.0818 0.0031 7.51e-153 320073 1.919e-146 TRUE
## 2 rs1421085 C T 0.4500 0.0813 0.0031 8.83e-151 318182 2.256e-144 TRUE
## 3 rs9937053 A G 0.4750 0.0797 0.0031 4.23e-145 321110 1.081e-138 TRUE
## 4 rs9936385 C T 0.4333 0.0795 0.0031 2.73e-144 321627 6.974e-138 TRUE
## 5 rs8050136 A C 0.4500 0.0780 0.0031 4.42e-144 321930 1.129e-137 TRUE
## 6 rs9935401 A G 0.4500 0.0780 0.0031 5.58e-144 321909 1.425e-137 TRUE
## 7 rs9939973 A G 0.4750 0.0792 0.0031 1.99e-143 321184 5.084e-137 TRUE
## 8 rs9940128 A G 0.4750 0.0792 0.0031 3.45e-143 320981 8.813e-137 TRUE
## 9 rs11075985 A C 0.4833 0.0790 0.0031 1.36e-142 321197 3.474e-136 TRUE
## 10 rs9923147 T C 0.4750 0.0790 0.0031 1.36e-142 321145 3.474e-136 TRUE
## 11 rs9923544 T C 0.4750 0.0790 0.0031 1.79e-142 321112 4.573e-136 TRUE
## 12 rs1121980 A G 0.4750 0.0790 0.0031 1.80e-142 321255 4.598e-136 TRUE
## 13 rs7193144 C T 0.4417 0.0784 0.0031 1.76e-140 321890 4.496e-134 TRUE
## 14 rs17817449 G T 0.4500 0.0781 0.0031 1.54e-139 322064 3.934e-133 TRUE
## 15 rs17817964 T C 0.4417 0.0779 0.0031 7.77e-139 321602 1.985e-132 TRUE
## P_FDR_BH SIG_FDR
## 1 1.919e-146 TRUE
## 2 1.128e-144 TRUE
## 3 3.602e-139 TRUE
## 4 1.744e-138 TRUE
## 5 2.258e-138 TRUE
## 6 2.376e-138 TRUE
## 7 7.262e-138 TRUE
## 8 1.102e-137 TRUE
## 9 3.474e-137 TRUE
## 10 3.474e-137 TRUE
## 11 3.832e-137 TRUE
## 12 3.832e-137 TRUE
## 13 3.459e-135 TRUE
## 14 2.810e-134 TRUE
## 15 1.180e-133 TRUE
wanted <- c("SNP", "CHR", "BP", "A1", "A2", "EAF", "BETA", "SE", "P",
"P_BONF", "P_FDR_BH")
display_cols <- intersect(wanted, names(significant))
top_view <- head(significant[, display_cols, drop = FALSE], 20)
top_view$neglog10P <- -log10(pmax(top_view$P, 1e-300))
top_view
## SNP A1 A2 EAF BETA SE P P_BONF P_FDR_BH
## 1 rs1558902 A T 0.4500 0.0818 0.0031 7.51e-153 1.919e-146 1.919e-146
## 2 rs1421085 C T 0.4500 0.0813 0.0031 8.83e-151 2.256e-144 1.128e-144
## 3 rs9937053 A G 0.4750 0.0797 0.0031 4.23e-145 1.081e-138 3.602e-139
## 4 rs9936385 C T 0.4333 0.0795 0.0031 2.73e-144 6.974e-138 1.744e-138
## 5 rs8050136 A C 0.4500 0.0780 0.0031 4.42e-144 1.129e-137 2.258e-138
## 6 rs9935401 A G 0.4500 0.0780 0.0031 5.58e-144 1.425e-137 2.376e-138
## 7 rs9939973 A G 0.4750 0.0792 0.0031 1.99e-143 5.084e-137 7.262e-138
## 8 rs9940128 A G 0.4750 0.0792 0.0031 3.45e-143 8.813e-137 1.102e-137
## 9 rs11075985 A C 0.4833 0.0790 0.0031 1.36e-142 3.474e-136 3.474e-137
## 10 rs9923147 T C 0.4750 0.0790 0.0031 1.36e-142 3.474e-136 3.474e-137
## 11 rs9923544 T C 0.4750 0.0790 0.0031 1.79e-142 4.573e-136 3.832e-137
## 12 rs1121980 A G 0.4750 0.0790 0.0031 1.80e-142 4.598e-136 3.832e-137
## 13 rs7193144 C T 0.4417 0.0784 0.0031 1.76e-140 4.496e-134 3.459e-135
## 14 rs17817449 G T 0.4500 0.0781 0.0031 1.54e-139 3.934e-133 2.810e-134
## 15 rs17817964 T C 0.4417 0.0779 0.0031 7.77e-139 1.985e-132 1.180e-133
## 16 rs7185735 G A 0.4417 0.0779 0.0031 7.77e-139 1.985e-132 1.180e-133
## 17 rs8051591 G A 0.4500 0.0779 0.0031 7.85e-139 2.005e-132 1.180e-133
## 18 rs11075989 T C 0.4500 0.0778 0.0031 2.29e-138 5.850e-132 3.079e-133
## 19 rs11075990 G A 0.4500 0.0778 0.0031 2.29e-138 5.850e-132 3.079e-133
## 20 rs9923233 C G 0.4583 0.0776 0.0031 8.83e-138 2.256e-131 1.128e-132
## neglog10P
## 1 152.1
## 2 150.1
## 3 144.4
## 4 143.6
## 5 143.4
## 6 143.3
## 7 142.7
## 8 142.5
## 9 141.9
## 10 141.9
## 11 141.7
## 12 141.7
## 13 139.8
## 14 138.8
## 15 138.1
## 16 138.1
## 17 138.1
## 18 137.6
## 19 137.6
## 20 137.1
output_path <- "significant_snps.csv"
write.csv(significant, output_path, row.names = FALSE)
cat(sprintf("Saved %d significant SNPs to: %s\n\n",
nrow(significant), output_path))
## Saved 1688 significant SNPs to: significant_snps.csv
cat("File preview (first 5 lines):\n")
## File preview (first 5 lines):
cat(paste(rep("-", 70), collapse = ""), "\n")
## ----------------------------------------------------------------------
preview <- readLines(output_path, n = 5)
for (line in preview) cat(line, "\n")
## "SNP","A1","A2","EAF","BETA","SE","P","N","P_BONF","SIG_FWER","P_FDR_BH","SIG_FDR"
## "rs1558902","A","T",0.45,0.0818,0.0031,7.51e-153,320073,1.918532387e-146,TRUE,1.918532387e-146,TRUE
## "rs1421085","C","T",0.45,0.0813,0.0031,8.83e-151,318182,2.255744471e-144,TRUE,1.1278722355e-144,TRUE
## "rs9937053","A","G",0.475,0.0797,0.0031,4.23e-145,321110,1.080611451e-138,TRUE,3.60203817e-139,TRUE
## "rs9936385","C","T",0.4333,0.0795,0.0031,2.73e-144,321627,6.97415901e-138,TRUE,1.7435397525e-138,TRUE