GWAS Data Analysis Assignment

This notebook completes the assignment described in the lecture Introduction to GWAS Data Set. It is organized in three parts:

Conceptual questions — written answers in our own words.
Dataset description — source, structure, traits, and motivation.
Analysis — load the summary statistics, plot the p-value distribution, apply FWER (Bonferroni) and FDR (Benjamini–Hochberg) corrections, identify genome-wide significant SNPs, and export them to a CSV table.

Part 1 — Conceptual Questions

Q1. What is the GWAS data set about? How is it useful?

A Genome-Wide Association Study (GWAS) data set contains genetic information collected from a large group of people together with one or more measured traits (for example height, BMI, or a disease status such as type-2 diabetes). For every individual the data set records the genotype — typically the alleles carried at hundreds of thousands to millions of Single Nucleotide Polymorphism (SNP) positions across the whole genome — and the phenotype — the trait value being studied.

A summary statistics version of a GWAS data set does not contain individual-level genotypes (which are usually protected for privacy reasons). Instead, for every SNP that was tested it reports the result of the association test: the chromosome and position, the effect allele, the estimated effect size (\(\beta\)), the standard error, and the p-value.

GWAS data sets are useful because they let researchers:

Discover genetic risk factors for common, complex diseases (cardiovascular disease, schizophrenia, diabetes, cancers, etc.) without any prior biological hypothesis about which gene is involved.
Build polygenic risk scores (PRS) that combine the effects of many SNPs into a single number that can be used to estimate an individual’s genetic predisposition to a trait or disease.
Identify drug targets — once a gene whose variation affects a disease is found, that gene’s protein is a candidate for therapeutic intervention.
Compare populations and study the genetic architecture of traits (how many variants contribute, of what effect sizes, and how heritability is distributed across the genome).
Inform personalized medicine — predicting which patients are at higher risk or who will respond to a specific drug.

Q2. Explain the relationships between genome, chromosome, gene, and nucleotide.

These four terms form a nested hierarchy, from the smallest unit of DNA up to the entire genetic material of an organism:

Nucleotide — the smallest building block. Each nucleotide consists of a sugar (deoxyribose), a phosphate group, and one of four chemical bases: Adenine (A), Thymine (T), Cytosine (C), or Guanine (G). Nucleotides are joined in long chains; the two complementary chains together form the DNA double helix, with A always pairing with T and C with G.
Gene — a contiguous stretch of nucleotides (typically thousands of bases long) that codes for a functional product, almost always a protein (or a regulatory RNA). The human genome has roughly 20,000–25,000 protein-coding genes. A gene is therefore a functional sub-sequence of the DNA chain.
Chromosome — a single, very long, tightly packaged DNA molecule that contains many genes plus a great deal of non-coding sequence. Humans have 23 pairs of chromosomes (22 autosomal pairs plus the sex chromosomes X/Y), so 46 chromosomes in total per cell. Each chromosome may carry hundreds to a few thousand genes.
Genome — the complete set of DNA in an organism, i.e. the sum of all the chromosomes. The human genome contains approximately 3.2 billion base pairs.

In one sentence: nucleotides are the letters of the alphabet, genes are meaningful words, chromosomes are very long books, and the genome is the entire library.

Q3. Explain what an allele and a SNP are.

A SNP (Single Nucleotide Polymorphism) is a position in the genome where the DNA sequence differs between people by exactly one nucleotide. For example, at a particular position on chromosome 1 most individuals may have an A, while a sizeable fraction of the population has a G at the very same position. That single-base variation is one SNP, and it is usually given an identifier such as rs12345.

An allele is one of the alternative versions that can occur at a variable position. A typical SNP has two alleles — the more common one is called the major (or reference) allele, and the less common one the minor allele. The frequency of the minor allele in a population is the Minor Allele Frequency (MAF).

Because humans are diploid (we inherit one chromosome from each parent), every person carries two alleles at every SNP. Using the A/G example, an individual can be:

Homozygous reference: AA — both copies are A.
Heterozygous: AG — one A, one G.
Homozygous alternative: GG — both copies are G.

In GWAS the genotype is usually encoded as the number of copies of the effect allele, i.e. 0, 1, or 2. This count is what gets used as the predictor variable when testing the SNP’s association with a trait.

Q4. Explain how to perform a hypothesis test to see whether a SNP is associated with a trait.

For each SNP we test whether the genotype at that locus is statistically associated with the trait. The most common setup uses linear regression (for a continuous trait such as height or BMI) or logistic regression (for a binary trait such as case/control status).

Model (for a continuous trait):

\[Y_i = \beta_0 + \beta\, X_i + \varepsilon_i\]

where for individual \(i\):

\(Y_i\) is the trait value (e.g. height in cm),
\(X_i \in \{0, 1, 2\}\) is the number of copies of the effect allele,
\(\beta_0\) is the intercept (the average trait value for \(X_i = 0\)),
\(\beta\) is the effect size of the SNP — the change in trait value per additional copy of the effect allele,
\(\varepsilon_i\) is the residual error.

Hypotheses:

Null hypothesis \(H_0\): \(\beta = 0\) — the SNP has no effect on the trait.
Alternative hypothesis \(H_A\): \(\beta \neq 0\) — the SNP does affect the trait.

Procedure:

Estimate \(\hat\beta\) and its standard error \(\mathrm{SE}(\hat\beta)\) by ordinary least squares (e.g. R’s lm() or glm()).
Compute the test statistic \(t = \hat\beta / \mathrm{SE}(\hat\beta)\), which is approximately standard normal (or Student’s t) under \(H_0\).
Compute the two-sided p-value = \(\Pr(|Z| \geq |t|)\) under \(H_0\).
Reject \(H_0\) if the p-value is below the chosen significance threshold (see Q5: the threshold is not 0.05 in GWAS).
Repeat the entire procedure independently for every SNP in the data set, producing one p-value per SNP. The final output — one row per SNP with chromosome, position, effect allele, \(\beta\), SE, and p-value — is the summary statistics file.

In practice the regression also includes covariates (age, sex, the first few genetic principal components for ancestry, etc.) to avoid confounding. For binary traits the same idea applies but with logistic regression, in which \(\beta\) is interpreted as a log odds ratio.

Q5. Explain why \(p = 0.05\) is not appropriate in analyzing GWAS data.

A p-value threshold of \(0.05\) means we are willing to accept a 5% false-positive rate for a single test. In a GWAS, however, we run the test not once but once per SNP, and there are typically hundreds of thousands to millions of SNPs.

If we used \(\alpha = 0.05\) and tested, say, 1,000,000 independent SNPs that all truly had no effect (the global null), the expected number of false positives would be:

\[0.05 \times 1{,}000{,}000 = 50{,}000\]

That is, we would declare 50,000 SNPs “significantly” associated with the trait purely by chance — overwhelming any real signal. This is the classical multiple-testing problem.

To control the family-wise error rate (FWER) — the probability of making any false discovery across all tests — at the conventional level of 5%, the Bonferroni correction divides the per-test threshold by the number of tests. For \(\approx\) 1 million independent SNPs this gives the well-known genome-wide significance threshold:

\[\alpha_{\text{GWAS}} = \frac{0.05}{1{,}000{,}000} = 5 \times 10^{-8}\]

Only SNPs with \(p < 5 \times 10^{-8}\) are typically declared genome-wide significant in published GWAS. Using the naive \(p < 0.05\) would flood the result list with false positives and make every published GWAS unreliable.

Q6. Explain why FWER is more conservative than the FDR method.

The two corrections control different error quantities:

FWER (Family-Wise Error Rate) — the probability of making at least one false discovery among all the tests performed. Bonferroni and Holm are FWER-controlling procedures. \[\text{FWER} = \Pr(\text{at least one false positive among all tests})\]
FDR (False Discovery Rate) — the expected proportion of false discoveries among the SNPs we declare significant. Benjamini–Hochberg (BH) is the standard FDR-controlling procedure. \[\text{FDR} = \mathbb{E}\!\left[\frac{\text{number of false positives}}{\text{total number of rejections}}\right]\]

FWER is more conservative because it sets a much higher bar:

FWER guards against even one mistake. If we test 1,000,000 SNPs and want FWER \(\leq 0.05\), Bonferroni requires every individual p-value to beat \(5 \times 10^{-8}\). A SNP with \(p = 10^{-6}\) — extremely unusual under the null — would not be declared significant.
FDR tolerates a controlled fraction of mistakes. With FDR \(\leq 0.05\) we accept that up to 5% of our “discoveries” may be false, as long as the rest are real. If 200 SNPs survive the FDR filter, we expect about 10 of them to be false positives — but the other \(\approx\) 190 should be genuine. The BH cutoff therefore lies between the per-test threshold (0.05) and the Bonferroni threshold (\(5 \times 10^{-8}\)), so more SNPs pass it.
Bonferroni assumes the worst case of perfect dependence (and is exact under independence). It therefore over-corrects when many SNPs are truly associated, or when SNPs are correlated (e.g. via linkage disequilibrium), inflating false negatives.

In short: FWER asks “are we 95% sure that none of these SNPs is a false positive?” while FDR asks “are we 95% sure that the majority (\(\geq 95\%\)) of our hits are real?“. The first question is harder to answer”yes” to, so fewer SNPs pass — that is what we mean by more conservative. The flip side is that FDR has higher statistical power to detect true positives, which is why it is often preferred in exploratory analyses where missing real signals is more costly than allowing a controlled fraction of false ones.

Part 2 — Dataset Description

Source

We use the GWAS summary statistics from the GIANT consortium meta-analysis of body mass index (BMI) published in:

Locke, A. E., Kahali, B., Berndt, S. I., et al. (2015). Genetic studies of body mass index yield new insights for obesity biology. Nature, 518(7538), 197–206. https://doi.org/10.1038/nature14177

The summary statistics are released publicly by the GIANT consortium and can be downloaded from:

http://portals.broadinstitute.org/collaboration/giant/images/1/15/SNP_gwas_mc_merge_nogc.tbl.uniq.gz

(The file is roughly 60 MB compressed and contains \(\approx\) 2.5 million SNPs.)

Reproducibility note. The environment used to author this notebook cannot reach the broadinstitute.org host. The code below first attempts to download the real Locke et al. file when the notebook is knitted in an environment with internet access; if that download fails (offline, firewalled, or in this sandbox) it falls back to a realistic simulated summary-statistics table calibrated to the published Locke et al. study (same column structure, \(\approx\) 1 million SNPs distributed across the 22 autosomes, \(\approx\) 99 truly associated loci embedded in a sea of null SNPs). The downstream analysis — histogram, FWER, FDR, significance calling, CSV export — is identical regardless of which path is taken, so the methodology and the conclusions about FWER vs FDR carry over to the real data.

How the original genotype and phenotype data look

In the original studies that fed into the GIANT meta-analysis, each contributing cohort held two kinds of individual-level data:

Genotype data — for every participant, the alleles at each of several million SNPs, encoded as 0/1/2 (number of copies of the effect allele). Stored in formats such as PLINK .bed/.bim/.fam, VCF, or BGEN. Rows = individuals, columns = SNPs.
Phenotype data — for every participant, age, sex, height, weight, and the derived BMI (= weight in kg / (height in m)\(^2\)), plus the first \(\sim\) 10 genetic principal components used as covariates. Stored as a plain text table, one row per participant.

Each cohort ran a linear-regression GWAS of BMI ~ SNP + age + sex + PCs for every SNP, producing per-cohort summary statistics. The GIANT consortium then meta-analyzed those per-cohort summary statistics across 339,224 individuals of European ancestry to obtain the final, publicly released file used here.

Trait(s) under study

Body Mass Index (BMI) — a continuous anthropometric trait computed as

\[\text{BMI} = \frac{\text{weight (kg)}}{\text{height (m)}^2}\]

BMI is a widely used proxy for adiposity and is strongly related to obesity and the downstream risk of type-2 diabetes, cardiovascular disease, and several cancers. The phenotype was inverse-normal transformed within each cohort to make it approximately normally distributed before regression.

Purpose and motivation

The motivation for the Locke et al. study was to discover common genetic variants (SNPs with minor allele frequency \(\geq\) 1%) that influence BMI, with the ultimate goals of:

Mapping the biology of obesity — pinpointing genes and pathways (neuronal, hypothalamic, lipid-handling) that contribute to body weight regulation.
Increasing statistical power through meta-analysis — the largest single cohort at the time was too small to detect SNPs of realistic small effect (per-allele \(\beta\) on the order of 0.02–0.05 standard deviations of BMI), so combining 80+ studies was essential.
Enabling polygenic prediction — providing effect-size estimates that can be summed into a polygenic risk score for BMI.
Informing therapeutic targets — variants near genes like FTO, MC4R, and TMEM18 nominate well-defined molecular targets for obesity research.

By the end of the analysis the authors identified 97 independent loci reaching genome-wide significance (\(p < 5\times 10^{-8}\)). The CSV table we produce in Part 3 reproduces the same kind of result on our data: the SNPs that survive Bonferroni and FDR correction.

Part 3 — Analysis

3.1 Setup

set.seed(20260513)

We use only base R — no extra packages are required. p.adjust() provides both the Bonferroni (FWER) and Benjamini–Hochberg (FDR) corrections.

3.2 Load the summary statistics into a data frame

We first attempt to download the actual Locke et al. (2015) BMI summary statistics. If that fails (no internet, firewall, or sandbox) we fall back to a realistic simulation calibrated to that study.

GIANT_URL  <- paste0("http://portals.broadinstitute.org/collaboration/giant/",
                     "images/1/15/SNP_gwas_mc_merge_nogc.tbl.uniq.gz")
LOCAL_FILE <- "bmi_gwas_locke2015.tsv.gz"

try_download_real_data <- function(url, local) {
  if (file.exists(local)) return(TRUE)
  tryCatch({
    message("Attempting to download real GIANT BMI summary statistics ...")
    download.file(url, local, mode = "wb", quiet = TRUE)
    message("Downloaded to ", local)
    TRUE
  }, error = function(e) {
    message("Download failed: ", conditionMessage(e))
    FALSE
  }, warning = function(w) {
    message("Download failed: ", conditionMessage(w))
    FALSE
  })
}

simulate_bmi_gwas <- function(n_snps = 1e6, n_true_assoc = 99,
                              seed = 20260513) {
  set.seed(seed)
  chrom_lengths <- c(249, 243, 198, 191, 181, 171, 159, 146, 141, 136,
                     135, 134, 115, 107, 102,  90,  81,  78,  59,  63,  48,  51)
  chrom_weights    <- chrom_lengths / sum(chrom_lengths)
  chrom_assignment <- sample(1:22, n_snps, replace = TRUE, prob = chrom_weights)
  positions <- as.integer(runif(n_snps) *
                          chrom_lengths[chrom_assignment] * 1e6)

  bases         <- c("A", "C", "G", "T")
  effect_allele <- sample(bases, n_snps, replace = TRUE)
  complement    <- c(A = "G", C = "T", G = "A", T = "C")
  other_allele  <- unname(complement[effect_allele])

  maf <- rbeta(n_snps, 0.5, 2.0) * 0.5

  se <- 0.005 / sqrt(pmax(maf * (1 - maf), 0.01) / 0.25)

  beta <- rnorm(n_snps) * se


  true_idx <- sample.int(n_snps, n_true_assoc)
  z_scores <- runif(n_true_assoc, 5.5, 30.0)
  signs    <- sample(c(-1, 1), n_true_assoc, replace = TRUE)
  beta[true_idx] <- signs * z_scores * se[true_idx]


  z     <- beta / se
  pvals <- 2 * pnorm(-abs(z))

  rsids <- paste0("rs", 1000000 + seq_len(n_snps) - 1)

  df <- data.frame(
    SNP  = rsids,
    CHR  = as.integer(chrom_assignment),
    BP   = positions,
    A1   = effect_allele,
    A2   = other_allele,
    EAF  = round(maf, 4),
    BETA = round(beta, 6),
    SE   = round(se, 6),
    P    = pvals,
    N    = 339224L,
    stringsAsFactors = FALSE
  )

  df <- df[sample.int(n_snps), ]
  rownames(df) <- NULL
  df
}

standardize_columns <- function(df) {
  names(df) <- toupper(names(df))
  alias_map <- c(
    "B"            = "BETA",     
    "BETA_HAT"     = "BETA",
    "EFFECT"       = "BETA",
    "FREQ1.HAPMAP" = "EAF",       
    "FREQ1"        = "EAF",
    "FRQ"          = "EAF",
    "FREQ"         = "EAF",
    "MAF"          = "EAF",
    "POS"          = "BP",
    "POSITION"     = "BP",
    "CHROM"        = "CHR",
    "CHROMOSOME"   = "CHR",
    "PVAL"         = "P",
    "P_VALUE"      = "P",
    "P.VALUE"      = "P"
  )
  for (old in names(alias_map)) {
    if (old %in% names(df) && !(alias_map[[old]] %in% names(df))) {
      names(df)[names(df) == old] <- alias_map[[old]]
    }
  }
  df
}

got_real <- try_download_real_data(GIANT_URL, LOCAL_FILE)

if (got_real) {
  message("Loading the real GIANT BMI summary statistics ...")
  df <- read.table(gzfile(LOCAL_FILE), header = TRUE, sep = "",
                   stringsAsFactors = FALSE, check.names = FALSE)
  df <- standardize_columns(df)
  message("Loaded ", format(nrow(df), big.mark = ","), " SNPs from the real GIANT file.")
} else {
  message("Falling back to a realistic simulated summary statistics table ",
          "calibrated to Locke et al. (2015).")
  df <- simulate_bmi_gwas()
  df <- standardize_columns(df)
  message("Generated ", format(nrow(df), big.mark = ","), " simulated SNPs.")
}

cat("Columns available:", paste(names(df), collapse = ", "), "\n\n")

## Columns available: SNP, A1, A2, EAF, BETA, SE, P, N

cat("First 5 rows:\n")

## First 5 rows:

head(df, 5)

##          SNP A1 A2    EAF    BETA     SE       P      N
## 1  rs1000000  G  A 0.6333  0.0001 0.0044 0.98190 231410
## 2 rs10000010  T  C 0.5750 -0.0029 0.0030 0.33740 322079
## 3 rs10000012  G  C 0.1917 -0.0095 0.0054 0.07853 233933
## 4 rs10000013  A  C 0.8333 -0.0095 0.0044 0.03084 233886
## 5 rs10000017  C  T 0.7667 -0.0034 0.0046 0.45980 233146

cat("Shape   :", nrow(df), "rows x", ncol(df), "columns\n")

## Shape   : 2554637 rows x 8 columns

cat("Columns :", paste(names(df), collapse = ", "), "\n\n")

## Columns : SNP, A1, A2, EAF, BETA, SE, P, N

cat("Summary statistics for the p-value column:\n")

## Summary statistics for the p-value column:

summary(df$P)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.232   0.487   0.488   0.745   1.000

3.3 Histogram of p-values

Expectation. Under the global null (no SNP associated with the trait) p-values should be uniformly distributed on \([0,1]\), giving a flat histogram. In a real GWAS most SNPs are null, but a small fraction are truly associated and produce very small p-values. The expected shape is therefore:

An almost-flat bulk across \([0, 1]\) from the null SNPs, plus
A small spike near 0 contributed by the genuinely associated SNPs.

A clearly U-shaped, sloped, or bimodal histogram would suggest problems (uncorrected population structure, genomic inflation, miscalibrated test statistics, …).

par(mfrow = c(1, 2), mar = c(4.5, 4.5, 2.5, 1.5))

hist(df$P, breaks = 50, col = "steelblue", border = "white",
     xlab = "p-value", ylab = "Number of SNPs",
     main = "Distribution of p-values")
abline(h = nrow(df) / 50, col = "red", lty = 2, lwd = 1.2)
legend("topright",
       legend = sprintf("expected under uniform null = %s",
                        format(nrow(df) / 50, big.mark = ",")),
       lty = 2, col = "red", bty = "n", cex = 0.85)

neglog10p <- -log10(pmax(df$P, 1e-300))
h <- hist(neglog10p, breaks = 80, plot = FALSE)
plot(h$mids, pmax(h$counts, 0.5), type = "h", lwd = 4,
     log = "y", col = "darkorange",
     xlab = expression(-log[10](p)),
     ylab = "Number of SNPs (log scale)",
     main = expression(-log[10](p) ~ "distribution (log y-axis)"))
abline(v = -log10(5e-8), col = "red", lty = 2, lwd = 1.2)
legend("topright",
       legend = expression("genome-wide significance " ~ 5 %*% 10^-8),
       lty = 2, col = "red", bty = "n", cex = 0.85)

P-value distribution. Left: uniform-null bulk on the natural scale. Right: -log10(p) on a log y-axis showing the significance tail.

cat("\nNumber of SNPs with p < 5e-8 :", sum(df$P < 5e-8),  "\n")

## 
## Number of SNPs with p < 5e-8 : 1860

cat("Number of SNPs with p < 1e-5 :", sum(df$P < 1e-5),  "\n")

## Number of SNPs with p < 1e-5 : 4444

cat("Number of SNPs with p < 0.05:", format(sum(df$P < 0.05), big.mark = ","), "\n")

## Number of SNPs with p < 0.05: 181,872

Interpretation of the histogram.

The bulk of the distribution on the left panel is essentially flat, matching the uniform-null expectation (red dashed line shows the expected count per bin if every SNP were null). On top of that flat bulk there is a small excess in the first bin — the truly associated SNPs.

The right panel zooms into the tail by plotting \(-\log_{10}(p)\) on a log y-axis. The genome-wide significance threshold \(p = 5 \times 10^{-8}\) corresponds to \(-\log_{10}(p) \approx 7.30\) (red dashed line). The SNPs to the right of this line are the genome-wide significant ones we want the FWER/FDR procedures to recover.

3.4 Apply FWER (Bonferroni) and FDR (Benjamini–Hochberg)

FWER / Bonferroni: reject when \(p_i < \alpha / m\), where \(m\) is the number of tests and \(\alpha = 0.05\).
FDR / Benjamini–Hochberg: sort p-values \(p_{(1)} \le p_{(2)} \le \dots \le p_{(m)}\); find the largest \(k\) such that \(p_{(k)} \le \frac{k}{m}\alpha\); reject \(H_0\) for all SNPs ranked \(\le k\).

R’s base function p.adjust() implements both methods directly.

ALPHA <- 0.05
m     <- nrow(df)

df$P_BONF   <- p.adjust(df$P, method = "bonferroni")  
df$SIG_FWER <- df$P_BONF < ALPHA                   
fwer_threshold <- ALPHA / m                       

df$P_FDR_BH <- p.adjust(df$P, method = "BH")
df$SIG_FDR  <- df$P_FDR_BH < ALPHA

cat(sprintf("Number of SNPs tested (m)               : %s\n",  format(m, big.mark = ",")))

## Number of SNPs tested (m)               : 2,554,637

cat(sprintf("Bonferroni per-test threshold (alpha/m) : %.3e\n", fwer_threshold))

## Bonferroni per-test threshold (alpha/m) : 1.957e-08

cat(sprintf("Conventional GWAS threshold (5e-8)      : 5.000e-08\n\n"))

## Conventional GWAS threshold (5e-8)      : 5.000e-08

cat(sprintf("SNPs passing FWER  (Bonferroni, alpha=0.05): %s\n",
            format(sum(df$SIG_FWER), big.mark = ",")))

## SNPs passing FWER  (Bonferroni, alpha=0.05): 1,688

cat(sprintf("SNPs passing FDR   (BH,         alpha=0.05): %s\n",
            format(sum(df$SIG_FDR),  big.mark = ",")))

## SNPs passing FDR   (BH,         alpha=0.05): 10,595

Note that FDR always recovers at least as many SNPs as FWER (every Bonferroni-significant SNP is also BH-significant). The difference between the two counts directly illustrates the conservatism of FWER discussed in Q6.

3.5 SNPs significantly associated with the trait

We define significant in the strict, publication-ready sense — passing the Bonferroni / FWER correction — and report them sorted by p-value.

significant <- df[df$SIG_FWER, ]
significant <- significant[order(significant$P), ]
rownames(significant) <- NULL

cat("Number of genome-wide significant SNPs (FWER):", nrow(significant), "\n\n")

## Number of genome-wide significant SNPs (FWER): 1688

head(significant, 15)

##           SNP A1 A2    EAF   BETA     SE         P      N     P_BONF SIG_FWER
## 1   rs1558902  A  T 0.4500 0.0818 0.0031 7.51e-153 320073 1.919e-146     TRUE
## 2   rs1421085  C  T 0.4500 0.0813 0.0031 8.83e-151 318182 2.256e-144     TRUE
## 3   rs9937053  A  G 0.4750 0.0797 0.0031 4.23e-145 321110 1.081e-138     TRUE
## 4   rs9936385  C  T 0.4333 0.0795 0.0031 2.73e-144 321627 6.974e-138     TRUE
## 5   rs8050136  A  C 0.4500 0.0780 0.0031 4.42e-144 321930 1.129e-137     TRUE
## 6   rs9935401  A  G 0.4500 0.0780 0.0031 5.58e-144 321909 1.425e-137     TRUE
## 7   rs9939973  A  G 0.4750 0.0792 0.0031 1.99e-143 321184 5.084e-137     TRUE
## 8   rs9940128  A  G 0.4750 0.0792 0.0031 3.45e-143 320981 8.813e-137     TRUE
## 9  rs11075985  A  C 0.4833 0.0790 0.0031 1.36e-142 321197 3.474e-136     TRUE
## 10  rs9923147  T  C 0.4750 0.0790 0.0031 1.36e-142 321145 3.474e-136     TRUE
## 11  rs9923544  T  C 0.4750 0.0790 0.0031 1.79e-142 321112 4.573e-136     TRUE
## 12  rs1121980  A  G 0.4750 0.0790 0.0031 1.80e-142 321255 4.598e-136     TRUE
## 13  rs7193144  C  T 0.4417 0.0784 0.0031 1.76e-140 321890 4.496e-134     TRUE
## 14 rs17817449  G  T 0.4500 0.0781 0.0031 1.54e-139 322064 3.934e-133     TRUE
## 15 rs17817964  T  C 0.4417 0.0779 0.0031 7.77e-139 321602 1.985e-132     TRUE
##      P_FDR_BH SIG_FDR
## 1  1.919e-146    TRUE
## 2  1.128e-144    TRUE
## 3  3.602e-139    TRUE
## 4  1.744e-138    TRUE
## 5  2.258e-138    TRUE
## 6  2.376e-138    TRUE
## 7  7.262e-138    TRUE
## 8  1.102e-137    TRUE
## 9  3.474e-137    TRUE
## 10 3.474e-137    TRUE
## 11 3.832e-137    TRUE
## 12 3.832e-137    TRUE
## 13 3.459e-135    TRUE
## 14 2.810e-134    TRUE
## 15 1.180e-133    TRUE

wanted <- c("SNP", "CHR", "BP", "A1", "A2", "EAF", "BETA", "SE", "P",
            "P_BONF", "P_FDR_BH")
display_cols <- intersect(wanted, names(significant))

top_view <- head(significant[, display_cols, drop = FALSE], 20)
top_view$neglog10P <- -log10(pmax(top_view$P, 1e-300))
top_view

##           SNP A1 A2    EAF   BETA     SE         P     P_BONF   P_FDR_BH
## 1   rs1558902  A  T 0.4500 0.0818 0.0031 7.51e-153 1.919e-146 1.919e-146
## 2   rs1421085  C  T 0.4500 0.0813 0.0031 8.83e-151 2.256e-144 1.128e-144
## 3   rs9937053  A  G 0.4750 0.0797 0.0031 4.23e-145 1.081e-138 3.602e-139
## 4   rs9936385  C  T 0.4333 0.0795 0.0031 2.73e-144 6.974e-138 1.744e-138
## 5   rs8050136  A  C 0.4500 0.0780 0.0031 4.42e-144 1.129e-137 2.258e-138
## 6   rs9935401  A  G 0.4500 0.0780 0.0031 5.58e-144 1.425e-137 2.376e-138
## 7   rs9939973  A  G 0.4750 0.0792 0.0031 1.99e-143 5.084e-137 7.262e-138
## 8   rs9940128  A  G 0.4750 0.0792 0.0031 3.45e-143 8.813e-137 1.102e-137
## 9  rs11075985  A  C 0.4833 0.0790 0.0031 1.36e-142 3.474e-136 3.474e-137
## 10  rs9923147  T  C 0.4750 0.0790 0.0031 1.36e-142 3.474e-136 3.474e-137
## 11  rs9923544  T  C 0.4750 0.0790 0.0031 1.79e-142 4.573e-136 3.832e-137
## 12  rs1121980  A  G 0.4750 0.0790 0.0031 1.80e-142 4.598e-136 3.832e-137
## 13  rs7193144  C  T 0.4417 0.0784 0.0031 1.76e-140 4.496e-134 3.459e-135
## 14 rs17817449  G  T 0.4500 0.0781 0.0031 1.54e-139 3.934e-133 2.810e-134
## 15 rs17817964  T  C 0.4417 0.0779 0.0031 7.77e-139 1.985e-132 1.180e-133
## 16  rs7185735  G  A 0.4417 0.0779 0.0031 7.77e-139 1.985e-132 1.180e-133
## 17  rs8051591  G  A 0.4500 0.0779 0.0031 7.85e-139 2.005e-132 1.180e-133
## 18 rs11075989  T  C 0.4500 0.0778 0.0031 2.29e-138 5.850e-132 3.079e-133
## 19 rs11075990  G  A 0.4500 0.0778 0.0031 2.29e-138 5.850e-132 3.079e-133
## 20  rs9923233  C  G 0.4583 0.0776 0.0031 8.83e-138 2.256e-131 1.128e-132
##    neglog10P
## 1      152.1
## 2      150.1
## 3      144.4
## 4      143.6
## 5      143.4
## 6      143.3
## 7      142.7
## 8      142.5
## 9      141.9
## 10     141.9
## 11     141.7
## 12     141.7
## 13     139.8
## 14     138.8
## 15     138.1
## 16     138.1
## 17     138.1
## 18     137.6
## 19     137.6
## 20     137.1

3.6 Save the significant SNPs into a CSV result table

output_path <- "significant_snps.csv"
write.csv(significant, output_path, row.names = FALSE)

cat(sprintf("Saved %d significant SNPs to: %s\n\n",
            nrow(significant), output_path))

## Saved 1688 significant SNPs to: significant_snps.csv

cat("File preview (first 5 lines):\n")

## File preview (first 5 lines):

cat(paste(rep("-", 70), collapse = ""), "\n")

## ----------------------------------------------------------------------

preview <- readLines(output_path, n = 5)
for (line in preview) cat(line, "\n")

## "SNP","A1","A2","EAF","BETA","SE","P","N","P_BONF","SIG_FWER","P_FDR_BH","SIG_FDR" 
## "rs1558902","A","T",0.45,0.0818,0.0031,7.51e-153,320073,1.918532387e-146,TRUE,1.918532387e-146,TRUE 
## "rs1421085","C","T",0.45,0.0813,0.0031,8.83e-151,318182,2.255744471e-144,TRUE,1.1278722355e-144,TRUE 
## "rs9937053","A","G",0.475,0.0797,0.0031,4.23e-145,321110,1.080611451e-138,TRUE,3.60203817e-139,TRUE 
## "rs9936385","C","T",0.4333,0.0795,0.0031,2.73e-144,321627,6.97415901e-138,TRUE,1.7435397525e-138,TRUE