Declaration: It is important to note that the dataset used in this example is solely for demonstration purposes and does not carry any specific biological implications. Consequently, the results obtained from the SCAMPI analysis should not be interpreted as having biological significance.
The Genetic Data Structure (GDS) is a specialized format designed for storing genomic data efficiently, offering rapid access to specific data subsets. To leverage the benefits of GDS, genetic data must be converted into this format.
For those looking to convert VCF files to GDS, a detailed tutorial is
available at the UW SISG 2021 GDS Format Guide.
Additionally, the snpgdsBED2GDS function in the package
SNPRelate offers a way to convert PLINK files to GDS. More
information about this function can be found on the SNPRelate’s Documentation
Page.
Before proceeding to convert to the GDS format, it is required to
install the BiocManager package. This package is required for installing
other necessary packages such as SeqArray, and
SNPRelate. Below is the code to install all these required
packages for SCAMPI using BiocManager. You need to uncomment the code to
install all the packages required by SCAMPI.
# if (!require("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# BiocManager::install("SeqArray")
library(SeqArray)
# BiocManager::install("SNPRelate")
library(SNPRelate)
# install.packages("gdata")
library(gdata)
# install.packages("dglm")
library(dglm)
# install.packages("lubridate")
library(lubridate)
# install.packages("tidyverse")
library(tidyverse)
# install.packages("qqman")
library(qqman)Then let’s request SCAMPI.
The dataset, named trait, is a dataframe consisting of
1,126 rows and 11 columns. The initial column, subject_id, uniquely
identifies each participant. Following this are four columns providing
demographic details for the 1,126 subjects, which include sex, age,
height (in centimeters), and the study (specific study cohort they
belong to). The remaining six columns, labeled trait_1 through trait_6,
correspond to six different quantitative traits.
head(trait)
#> subject_id sex age height study trait_1 trait_2 trait_3
#> 1 HG00096 M 47 165.3 study_1 2.79034687 1.70004346 3.0952465
#> 2 HG00102 F 49 169.1 study_1 3.22659839 -2.05812375 0.1992559
#> 3 HG00112 M 46 167.9 study_1 4.22466800 -0.04550758 -0.8016060
#> 4 HG00114 M 49 169.5 study_1 3.64999061 5.54232754 2.0862986
#> 5 HG00115 M 35 161.1 study_1 4.96493849 1.84596132 3.1227874
#> 6 HG00116 M 37 182.2 study_1 0.05729449 -0.05469801 0.6482322
#> trait_4 trait_5 trait_6
#> 1 -2.7304511 11.777443 -8.128874
#> 2 -0.2594509 6.614053 -2.075008
#> 3 0.9910414 9.011473 -5.812822
#> 4 -2.4230308 10.121841 -5.221409
#> 5 -2.6454378 7.086742 -2.042744
#> 6 -1.3573220 8.173591 -2.495327For demonstration purposes, the genotype is limited to a small
subset of SNPs from chromosome 1 and has been converted
into the GDS format, named 1KG_phase3_subset_chr1.gds.
# open a connection to the GDS file
SNP_data <- seqOpen("../data/1KG_phase3_subset_chr1.gds")
SNP_data
#> Object of class "SeqVarGDSClass"
#> File: /Users/shijiabian/Dropbox/SCAMPI/data/1KG_phase3_subset_chr1.gds (70.5K)
#> + [ ] *
#> |--+ description [ ] *
#> |--+ sample.id { Str8 1126 LZMA_ra(9.66%), 877B } *
#> |--+ variant.id { Int32 1120 LZMA_ra(17.5%), 793B } *
#> |--+ position { Int32 1120 LZMA_ra(78.5%), 3.4K } *
#> |--+ chromosome { Str8 1120 LZMA_ra(4.55%), 109B } *
#> |--+ allele { Str8 1120 LZMA_ra(26.0%), 1.2K } *
#> |--+ genotype [ ] *
#> | |--+ data { Bit2 2x1126x1121 LZMA_ra(8.34%), 51.4K } *
#> | |--+ extra.index { Int32 3x0 LZMA_ra, 18B } *
#> | \--+ extra { Int16 0 LZMA_ra, 18B }
#> |--+ phase [ ]
#> | |--+ data { Bit1 1126x1120 LZMA_ra(0.11%), 177B } *
#> | |--+ extra.index { Int32 3x0 LZMA_ra, 18B } *
#> | \--+ extra { Bit1 0 LZMA_ra, 18B }
#> |--+ annotation [ ]
#> | |--+ id { Str8 1120 LZMA_ra(40.4%), 3.6K } *
#> | |--+ qual { Float32 1120 LZMA_ra(2.46%), 117B } *
#> | |--+ filter { Int32,factor 1120 LZMA_ra(2.46%), 117B } *
#> | |--+ info [ ]
#> | \--+ format [ ]
#> \--+ sample.annotation [ ]
seqClose(SNP_data)The data sets utilized in this tutorial are sourced from the UW SISG 2021 GDS Sample Data.
The objective at this step is to format the variables appropriately.
The variable sex will be coded as a binary variable. Given
that the study variable comprises three categories, it will
be encoded into two columns.
trait$sex_F <- 0
trait$sex_F[trait$sex == "M"] <- 1
trait$study_2 <- 0
trait$study_3 <- 0
trait$study_2[trait$study == "study_2"] <- 1
trait$study_3[trait$study == "study_3"] <- 1As a reminder that checking for the normal distribution of trait outcomes (as represented by the six traits in this dataset) is a standard practice, which is to ensure that the residuals are normally distributed when conducting regression analysis. If the trait outcomes are not normally distributed, applying an inverse normal transformation is one common method used to transform the data. Other approaches can also be considered if needed. In our data, the six traits are approximately normally distributed, so we do not need further transformation.
The primary objective of this exercise is to show the users regarding how to employ SCAMPI to investigate the presence of G x E (Gene by Environment) or G x G (Gene by Gene) interaction effects pertaining to at least one of these six traits at chromosome 1. While sex, age, height, and study cohort are treated as confounding factors in this analysis, you can apply your expertise in selecting alternative variables as either confounders or traits for your personal practice.
The process begins with establishing a connection to the GDS file,
accomplished by using the seqGetData function. Following
this, we are able to extract both the SNP ID and the subject ID from the
GDS file.
Pheno_Outcome <- "Trait1-6"
Confounder <- "Age; Gender; Height; Study ID"
N_pheno <- 6
# open a connection to the GDS file
SNP_data <- seqOpen("../data/1KG_phase3_subset_chr1.gds")
# Get the SNP ID that will be included
variant.id <- seqGetData(SNP_data, "variant.id")
SNV.id <- variant.id
# Select the sample/participant ID
sample.id.sliced <- trait$subject_idTo store the final p-values, we will create some empty dataframes.
The dataframe, named df_res, is designated for storing the
aggregated p-values from both SCAMPI. It includes initial columns
(variant_id to alt) that uniquely identify each SNP. The column
p_value is reserved for the final p-values, while the
method column indicates whether SCAMPI or other method,
such as Multivariate-Levene’s test.
Additionally, we have another dataframe:
df_scampi_single_pval. This dataframe stores the individual
p-values associated with each variance or covariance element for the
SCAMPI. For each SNP, SCAMPI produces P choose 2 plus P p-values for P
traits. In this example, it will generate 21 columns to store 21
p-values from the 6 traits.
# Aggregated p-value
df_res <- data.frame(
variant_id = double(),
chrom = double(),
pos = character(),
ref = character(),
alt = character(),
N = double(),
N_complete = double(),
N_pheno = double(),
Pheno_Outcome = character(),
maf_N_complete = double(),
Confounder = character(),
method = character(),
p_value = double(),
time = double(),
stringsAsFactors = FALSE
)
# SCAMPI single p-value
df_scampi_single_pval <- data.frame(
variant_id = double(),
chrom = double(),
pos = character(),
ref = character(),
alt = character(),
N = double(),
N_complete = double(),
N_pheno = double(),
Pheno_Outcome = character(),
maf_N_complete = double(),
Confounder = character(),
method = character(),
p_value = double(),
stringsAsFactors = FALSE
)
df_scampi_single_pval <- cbind(df_scampi_single_pval, replicate(choose(N_pheno, 2) + N_pheno - 1, df_scampi_single_pval$p_value))
names(df_scampi_single_pval)[(length(names(df_scampi_single_pval)) - choose(N_pheno, 2) - N_pheno + 1):length(names(df_scampi_single_pval))] <- paste("pval_", c(1:(choose(N_pheno, 2) + N_pheno)), sep = "")The seqSetFilter function allows us to filter the data
to include only the intersection of chosen subjects and/or SNPs. This is
particularly useful for segmenting the long sequence of SNPs into
smaller subsets, facilitating the submission to multiple job arrays when
running SCAMPI on High-Performance Computing (HPC) systems. In this
example, given the small sample size, we can conveniently filter by all
SNP IDs and sample IDs.
# Open the filter
seqSetFilter(SNP_data,
variant.id = SNV.id,
sample.id = sample.id.sliced
)
#> # of selected samples: 1,126
#> # of selected variants: 1,120
# The sample ID ordered as shown in genotype dataset
sample_id_sliced <- seqGetData(SNP_data, "sample.id")
sample_id_sliced_df <- data.frame(sample_id_sliced)
names(sample_id_sliced_df) <- "subject_id"
chrom <- seqGetData(SNP_data, "chromosome")
pos <- seqGetData(SNP_data, "position")
ref <- seqGetData(SNP_data, "$ref")
alt <- seqGetData(SNP_data, "$alt")Finally, we iterate over each SNP, applying the SCAMPI method using
the SCAMPI function. The code contains detailed comments
outlining each step of these procedures.
One additional note is, the Multivariate-Levene’s test, a new method
developed from the single-variate Levene’s test, is also implemented by
our pacckage through the multivariate_levene_test
function.
for (genom in 1:length(SNV.id)) {
# Prep 1: Order the phenotype data according to the order of the genotype data
joint_sliced <- sample_id_sliced_df %>%
left_join(., trait)
# Prep 2: Prepare Genotypes(G), add G to dataframe for selecting complete data
dos <- seqGetData(SNP_data, "$dosage")
X <- dos[, genom]
One_X_dat <- cbind(joint_sliced, X)
Complete_One_X_dat <- One_X_dat[complete.cases(One_X_dat), ]
temp_maf <- min(
sum(Complete_One_X_dat$X) / (length(Complete_One_X_dat$X) * 2),
1 - sum(Complete_One_X_dat$X) / (length(Complete_One_X_dat$X) * 2)
)
# Prep 3: Prepare Phenotypic Outcomes
Y_df <- as.data.frame(Complete_One_X_dat[, names(Complete_One_X_dat) %in% c("trait_1", "trait_2", "trait_3", "trait_4", "trait_5", "trait_6")])
sim_Y <- as.matrix(Y_df)
# Prep 4: Prepare Genotype
X <- as.matrix(as.data.frame(Complete_One_X_dat$X))
# Prep 5: Prepare Confounders
Z_con <- as.matrix(as.data.frame(Complete_One_X_dat[, names(Complete_One_X_dat) %in% c("sex_F", "age", "height", "study_2", "study_3")]))
if (dim(sim_Y)[2] != N_pheno) {
stop("The second dimension of Y is wrong!")
}
# Aplly SCAMPI
SCAMPI_START <- Sys.time()
SCAMPI_result <- SCAMPI(
y = sim_Y,
g_var = X,
z_var = Z_con
)
SCAMPI_pval <- SCAMPI_result$aggregate_pvalue
pval_single_SCAMPI <- SCAMPI_result$single_pvalue
SCAMPI_END <- Sys.time()
SCAMPI_time <- as.numeric(SCAMPI_END - SCAMPI_START, units = "secs")
df_scampi <- data.frame(
variant_id = genom,
chrom = as.numeric(chrom[genom]),
pos = toString(pos[genom]),
ref = toString(ref[genom]),
alt = toString(alt[genom]),
N = nrow(joint_sliced),
N_complete = nrow(Complete_One_X_dat),
N_pheno = N_pheno,
Pheno_Outcome = Pheno_Outcome,
maf_N_complete = temp_maf,
Confounder = Confounder,
method = "SCAMPI",
p_value = SCAMPI_pval,
time = SCAMPI_time,
stringsAsFactors = FALSE
)
df_res <- rbind(
df_res,
df_scampi
)
# Save single p-value
df_scampi_single <- data.frame(
variant_id = genom,
chrom = as.numeric(chrom[genom]),
pos = toString(pos[genom]),
ref = toString(ref[genom]),
alt = toString(alt[genom]),
N = nrow(joint_sliced),
N_complete = nrow(Complete_One_X_dat),
N_pheno = N_pheno,
Pheno_Outcome = Pheno_Outcome,
maf_N_complete = temp_maf,
Confounder = Confounder,
method = "SCAMPI",
stringsAsFactors = FALSE
)
for (i in 1:(choose(N_pheno, 2) + N_pheno)) {
df_scampi_single[paste("pval_", i, sep = "")] <- pval_single_SCAMPI[i]
}
df_scampi_single_pval <- rbind(df_scampi_single_pval, df_scampi_single)
if (genom %% 100 == 0) {
print(genom)
}
}
#> [1] 100
#> [1] 200
#> [1] 300
#> [1] 400
#> [1] 500
#> [1] 600
#> [1] 700
#> [1] 800
#> [1] 900
#> [1] 1000
#> [1] 1100
seqClose(SNP_data)You have the flexibility to choose your preferred method for
visualizing the p-values, such as using a Manhattan plot. However, it’s
advisable to perform some Quality Control (QC) filtering beforehand to
exclude SNPs that don’t meet QC standards. This QC step can be applied
either before using seqSetFilter or after analyzing all
SNPs using SCAMPI. For example, we can eliminate less common SNPs with a
Minor Allele Frequency (MAF) below 0.05 and those with more than 10%
missing data. Additionally, we create a new column named SNP to
represent the RS ID. This ID, which is absent in our dataset but
necessary for generating a Manhattan plot using the
manhattan function, is thus artificially generated for
plotting purposes. The real genotype data, such as UKBB, should have the
correct RS ID label.
df_res$SNP <- paste("rs", df_res$variant_id, sep = "")
df_res_qc <- df_res %>% filter(maf_N_complete >= 0.05, N_complete >= 0.9 * 1126)The manhattan function from the qqman
package can be utilized to generate a Manhattan plot. This particular
visualization will focus exclusively on a subset of SNPs located on the
first chromosome.