Before you run this, please remember:
bwa, samtools, picard,
gatk, cnvkit, and vep.sorted.bam files you have. The initial steps will
create analysis-ready BAM files (.final.bam).knitr::purl to extract the shell commands
into a script if you prefer to run it directly in your terminal.Here is the R Markdown file content:
This document provides a comprehensive, end-to-end bioinformatics
pipeline to process and analyze Whole Exome Sequencing (WES) data for
Patient 1. The dataset includes a matched normal sample
(N1) and three tumor/cell line samples (T1,
X1, XX1).
The primary goals of this analysis are: 1. To perform quality control
and pre-processing on the existing BAM files. 2. To call somatic single
nucleotide variants (SNVs) and insertions/deletions (indels). 3. To
identify somatic copy number variations (CNVs). 4. To produce
visualizations that can be integrated with and used to validate findings
from single-cell RNA sequencing analyses (inferCNV,
scevan).
This pipeline follows GATK Best Practices for somatic variant discovery and uses CNVkit for CNV detection.
This chunk defines all the necessary paths for tools, reference genomes, and data. You must edit these paths to match your system’s configuration before running any part of this pipeline.
#!/bin/bash
# -----------------------------------------------------------------
# USER-DEFINED VARIABLES: PLEASE EDIT THESE PATHS
# -----------------------------------------------------------------
# Base directory where N1, T1, X1, XX1 folders are located
BASE_DIR="~/1-Thesis_Final_Year_2025/2025-Year3_Analysis/1-scRNA_RESULTS-19-11-2025/15-WES_Patient1_L1_L2_Analysis-19-11-2025/Exome_Patient_1"
# Reference genome files (must be indexed with .fai and .dict)
REF_GENOME="/path/to/your/reference/hg38.fa"
# Known sites VCFs for GATK
DBSNP="/path/to/your/dbsnp_138.hg38.vcf.gz"
GNOMAD="/path/to/your/af-only-gnomad.hg38.vcf.gz"
# Interval list for the exome capture kit used
# This is CRITICAL for accurate CNV analysis and targeted variant calling.
INTERVAL_LIST="/path/to/your/exome_capture_kit.bed" # Can be in BED or Picard interval_list format
# Software paths (if not in system PATH, provide full path)
GATK="gatk"
PICARD="picard"
CNVKIT="cnvkit.py"
VEP_PATH="/path/to/your/ensembl-vep" # Path to VEP directory
# Create output directories
mkdir -p $BASE_DIR/01_preprocessed_bams
mkdir -p $BASE_DIR/02_mutect2_results
mkdir -p $BASE_DIR/03_cnvkit_results
mkdir -p $BASE_DIR/04_visualizations
The provided sorted.bam files are the starting point. We
will perform duplicate marking and Base Quality Score Recalibration
(BQSR) to generate analysis-ready BAM files.
Duplicates arise from PCR amplification during library preparation
and can bias variant calls. We use Picard’s MarkDuplicates
to identify and remove them.
#!/bin/bash
# Load variables from the configuration chunk
source <(sed -n '/^#/,/USER-DEFINED/d;/^```{bash, chunk_name="configuration"}/,/^```/p' wes_analysis.Rmd | grep -v "```")
for SAMPLE in N1 T1 X1 XX1; do
echo "--- Marking duplicates for $SAMPLE ---"
INPUT_BAM="$BASE_DIR/$SAMPLE/sorted.bam"
OUTPUT_BAM="$BASE_DIR/01_preprocessed_bams/${SAMPLE}.dedup.bam"
METRICS_FILE="$BASE_DIR/01_preprocessed_bams/${SAMPLE}.dedup_metrics.txt"
if [ -f "$OUTPUT_BAM" ]; then
echo "$OUTPUT_BAM already exists, skipping."
continue
fi
$PICARD MarkDuplicates \
I=$INPUT_BAM \
O=$OUTPUT_BAM \
M=$METRICS_FILE \
REMOVE_DUPLICATES=true \
CREATE_INDEX=true \
VALIDATION_STRINGENCY=SILENT
done
BQSR adjusts base quality scores to correct for systematic technical errors from the sequencing process, improving variant call accuracy.
#!/bin/bash
# Load variables from the configuration chunk
source <(sed -n '/^#/,/USER-DEFINED/d;/^```{bash, chunk_name="configuration"}/,/^```/p' wes_analysis.Rmd | grep -v "```")
for SAMPLE in N1 T1 X1 XX1; do
echo "--- Running BQSR for $SAMPLE ---"
DEDUP_BAM="$BASE_DIR/01_preprocessed_bams/${SAMPLE}.dedup.bam"
RECAL_TABLE="$BASE_DIR/01_preprocessed_bams/${SAMPLE}.recal_data.table"
FINAL_BAM="$BASE_DIR/01_preprocessed_bams/${SAMPLE}.final.bam"
if [ -f "$FINAL_BAM" ]; then
echo "$FINAL_BAM already exists, skipping."
continue
fi
# Step 1: Generate recalibration table
$GATK BaseRecalibrator \
-R $REF_GENOME \
-I $DEDUP_BAM \
--known-sites $DBSNP \
-O $RECAL_TABLE
# Step 2: Apply recalibration
$GATK ApplyBQSR \
-R $REF_GENOME \
-I $DEDUP_BAM \
--bqsr-recal-file $RECAL_TABLE \
-O $FINAL_BAM
done
We use GATK’s Mutect2 in tumor-normal mode to call somatic SNVs and
indels for each tumor sample (T1, X1,
XX1) against the matched normal (N1).
#!/bin/bash
# Load variables from the configuration chunk
source <(sed -n '/^#/,/USER-DEFINED/d;/^```{bash, chunk_name="configuration"}/,/^```/p' wes_analysis.Rmd | grep -v "```")
NORMAL_BAM="$BASE_DIR/01_preprocessed_bams/N1.final.bam"
for TUMOR in T1 X1 XX1; do
echo "--- Calling somatic variants for $TUMOR vs N1 ---"
TUMOR_BAM="$BASE_DIR/01_preprocessed_bams/${TUMOR}.final.bam"
VCF_OUT="$BASE_DIR/02_mutect2_results/${TUMOR}.somatic.raw.vcf.gz"
if [ -f "$VCF_OUT" ]; then
echo "$VCF_OUT already exists, skipping."
continue
fi
$GATK Mutect2 \
-R $REF_GENOME \
-I $TUMOR_BAM \
-I $NORMAL_BAM \
-normal N1 \
--germline-resource $GNOMAD \
-O $VCF_OUT
done
Raw variant calls must be filtered to remove false positives and then annotated to predict their functional effects.
FilterMutectCalls applies a series of filters to the raw
VCF from Mutect2 to increase precision.
#!/bin/bash
# Load variables from the configuration chunk
source <(sed -n '/^#/,/USER-DEFINED/d;/^```{bash, chunk_name="configuration"}/,/^```/p' wes_analysis.Rmd | grep -v "```")
for TUMOR in T1 X1 XX1; do
echo "--- Filtering variant calls for $TUMOR ---"
RAW_VCF="$BASE_DIR/02_mutect2_results/${TUMOR}.somatic.raw.vcf.gz"
FILTERED_VCF="$BASE_DIR/02_mutect2_results/${TUMOR}.somatic.filtered.vcf.gz"
if [ -f "$FILTERED_VCF" ]; then
echo "$FILTERED_VCF already exists, skipping."
continue
fi
$GATK FilterMutectCalls \
-R $REF_GENOME \
-V $RAW_VCF \
-O $FILTERED_VCF
done
We use Ensembl Variant Effect Predictor (VEP) to annotate the filtered variants with information about genes, transcripts, and predicted protein-level consequences.
#!/bin/bash
# Load variables from the configuration chunk
source <(sed -n '/^#/,/USER-DEFINED/d;/^```{bash, chunk_name="configuration"}/,/^```/p' wes_analysis.Rmd | grep -v "```")
for TUMOR in T1 X1 XX1; do
echo "--- Annotating variants for $TUMOR with VEP ---"
FILTERED_VCF="$BASE_DIR/02_mutect2_results/${TUMOR}.somatic.filtered.vcf.gz"
ANNOTATED_VCF="$BASE_DIR/02_mutect2_results/${TUMOR}.somatic.annotated.vcf"
if [ -f "$ANNOTATED_VCF" ]; then
echo "$ANNOTATED_VCF already exists, skipping."
continue
fi
# Run VEP. Assumes VEP cache is installed.
$VEP_PATH/vep \
--input_file $FILTERED_VCF \
--output_file $ANNOTATED_VCF \
--format vcf \
--vcf \
--symbol \
--terms SO \
--tsl \
--hgvs \
--fasta $REF_GENOME \
--cache \
--force_overwrite
done
We use CNVkit to detect somatic copy number changes. It
builds a reference profile from the normal sample and then calls gains
and losses in the tumor samples relative to that reference.
#!/bin/bash
# Load variables from the configuration chunk
source <(sed -n '/^#/,/USER-DEFINED/d;/^```{bash, chunk_name="configuration"}/,/^```/p' wes_analysis.Rmd | grep -v "```")
echo "--- Running CNVkit pipeline ---"
NORMAL_BAM="$BASE_DIR/01_preprocessed_bams/N1.final.bam"
REFERENCE_CNN="$BASE_DIR/03_cnvkit_results/reference.cnn"
# Step 4.1: Build the reference from the normal sample
# In a real-world scenario, a pool of normals is better, but one is sufficient.
$CNVKIT batch $NORMAL_BAM \
--normal \
--targets $INTERVAL_LIST \
--fasta $REF_GENOME \
--output-reference $REFERENCE_CNN \
--output-dir $BASE_DIR/03_cnvkit_results/N1
# Step 4.2: Analyze each tumor sample against the reference
for TUMOR in T1 X1 XX1; do
echo "--- Calling CNVs for $TUMOR ---"
TUMOR_BAM="$BASE_DIR/01_preprocessed_bams/${TUMOR}.final.bam"
$CNVKIT batch $TUMOR_BAM \
--reference $REFERENCE_CNN \
--output-dir $BASE_DIR/03_cnvkit_results/$TUMOR
done
This final step generates plots that are crucial for comparing WES
results with your scRNA-seq inferCNV and
scevan outputs.
A heatmap provides a genome-wide overview of copy number changes across all tumor samples, making it easy to spot shared alterations.
#!/bin/bash
# Load variables from the configuration chunk
source <(sed -n '/^#/,/USER-DEFINED/d;/^```{bash, chunk_name="configuration"}/,/^```/p' wes_analysis.Rmd | grep -v "```")
echo "--- Generating CNV heatmap ---"
$CNVKIT heatmap \
$BASE_DIR/03_cnvkit_results/T1/*.cns \
$BASE_DIR/03_cnvkit_results/X1/*.cns \
$BASE_DIR/03_cnvkit_results/XX1/*.cns \
-o $BASE_DIR/04_visualizations/CNV_heatmap.pdf
Interpretation: Compare the
CNV_heatmap.pdf with your inferCNV heatmap.
Consistent chromosomal gains (red) and losses (blue) between the bulk
WES and single-cell data provide strong validation. This is key to
confirming that the CNV events you see in the single cells are real and
not computational artifacts.
An OncoPrint is the standard way to visualize the landscape of somatic mutations across a cohort. The following R code chunk provides a template for creating one. You will need to parse your annotated VCF files to create the input matrix.
Note: This is an R code chunk. You will need to run
this in an R environment with the ComplexHeatmap library
installed (install.packages("ComplexHeatmap")).
# This chunk is set to 'eval=FALSE'.
# You must run this code manually in your R console after installing packages.
# install.packages("vcfR")
# install.packages("ComplexHeatmap")
# BiocManager::install("ComplexHeatmap") # If not on CRAN
library(vcfR)
library(ComplexHeatmap)
# --- 1. Define a function to parse VCF and extract key mutations ---
# This is a simplified parser. You may want to add more filters (e.g., by allele frequency).
parse_mutect_vcf <- function(vcf_file, sample_name) {
vcf <- read.vcfR(vcf_file)
# Keep only PASS variants
pass_vcf <- vcf[getFILTER(vcf) == "PASS", ]
# Extract gene symbol and consequence from INFO field (VEP annotation)
# The 'ANN' field format can vary, so you might need to adjust this.
# Example format: Allele|Consequence|IMPACT|SYMBOL|...
ann <- extract.info(pass_vcf, "ANN", as.is = TRUE)
# A simple loop to get the most severe consequence per variant
results <- lapply(ann, function(variant_ann) {
# Split multiple annotations
annotations <- strsplit(variant_ann, ",")[[1]]
# Get the first annotation (often the most canonical)
fields <- strsplit(annotations[1], "\\|")[[1]]
consequence <- fields[2]
gene <- fields[4]
# Simplify consequences
mut_type <- "other"
if (grepl("missense", consequence)) mut_type <- "missense"
if (grepl("frameshift|stop_gained|splice", consequence)) mut_type <- "truncating"
return(data.frame(gene = gene, mut_type = mut_type, sample = sample_name))
})
# Combine results and remove empty entries
df <- do.call(rbind, results)
df <- df[df$gene != "", ]
return(df)
}
# --- 2. Process each VCF file ---
vcf_dir <- "02_mutect2_results" # Relative to your R project directory
t1_muts <- parse_mutect_vcf(file.path(vcf_dir, "T1.somatic.annotated.vcf"), "T1")
x1_muts <- parse_mutect_vcf(file.path(vcf_dir, "X1.somatic.annotated.vcf"), "X1")
xx1_muts <- parse_mutect_vcf(file.path(vcf_dir, "XX1.somatic.annotated.vcf"), "XX1")
all_muts <- rbind(t1_muts, x1_muts, xx1_muts)
# --- 3. Create the matrix for OncoPrint ---
# Get unique genes and samples
genes <- unique(all_muts$gene)
samples <- c("T1", "X1", "XX1")
# Create an empty matrix
mat <- matrix("", nrow = length(genes), ncol = length(samples),
dimnames = list(genes, samples))
# Populate the matrix
for (i in 1:nrow(all_muts)) {
mat[all_muts$gene[i], all_muts$sample[i]] <- all_muts$mut_type[i]
}
# Optional: Filter for top mutated genes to make the plot readable
# For example, keep genes mutated in at least 2 samples
num_mutated <- rowSums(mat != "")
mat_filtered <- mat[num_mutated >= 1, ] # Keep genes mutated in at least one sample
# --- 4. Define the plotting style and create the OncoPrint ---
col <- c("truncating" = "maroon", "missense" = "royalblue")
alter_fun <- list(
background = function(x, y, w, h) {
grid.rect(x, y, w-unit(2, "pt"), h-unit(2, "pt"),
gp = gpar(fill = "#e0e0e0", col = NA))
},
truncating = function(x, y, w, h) {
grid.rect(x, y, w-unit(2, "pt"), h-unit(2, "pt"),
gp = gpar(fill = col["truncating"], col = NA))
},
missense = function(x, y, w, h) {
grid.rect(x, y, w-unit(2, "pt"), h*0.4,
gp = gpar(fill = col["missense"], col = NA))
}
)
# Generate the plot
oncoPrint(mat_filtered,
alter_fun = alter_fun,
col = col,
row_order = sort(rownames(mat_filtered)),
column_title = "Somatic Mutation Landscape in Patient 1 Samples")
# You can save this plot using ggsave or pdf() functions.
Interpretation: The OncoPrint will clearly show which genes are mutated in T1, X1, and XX1. This is your primary tool for confirming clonal relationships. If X1 and XX1 are indeed cell lines derived from Patient 1’s tumor (T1), they should share a substantial number of key driver mutations. ```