This document provides the definitive end-to-end bioinformatics pipeline to analyze the Whole Exome Sequencing (WES) data for Patient 1 and its derived cell lines.
Key Findings & Strategy: 1. The original WES BAM
files are missing critical @RG (Read Group) sample name
information, and the gnomAD reference file has an incompatible
chromosome naming scheme. 2. The scRNA-seq data was processed on
hg38, while the WES data is on hg19. 3.
Therefore, the correct strategy is: a. First, create new, corrected BAM
files with the proper @RG headers. b. Second, create a new,
corrected gnomAD file with the proper chromosome names and index.
c. Perform all WES analyses in hg19 using these new,
corrected files. d. Use liftOver to convert the final CNV
results to hg38 for direct comparison.
Sample Information: * N1: Normal (Constitutional DNA) * T1: Primary Tumor (Fresh Sézary Cells) * X1: In-vivo derived Cell Line (from Mouse) * XX1: Secondary Cell Line (cultured from X1)
This chunk defines all necessary paths and global variables for the entire pipeline.
#!/bin/bash
# Activate the conda environment to make its tools available
source /home/bioinfo/miniconda3/etc/profile.d/conda.sh
conda activate wes_env
# --- DEFINE GLOBAL VARIABLES ---
BASE_DIR="/home/bioinfo/1-Thesis_Final_Year_2025/2025-Year3_Analysis/1-scRNA_RESULTS-19-11-2025/15-WES_Patient1_L1_L2_Analysis-19-11-2025/Exome_Patient_1"
REF_GENOME="/home/bioinfo/1-Thesis_Final_Year_2025/2025-Year3_Analysis/1-scRNA_RESULTS-19-11-2025/15-WES_Patient1_L1_L2_Analysis-19-11-2025/Exome_Patient_1/hg19/hg19.fa"
INTERVAL_LIST="/home/bioinfo/1-Thesis_Final_Year_2025/2025-Year3_Analysis/1-scRNA_RESULTS-19-11-2025/15-WES_Patient1_L1_L2_Analysis-19-11-2025/Exome_Patient_1/reference/PJ1610240_covered.bed"
LIFTOVER_CHAIN="/home/bioinfo/1-Thesis_Final_Year_2025/2025-Year3_Analysis/1-scRNA_RESULTS-19-11-2025/15-WES_Patient1_L1_L2_Analysis-19-11-2025/Exome_Patient_1/hg19/hg19ToHg38.over.chain.gz"
VEP_PATH="/home/bioinfo/tools/ensembl-vep/vep"
JAVA_OPTS="-Xmx16g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true"
# --- This is the path to the gnomAD file we will CREATE in the prepare_gnomad step ---
GNOMAD_HG19_CORRECTED="/home/bioinfo/1-Thesis_Final_Year_2025/2025-Year3_Analysis/1-scRNA_RESULTS-19-11-2025/15-WES_Patient1_L1_L2_Analysis-19-11-2025/Exome_Patient_1/reference/gnomad.exomes.r2.0.2.sites.chr.vcf.gz"
# --- OUTPUT DIRECTORIES ---
mkdir -p $BASE_DIR/hg19_analysis/bams_with_rg
mkdir -p $BASE_DIR/hg19_analysis/cnv_results
mkdir -p $BASE_DIR/hg19_analysis/somatic_variants
mkdir -p $BASE_DIR/hg19_analysis/liftover_results
mkdir -p $BASE_DIR/hg19_analysis/visualizations
echo "Configuration loaded and output directories created."
This step fixes the missing @RG (Read Group) headers in
the original BAM files. This is a one-time operation that creates new,
analysis-ready BAMs.
#!/bin/bash
# Activate the conda environment
source /home/bioinfo/miniconda3/etc/profile.d/conda.sh
conda activate wes_env
# --- Define Paths ---
BASE_DIR="/home/bioinfo/1-Thesis_Final_Year_2025/2025-Year3_Analysis/1-scRNA_RESULTS-19-11-2025/15-WES_Patient1_L1_L2_Analysis-19-11-2025/Exome_Patient_1"
OUTPUT_BAM_DIR="$BASE_DIR/hg19_analysis/bams_with_rg"
# --- Loop through each sample to add Read Groups ---
for SAMPLE_ID in N1 T1 X1 XX1; do
INPUT_BAM="$BASE_DIR/$SAMPLE_ID/sorted.bam"
OUTPUT_BAM="$OUTPUT_BAM_DIR/${SAMPLE_ID}.with_rg.bam"
if [ -f "$OUTPUT_BAM" ]; then
echo "Corrected BAM for $SAMPLE_ID already exists. Skipping."
continue
fi
echo "--- Adding Read Group to $SAMPLE_ID ---"
picard AddOrReplaceReadGroups \
I="$INPUT_BAM" \
O="$OUTPUT_BAM" \
RGID="$SAMPLE_ID" \
RGLB="lib1" \
RGPL="illumina" \
RGPU="unit1" \
RGSM="$SAMPLE_ID" \
VALIDATION_STRINGENCY=LENIENT # <--- THE CRITICAL FIX
samtools index "$OUTPUT_BAM"
done
echo "BAM file preparation complete. New files are in $OUTPUT_BAM_DIR"
This step corrects the chromosome naming scheme in the gnomAD VCF file to be compatible with GATK.
This step ensures that all the newly created BAM files have a
corresponding .bai index, which is a strict requirement for
GATK.
#!/bin/bash
# Activate the conda environment
source /home/bioinfo/miniconda3/etc/profile.d/conda.sh
conda activate wes_env
# --- Define Paths ---
BASE_DIR="/home/bioinfo/1-Thesis_Final_Year_2025/2025-Year3_Analysis/1-scRNA_RESULTS-19-11-2025/15-WES_Patient1_L1_L2_Analysis-19-11-2025/Exome_Patient_1"
BAM_DIR="$BASE_DIR/hg19_analysis/bams_with_rg"
# --- Loop through each corrected BAM to create the index ---
for SAMPLE_ID in N1 T1 X1 XX1; do
BAM_FILE="$BAM_DIR/${SAMPLE_ID}.with_rg.bam"
# Check if the index already exists
if [ -f "${BAM_FILE}.bai" ]; then
echo "Index for $SAMPLE_ID already exists. Skipping."
else
echo "--- Indexing corrected BAM for $SAMPLE_ID ---"
samtools index "$BAM_FILE"
fi
done
echo "BAM file indexing complete."
This one-time step ensures the gnomAD VCF uses the chr
prefix and has the correct .tbi index for GATK.
#!/bin/bash
# Activate the conda environment
source /home/bioinfo/miniconda3/etc/profile.d/conda.sh
conda activate wes_env
# --- Define Paths ---
REFERENCE_DIR="/home/bioinfo/1-Thesis_Final_Year_2025/2025-Year3_Analysis/1-scRNA_RESULTS-19-11-2025/15-WES_Patient1_L1_L2_Analysis-19-11-2025/Exome_Patient_1/reference"
OLD_GNOMAD="$REFERENCE_DIR/gnomad.exomes.r2.0.2.sites.vcf.gz"
NEW_GNOMAD="$REFERENCE_DIR/gnomad.exomes.r2.0.2.sites.chr.vcf.gz"
MAP_FILE="$REFERENCE_DIR/chr_name_map.txt"
if [ -f "$NEW_GNOMAD" ] && [ -f "${NEW_GNOMAD}.tbi" ]; then
echo "Corrected gnomAD file and its .tbi index already exist. Skipping preparation."
else
echo "Corrected gnomAD file or its index not found. Starting preparation..."
if [ ! -f "$MAP_FILE" ]; then
echo "Creating chromosome name map file: $MAP_FILE"
for i in {1..22} X Y M; do echo "$i chr$i"; done > "$MAP_FILE"
fi
echo "Starting bcftools annotation... this will take several minutes."
bcftools annotate --rename-chrs "$MAP_FILE" --threads 8 -o "$NEW_GNOMAD" -O z "$OLD_GNOMAD"
echo "Annotation complete. Now creating a Tabix (.tbi) index..."
bcftools index -t "$NEW_GNOMAD"
echo "Preparation complete. The new file '$NEW_GNOMAD' and its .tbi index are ready."
fi
This step uses CNVkit on the newly created BAM
files with correct headers.
#!/bin/bash
# Activate the conda environment
source /home/bioinfo/miniconda3/etc/profile.d/conda.sh
conda activate wes_env
# --- Define Paths ---
BASE_DIR="/home/bioinfo/1-Thesis_Final_Year_2025/2025-Year3_Analysis/1-scRNA_RESULTS-19-11-2025/15-WES_Patient1_L1_L2_Analysis-19-11-2025/Exome_Patient_1"
REF_GENOME="/home/bioinfo/1-Thesis_Final_Year_2025/2025-Year3_Analysis/1-scRNA_RESULTS-19-11-2025/15-WES_Patient1_L1_L2_Analysis-19-11-2025/Exome_Patient_1/hg19/hg19.fa"
INTERVAL_LIST="/home/bioinfo/1-Thesis_Final_Year_2025/2025-Year3_Analysis/1-scRNA_RESULTS-19-11-2025/15-WES_Patient1_L1_L2_Analysis-19-11-2025/Exome_Patient_1/reference/PJ1610240_covered.bed"
BAM_DIR="$BASE_DIR/hg19_analysis/bams_with_rg"
NORMAL_BAM="$BAM_DIR/N1.with_rg.bam"
REFERENCE_CNN="$BASE_DIR/hg19_analysis/cnv_results/reference.cnn"
echo "Building reference from N1 (using corrected BAM)..."
cnvkit.py batch "$NORMAL_BAM" \
--normal \
--targets "$INTERVAL_LIST" \
--fasta "$REF_GENOME" \
--output-reference "$REFERENCE_CNN" \
--output-dir "$BASE_DIR/hg19_analysis/cnv_results/N1" \
-p 8 # <--- CORRECTED
for SAMPLE in T1 X1 XX1; do
echo "--- Calling CNVs for $SAMPLE in hg19 (using corrected BAM) ---"
SAMPLE_BAM="$BAM_DIR/${SAMPLE}.with_rg.bam"
cnvkit.py batch "$SAMPLE_BAM" \
--reference "$REFERENCE_CNN" \
--output-dir "$BASE_DIR/hg19_analysis/cnv_results/$SAMPLE" \
-p 8 # <--- CORRECTED
done
echo "hg19 CNVkit analysis complete."
This step converts the coordinates of the final .cns
segment files from hg19 to hg38 using the
robust liftOver tool.
#!/bin/bash
# Activate the conda environment
source /home/bioinfo/miniconda3/etc/profile.d/conda.sh
conda activate wes_env
# --- Define Paths ---
BASE_DIR="/home/bioinfo/1-Thesis_Final_Year_2025/2025-Year3_Analysis/1-scRNA_RESULTS-19-11-2025/15-WES_Patient1_L1_L2_Analysis-19-11-2025/Exome_Patient_1"
LIFTOVER_CHAIN="/home/bioinfo/1-Thesis_Final_Year_2025/2025-Year3_Analysis/1-scRNA_RESULTS-19-11-2025/15-WES_Patient1_L1_L2_Analysis-19-11-2025/Exome_Patient_1/hg19/hg19ToHg38.over.chain.gz"
echo "--- Lifting over CNV segments from hg19 to hg38 using liftOver ---"
for SAMPLE in T1 X1 XX1; do
# --- CORRECTED FILENAME ---
# CNVkit names the output file based on the input BAM's basename, which was "SAMPLE.with_rg"
CNS_HG19="$BASE_DIR/hg19_analysis/cnv_results/$SAMPLE/${SAMPLE}.with_rg.cns"
# --------------------------
CNS_HG38="$BASE_DIR/hg19_analysis/liftover_results/${SAMPLE}.hg38.cns"
UNMAPPED_FILE="$BASE_DIR/hg19_analysis/liftover_results/${SAMPLE}.unmapped.txt"
if [ -f "$CNS_HG38" ]; then
echo "Lifted file for $SAMPLE already exists, skipping."
continue
fi
# Check if the input file exists before trying to lift it over
if [ ! -f "$CNS_HG19" ]; then
echo "ERROR: Input file $CNS_HG19 not found. Cannot perform liftover."
continue # Skip to the next sample
fi
echo "Lifting over $CNS_HG19 -> $CNS_HG38"
tail -n +2 "$CNS_HG19" | liftOver -bedPlus=4 stdin "$LIFTOVER_CHAIN" "$CNS_HG38" "$UNMAPPED_FILE"
done
echo "LiftOver complete. hg38-coordinate results are in hg19_analysis/liftover_results/"
This step uses GATK Mutect2 and is now fully corrected
and optimized.
#!/bin/bash
# Activate the conda environment
source /home/bioinfo/miniconda3/etc/profile.d/conda.sh
conda activate wes_env
# --- Define Paths ---
BASE_DIR="/home/bioinfo/1-Thesis_Final_Year_2025/2025-Year3_Analysis/1-scRNA_RESULTS-19-11-2025/15-WES_Patient1_L1_L2_Analysis-19-11-2025/Exome_Patient_1"
REF_GENOME="/home/bioinfo/1-Thesis_Final_Year_2025/2025-Year3_Analysis/1-scRNA_RESULTS-19-11-2025/15-WES_Patient1_L1_L2_Analysis-19-11-2025/Exome_Patient_1/hg19/hg19.fa"
GNOMAD_HG19_CORRECTED="/home/bioinfo/1-Thesis_Final_Year_2025/2025-Year3_Analysis/1-scRNA_RESULTS-19-11-2025/15-WES_Patient1_L1_L2_Analysis-19-11-2025/Exome_Patient_1/reference/gnomad.exomes.r2.0.2.sites.chr.vcf.gz"
BAM_DIR="$BASE_DIR/hg19_analysis/bams_with_rg"
NORMAL_BAM="$BAM_DIR/N1.with_rg.bam"
JAVA_OPTS="-Xmx16g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true"
for TUMOR in T1 X1 XX1; do
echo "--- Calling somatic variants for $TUMOR vs N1 in hg19 ---"
TUMOR_BAM="$BAM_DIR/${TUMOR}.with_rg.bam"
RAW_VCF_OUT="$BASE_DIR/hg19_analysis/somatic_variants/${TUMOR}.hg19.raw.vcf.gz"
FILTERED_VCF_OUT="$BASE_DIR/hg19_analysis/somatic_variants/${TUMOR}.hg19.filtered.vcf.gz"
gatk --java-options "$JAVA_OPTS" Mutect2 \
-R "$REF_GENOME" \
-I "$TUMOR_BAM" \
-I "$NORMAL_BAM" \
-tumor "$TUMOR" \
-normal "N1" \
--germline-resource "$GNOMAD_HG19_CORRECTED" \
--native-pair-hmm-threads 8 \
-O "$RAW_VCF_OUT"
if [ -f "$RAW_VCF_OUT" ]; then
echo "Successfully created raw VCF for $TUMOR. Now filtering..."
gatk --java-options "$JAVA_OPTS" FilterMutectCalls \
-V "$RAW_VCF_OUT" \
-R "$REF_GENOME" \
-O "$FILTERED_VCF_OUT"
else
echo "ERROR: Mutect2 failed to create the raw VCF for $TUMOR. Halting."
exit 1
fi
done
echo "Somatic variant calling and filtering complete."
This step annotates the filtered VCF files, preparing them for visualization.
#!/bin/bash
# Activate the DEDICATED VEP environment
source /home/bioinfo/miniconda3/etc/profile.d/conda.sh
conda activate vep_env
# --- Define Paths ---
BASE_DIR="/home/bioinfo/1-Thesis_Final_Year_2025/2025-Year3_Analysis/1-scRNA_RESULTS-19-11-2025/15-WES_Patient1_L1_L2_Analysis-19-11-2025/Exome_Patient_1"
REF_GENOME="/home/bioinfo/1-Thesis_Final_Year_2025/2025-Year3_Analysis/1-scRNA_RESULTS-19-11-2025/15-WES_Patient1_L1_L2_Analysis-19-11-2025/Exome_Patient_1/hg19/hg19.fa"
VEP_SCRIPT_PATH="/home/bioinfo/tools/ensembl-vep/vep"
echo "--- Annotating somatic variants with VEP ---"
for TUMOR in T1 X1 XX1; do
FILTERED_VCF="$BASE_DIR/hg19_analysis/somatic_variants/${TUMOR}.hg19.filtered.vcf.gz"
ANNOTATED_VCF="$BASE_DIR/hg19_analysis/somatic_variants/${TUMOR}.hg19.annotated.vcf"
if [ -f "$ANNOTATED_VCF" ]; then
echo "Annotated file for $TUMOR already exists, skipping."
continue
fi
# --- CORRECTED COMMAND ---
perl "$VEP_SCRIPT_PATH" \
--input_file "$FILTERED_VCF" \
--output_file "$ANNOTATED_VCF" \
--format vcf \
--vcf \
--symbol \
--terms SO \
--tsl \
--hgvs \
--fasta "$REF_GENOME" \
--cache \
--force_overwrite \
--fork 8 \
--offline \
--assembly GRCh37 # <--- ADDED THESE TWO FLAGS
done
echo "VEP annotation complete."
This final section generates the key plots for your analysis.
This heatmap is generated from the hg38 coordinate files
for direct comparison with your inferCNV plots.
#!/bin/bash
# Activate the conda environment
source /home/bioinfo/miniconda3/etc/profile.d/conda.sh
conda activate wes_env
# --- Define Paths ---
BASE_DIR="/home/bioinfo/1-Thesis_Final_Year_2025/2025-Year3_Analysis/1-scRNA_RESULTS-19-11-2025/15-WES_Patient1_L1_L2_Analysis-19-11-2025/Exome_Patient_1"
CNV_DIR="$BASE_DIR/hg19_analysis/cnv_results"
VIS_DIR="$BASE_DIR/hg19_analysis/visualizations"
# Create the visualization directory if it doesn't exist
mkdir -p "$VIS_DIR"
echo "--- Generating CNV heatmap from hg19 .cnr files ---"
# The heatmap command works with .cnr files, not .cns files.
# We will use the original hg19 .cnr files for this gene-centric view.
cnvkit.py heatmap \
"$CNV_DIR/T1/T1.with_rg.cnr" \
"$CNV_DIR/X1/X1.with_rg.cnr" \
"$CNV_DIR/XX1/XX1.with_rg.cnr" \
-o "$VIS_DIR/WES_CNV_Heatmap_TopGenes_hg19.pdf"
echo "CNV heatmap of top variable genes saved to $VIS_DIR/WES_CNV_Heatmap_TopGenes_hg19.pdf"
# Load necessary libraries
library(dplyr)
library(ggplot2)
library(patchwork) # For combining plots
# --- Define Paths ---
liftover_dir <- "/home/bioinfo/1-Thesis_Final_Year_2025/2025-Year3_Analysis/1-scRNA_RESULTS-19-11-2025/15-WES_Patient1_L1_L2_Analysis-19-11-2025/Exome_Patient_1/hg19_analysis/liftover_results"
samples <- c("T1", "X1", "XX1")
# --- Read and Combine Data ---
# Create a list to hold the data for each sample
all_cns_data <- list()
for (sample_id in samples) {
file_path <- file.path(liftover_dir, paste0(sample_id, ".hg38.cns"))
# Read the data and add a column for the sample name
cns_data <- read.delim(file_path, header = TRUE, sep = "\t") %>%
mutate(sample = sample_id)
all_cns_data[[sample_id]] <- cns_data
}
# Combine all data frames into one
combined_data <- bind_rows(all_cns_data)
# --- Prepare Chromosome Ordering ---
# Ensure chromosomes are ordered numerically then by letter (chr1, chr2, ..., chrX)
combined_data$chromosome <- factor(combined_data$chromosome,
levels = paste0("chr", c(1:22, "X", "Y")))
# --- Create the Plot ---
cnv_plot <- ggplot(combined_data, aes(x = start / 1e6, y = log2)) +
geom_point(aes(color = log2), size = 0.7) +
# Use facet_grid to create a separate plot for each sample, arranged vertically
facet_grid(sample ~ chromosome, scales = "free_x", space = "free_x") +
scale_color_gradient2(low = "blue", mid = "grey", high = "red", midpoint = 0) +
labs(
title = "Combined CNV Scatter Plot (hg38 Coordinates)",
x = "Chromosome",
y = "Copy Ratio (log2)"
) +
theme_bw() +
theme(
axis.text.x = element_blank(), # Hide the x-axis numbers
axis.ticks.x = element_blank(),
panel.spacing.x = unit(0.1, "lines"),
strip.text.x = element_text(angle = 90, size = 8), # Rotate chromosome labels
legend.position = "bottom"
)
# Print the plot
print(cnv_plot)
By running these two chunks, you will get both visualizations you
need: 1. A PDF heatmap showing the copy number of the
most variable genes across your samples. 2. A combined scatter
plot showing the complete genome-wide CNV profile for all three
samples, perfect for comparing with inferCNV.
#!/bin/bash
# Activate the conda environment
source /home/bioinfo/miniconda3/etc/profile.d/conda.sh
conda activate wes_env
VCF_DIR="/home/bioinfo/1-Thesis_Final_Year_2025/2025-Year3_Analysis/1-scRNA_RESULTS-19-11-2025/15-WES_Patient1_L1_L2_Analysis-19-11-2025/Exome_Patient_1/hg19_analysis/somatic_variants"
echo "--- Checking for 'PASS' variants in annotated VCF files ---"
for TUMOR in T1 X1 XX1; do
VCF_FILE="$VCF_DIR/${TUMOR}.hg19.annotated.vcf"
if [ -f "$VCF_FILE" ]; then
# Use grep to count lines that are not comments and contain 'PASS'
PASS_COUNT=$(grep -v "^#" "$VCF_FILE" | grep -c "PASS")
echo "Found $PASS_COUNT 'PASS' variants in $VCF_FILE"
else
echo "ERROR: File not found: $VCF_FILE"
fi
done
#!/bin/bash
# This script performs a step-by-step dissection of the VEP annotation string.
VCF_FILE="/home/bioinfo/1-Thesis_Final_Year_2025/2025-Year3_Analysis/1-scRNA_RESULTS-19-11-2025/15-WES_Patient1_L1_L2_Analysis-19-11-2025/Exome_Patient_1/hg19_analysis/somatic_variants/T1.hg19.annotated.vcf"
echo "--- Final Check: Dissecting the first PASS variant's annotation ---"
# 1. Get the full INFO field from the first PASS variant
INFO_FIELD=$(grep -v "^#" "$VCF_FILE" | grep "PASS" | head -n 1 | awk '{print $8}')
# 2. Isolate the full CSQ string from the INFO field
CSQ_STRING=$(echo "$INFO_FIELD" | grep -o "CSQ=[^;]*" | sed 's/CSQ=//')
echo "------------------------------------------------"
echo "Step 1: Full CSQ string:"
echo "$CSQ_STRING"
echo "------------------------------------------------"
# 3. Isolate the VERY FIRST annotation (everything before the first comma)
FIRST_ANNOTATION=$(echo "$CSQ_STRING" | cut -d ',' -f 1)
echo "Step 2: First annotation block:"
echo "$FIRST_ANNOTATION"
echo "------------------------------------------------"
# 4. Split the first annotation by '|' and extract fields 2 and 4
CONSEQUENCE=$(echo "$FIRST_ANNOTATION" | cut -d '|' -f 2)
GENE_SYMBOL=$(echo "$FIRST_ANNOTATION" | cut -d '|' -f 4)
echo "Step 3: Final Extracted Fields:"
echo " - Consequence (Field 2): $CONSEQUENCE"
echo " - Gene Symbol (Field 4): $GENE_SYMBOL"
echo "------------------------------------------------"
if [ -n "$GENE_SYMBOL" ]; then
echo "Conclusion: Bash check successful. Gene Symbol was found in Field 4."
else
echo "Conclusion: Bash check FAILED. Gene Symbol was NOT found in Field 4."
fi
This R chunk visualizes the somatic mutation landscape to show clonal relationships.
# Load necessary libraries
library(vcfR)
library(dplyr)
library(stringr)
library(ComplexHeatmap)
# --- Final, Corrected function to parse VEP CSQ field ---
parse_mutect_vcf_final <- function(vcf_file, sample_name) {
if (!file.exists(vcf_file)) {
warning(paste("VCF file not found:", vcf_file))
return(NULL)
}
vcf <- read.vcfR(vcf_file, verbose = FALSE)
if (!"PASS" %in% getFILTER(vcf)) return(NULL)
pass_vcf <- vcf[getFILTER(vcf) == "PASS", ]
if (nrow(pass_vcf) == 0) return(NULL)
csq <- extract.info(pass_vcf, "CSQ")
results <- lapply(csq, function(variant_csq) {
if (is.na(variant_csq)) return(NULL)
annotations <- strsplit(variant_csq, ",")[[1]]
best_gene <- NA
best_consequence_type <- "other"
for (a in annotations) {
fields <- strsplit(a, "\\|")[[1]]
# --- THIS IS THE CRITICAL FIX ---
# Explicitly get the 4th field for the gene. This will be a single string.
current_gene <- fields[4]
current_consequence_str <- fields[2]
current_mut_type <- "other"
if (grepl("missense_variant", current_consequence_str)) current_mut_type <- "missense"
if (grepl("frameshift_variant|stop_gained|splice_acceptor|splice_donor", current_consequence_str)) current_mut_type <- "truncating"
# Prioritize finding the most damaging mutation and its associated gene
if (current_mut_type == "truncating") {
best_consequence_type <- "truncating"
best_gene <- current_gene
break
} else if (current_mut_type == "missense" && best_consequence_type != "truncating") {
best_consequence_type <- "missense"
best_gene <- current_gene
} else if (best_consequence_type == "other") {
best_gene <- current_gene
}
}
# The 'if' statement will now work because best_gene is a single value
if (is.na(best_gene) || best_gene == "") return(NULL)
return(data.frame(gene = best_gene, mut_type = best_consequence_type, sample = sample_name))
})
df <- do.call(rbind, results)
return(df)
}
# --- Process each annotated VCF file ---
vcf_dir <- "/home/bioinfo/1-Thesis_Final_Year_2025/2025-Year3_Analysis/1-scRNA_RESULTS-19-11-2025/15-WES_Patient1_L1_L2_Analysis-19-11-2025/Exome_Patient_1/hg19_analysis/somatic_variants"
t1_muts <- parse_mutect_vcf_final(file.path(vcf_dir, "T1.hg19.annotated.vcf"), "T1")
x1_muts <- parse_mutect_vcf_final(file.path(vcf_dir, "X1.hg19.annotated.vcf"), "X1")
xx1_muts <- parse_mutect_vcf_final(file.path(vcf_dir, "XX1.hg19.annotated.vcf"), "XX1")
all_muts <- rbind(t1_muts, x1_muts, xx1_muts)
# --- Create the matrix for OncoPrint ---
if (!is.null(all_muts) && nrow(all_muts) > 0) {
genes <- unique(all_muts$gene)
samples <- c("T1", "X1", "XX1")
mat <- matrix("", nrow = length(genes), ncol = length(samples), dimnames = list(genes, samples))
for (i in 1:nrow(all_muts)) {
gene_i <- all_muts$gene[i]
sample_i <- all_muts$sample[i]
mut_type_i <- all_muts$mut_type[i]
current_val <- mat[gene_i, sample_i]
if (current_val == "" || (current_val == "other" && mut_type_i != "other") || (current_val == "missense" && mut_type_i == "truncating")) {
mat[gene_i, sample_i] <- mut_type_i
}
}
num_mutated_across_samples <- rowSums(mat != "")
num_to_plot <- min(30, sum(num_mutated_across_samples > 0))
top_genes <- names(sort(num_mutated_across_samples, decreasing = TRUE)[1:num_to_plot])
mat_filtered <- mat[top_genes, , drop = FALSE]
col <- c("truncating" = "#922B21", "missense" = "#2471A3", "other" = "#D5D8DC")
alter_fun <- list(
background = function(x, y, w, h) grid.rect(x, y, w-unit(2,"pt"), h-unit(2,"pt"), gp=gpar(fill="#f0f0f0", col=NA)),
truncating = function(x, y, w, h) grid.rect(x, y, w-unit(2,"pt"), h-unit(2,"pt"), gp=gpar(fill=col["truncating"], col=NA)),
missense = function(x, y, w, h) grid.rect(x, y, w-unit(2,"pt"), h*0.5, gp=gpar(fill=col["missense"], col=NA)),
other = function(x, y, w, h) grid.rect(x, y, w-unit(2,"pt"), h*0.2, gp=gpar(fill=col["other"], col=NA))
)
oncoPrint(mat_filtered,
alter_fun = alter_fun,
col = col,
row_order = rownames(mat_filtered),
column_title = "Somatic Mutation Landscape (Top Genes)",
heatmap_legend_param = list(title = "Alterations", at = c("truncating", "missense", "other")))
} else {
print("No mutations found to plot. This indicates a persistent issue with VCF parsing.")
}
#!/bin/bash
# Corrected VCF directory path
VCF_DIR="/home/bioinfo/1-Thesis_Final_Year_2025/2025-Year3_Analysis/1-scRNA_RESULTS-19-11-2025/15-WES_Patient1_L1_L2_Analysis-19-11-2025/Exome_Patient_1/hg19_analysis/somatic_variants"
echo "--- Searching for TP53 mutations in all samples (Corrected Path) ---"
for TUMOR in T1 X1 XX1; do
VCF_FILE="$VCF_DIR/${TUMOR}.hg19.annotated.vcf"
echo "--- Checking $TUMOR ---"
if [ ! -f "$VCF_FILE" ]; then
echo "ERROR: VCF file not found at $VCF_FILE"
continue
fi
# Search for lines that are not comments, passed filters, and contain the gene symbol TP53
# The CSQ field is what we need to search in.
grep -v "^#" "$VCF_FILE" | grep "PASS" | grep "CSQ=" | grep -o 'SYMBOL=[^|;]*' | grep "TP53" || echo "No PASSing TP53 mutation found in $TUMOR."
echo ""
done