Environment

References

  1. Where can I find the Agilent Target BED files?
  2. Genomic and Functional Approaches to Understanding Cancer Aneuploidy.
  3. facets_preview.
  4. FACETS: allele-specific copy number and clonal heterogeneity analysis tool for high-throughput DNA sequencing.
1. Allele-specific copy number analysis is better than conventional total copy number analysis because it:
   a) identifies more copy number aberrations;
   b) precisely pinpoints homozygous/heterozygous deletions, copy-neutral LOH, and allele-specific gains and amplifications;
   c) gives more accurate estimates of tumour purity and ploidy.
  1. Whole-genome sequencing reveals the molecular implications of the stepwise progression of lung adenocarcinoma
Estimation of clone architecture
Clone numbers were estimated according to VAFs of somatic point mutations using PyClone-VI (version 0.1.1) with the parameters -c 40 -d beta-binomial -r 10. The input files included read count information of somatic mutations from the Mutect2 results and information on copy numbers and cellularity, which were calculated from tumor and normal WGS datasets using FACETS (version 0.6.2). From the BAM files, read count files for dbSNP build 151 (https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/common_all_20180418.vcf.gz) were generated by snp-pileup (version 434b5ce) with the parameters “-g -q15 -Q20 -P100 -r25,0” according to the developer’s tutorial. Next, copy numbers were calculated using FACETS with “cval = 400” for procSample function. The clone evolution of each case was inferred and visualized by ClonEvol (version 0.99.11).
https://github.com/asuzuki-asuzuki/Early-Ad_2023
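For reference, the FACETS side of that pipeline corresponds roughly to the following R calls; a minimal sketch, assuming a snp-pileup output file X991.csv (hypothetical name) and the excerpt's cval = 400:

library(facets)

rcmat <- readSnpMatrix("X991.csv")   ## read the snp-pileup counts (hypothetical file name)
xx <- preProcSample(rcmat)           ## preprocessing: het-SNP calls, GC correction
oo <- procSample(xx, cval = 400)     ## segmentation with cval = 400, as in the excerpt
fit <- emcncf(oo)                    ## EM fit: purity, ploidy, allele-specific copy numbers
c(fit$purity, fit$ploidy)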
  1. Comprehensive genomic profiling of breast cancers characterizes germline-somatic mutation interactions mediating therapeutic vulnerabilities
Somatic variant calling
GATK Mutect2 was used to identify somatic mutations. The VCF files were annotated using ANNOVAR (v2015-06-17). The variants and annotation results were transferred to Excel spreadsheets. A panel of normals (PoN) was used to screen out expected germline variations and artifacts to improve specificity. Each alteration identified by the pipeline was manually reviewed to confirm that no false-positive variants were reported. SAMtools (v2.6.2) and GATK were used to acquire the sequencing quality statistics. The FACETS algorithm (v0.16.0) was used to detect gene-level amplifications and deletions.
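In facetsSuite terms, gene-level calls like those described in the excerpt can be derived from a FACETS fit with gene_level_changes; a minimal sketch, assuming a facetsSuite::run_facets output object (named hisens_fit here, as in the two-pass sketch later in this notebook):

library(facetsSuite)

## map segment-level copy number onto genes (fit object from run_facets, see the Facets section below)
gene_calls <- gene_level_changes(hisens_fit, genome = "hg38")
head(gene_calls)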
  1. Patient‐derived organoids for personalized gallbladder cancer modelling and drug screening.
WES 
Copy-number analyses
For WES data, FACETS (v0.5.0) was applied for calling copy number alterations.37 The paired tumor and normal sorted BAM files, together with SNP locations, were used to calculate the counts of the reference nucleotide, alternative nucleotide, errors, and deletions at each SNP. The resulting count files were then used as input to FACETS to estimate cellular fractions and copy numbers.

Running Notebook.

knitr::opts_chunk$set(eval = FALSE)

Install Packages.

if(requireNamespace("ABSOLUTE", quietly = TRUE)) {
  print("Package ABSOLUTE exists. ")
} else {
  library(numDeriv)
  devtools::install_github("ShixiangWang/DoAbsolute")
  path_to_file = system.file("extdata", "ABSOLUTE_1.0.6.tar.gz", package = "DoAbsolute", mustWork = TRUE, 
                             lib.loc = "/rds/general/user/ph323/home/anaconda3/lib/R/library")
  install.packages(path_to_file, repos = NULL, type = "source")
} 

if(requireNamespace("facetsSuite", quietly = TRUE)) {
  print("Package facetsSuite exists.")
} else {
  devtools::install_github("mskcc/facets-suite")
}

if(requireNamespace("facetsPreview", quietly = TRUE)) {
  print("Package facetsPreview exists. ")
} else {
  devtools::install_github("taylor-lab/facets-preview", ref = "master")
}

# Install GenVisR if you haven't already
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GenVisR")

Load packages.

library(dplyr)
library(magrittr)
library(httr)
library(Rsamtools)
library(facets)
library(ASCAT)
library(data.tree)
library(ABSOLUTE)
library(caroline)
library(DoAbsolute)
library(facetsSuite)
library(argparse)
library(ggplot2)
library(egg)
library(purrr)
library(tibble)
library(GenVisR)
library(Gviz)

Control Board.

PRINT <- TRUE
EXECUTE <- FALSE
DATA <- "/rds/general/user/ph323/ephemeral/HH_ova/Alignment_ov"
RESULT <- "~/MRes_project_1/docs/HH_ova/facet_input"
EXAMPLE <- "~/MRes_project_1/Codes/3_therapy_resp/example"
GATK <- "~/MRes_project_1/otherCodes/gatk-4.2.5.0/gatk" ## path to the gatk executable
SNP_PILEUP <- "~/MRes_project_1/Codes/3_therapy_resp/pileup_exe/snp-pileup" ## path to the snp-pileup executable
VCF_FILE <- "~/MRes_project_1/Codes/3_therapy_resp/vcf_file"

FACETS

Required Files:
1. the snp-pileup executable, with the parameters specified below
2. a VCF file of common SNPs
3. an output file name
4. the normal-sample BAM file
5. the tumour-sample BAM file

snp-pileup

This application will, given a VCF file containing SNP locations, output for each SNP the counts of the reference nucleotide, alternative nucleotide, errors, and deletions. These counts can then be used in facets.
It was developed on a Linux machine and tested with HTSlib v1.3.1 (e.g. conda install bioconda::htslib).

Installation
First, HTSlib must be installed on your system. To do that, download it from http://www.htslib.org/download/ and follow the “Building and installing” instructions on that page. If installed systemwide (in /usr/local/lib) using “make install” ensure that the libraries are available with the command “sudo ldconfig” (only needs to be run once).
This code can be compiled using g++ -std=c++11 snp-pileup.cpp -lhts -o snp-pileup when htslib is available systemwide, or g++ -std=c++11 -I/path/htslib/include snp-pileup.cpp -L/path/htslib/lib -lhts -Wl,-rpath=/path/htslib/lib -o snp-pileup when it is installed locally and path is the location where it is available.

#!/bin/bash

#PBS -l select=1:ncpus=1:mem=10gb
#PBS -l walltime=2:00:00
#PBS -N install_snp_pileup

## switch to the snp-pileup folder
cd ~/MRes_project_1/Codes/3_therapy_resp/pileup_exe/

## compile snp-pileup if it does not exist yet
if [ ! -f snp-pileup ]; then
    g++ -std=c++11 snp-pileup.cpp -lhts -o snp-pileup ## compile
else
    find . -type f -name 'snp-pileup' ## already compiled; print its path
fi

Usage
snp-pileup <vcf file> <output file> <sequence files...>
Usage of snp-pileup requires a VCF file and one or more sequence files containing DNA. The sequence files should be in BAM format, and both the VCF and all sequence files must be sorted. A suitable VCF is the common-variant file from NCBI dbSNP. (WARNING: such a link points to the VCF for the current genome build version.)
Use a VCF file that is consistent with the genome build used for aligning the sequencing data. Available snp/genome build versions are in directories human_9606_b149_GRCh37p13, human_9606_b150_GRCh37p13, human_9606_b149_GRCh38p7, human_9606_b150_GRCh38p7.

VCF-files.

Compare BAM and VCF chromosome orders

#!/bin/bash

#PBS -l select=1:ncpus=1:mem=100gb
#PBS -l walltime=2:00:00
#PBS -N get_order_files

## change to vcf directory 
cd ~/MRes_project_1/Codes/3_therapy_resp/vcf_file
if [ ! -f 00-common_all.vcf.gz ] && [ ! -f vcf_order.txt ]; then
    ## clear the directory
    rm -f ./*
    ## download the dbSNP common-variant VCF for hg38 from NCBI
    wget https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-common_all.vcf.gz
    ## extract the chromosome list
    bcftools query -f '%CHROM\n' 00-common_all.vcf.gz | uniq > vcf_order.txt
else
    find . -type f -name 'vcf_order.txt' ## already done; print vcf_order.txt
fi

## change to example file directory 
cd ~/MRes_project_1/Codes/3_therapy_resp/example
## extract the chromosome order of the BAM files (normal-tumour pairs) for comparison with the VCF
samtools idxstats X991T_sorted_nodup.bam | cut -f 1 | uniq > X991T_order.txt
samtools idxstats X991N_sorted_nodup.bam | cut -f 1 | uniq > X991N_order.txt
samtools idxstats X970T_sorted_nodup.bam | cut -f 1 | uniq > X970T_order.txt
samtools idxstats X970N_sorted_nodup.bam | cut -f 1 | uniq > X970N_order.txt

ls -l
## Check: is the chromosome order the same across all tumour and normal samples?
setwd("~/MRes_project_1/Codes/3_therapy_resp/example")
files <- list.files(pattern = "_order\\.txt$") ## list the ordering files
order <- lapply(files, read.table) %>% as.data.frame()
colnames(order) <- gsub("_order.txt", "", files)
order <- order[1:194, ] ## drop the trailing '*' row that samtools idxstats appends
ANSWER <- lapply(order, identical, order$X970N) %>% unlist ## TRUE for every sample if the chromosome order is unified
print(paste("Is the chromosome order the same across all tumour and normal samples?", ANSWER))

## Check: is the VCF order the same as the BAM order?
vcf_order <- read.table("~/MRes_project_1/Codes/3_therapy_resp/vcf_file/vcf_order.txt") %>% unlist
ANSWER <- identical(as.character(order$X970N), as.character(vcf_order))
print(paste("Is the VCF order the same as the BAM order?", ANSWER))
print(vcf_order)

## build the target chromosome order for the VCF: entries 1-22 plus X and Y (the 23rd contig is skipped)
vcf_new_order <- order$X970N[c(1:22, 24:25)]
write.table(vcf_new_order, file = "~/MRes_project_1/Codes/3_therapy_resp/vcf_file/vcf_new_order.txt",
            row.names = FALSE, col.names = FALSE, quote = FALSE) ## col.names = FALSE: a header line would corrupt the awk rank table below

Reorder the VCF file according to the new chromosome order.

#!/bin/bash

#PBS -l select=1:ncpus=1:mem=100gb
#PBS -l walltime=2:00:00
#PBS -N resort_vcf_file

## reorder vcf file 
cd ~/MRes_project_1/Codes/3_therapy_resp/vcf_file/

if [ ! -f sorted_vcf_file.vcf.gz ] && [ ! -f sorted_vcf_file.vcf ]; then 
    gunzip 00-common_all.vcf.gz
    grep '^#' 00-common_all.vcf > header.vcf ## keep the VCF header
    ## rank each body line by its chromosome's position in vcf_new_order.txt, sort by rank, drop the rank column
    awk 'NR==FNR{a[$1]=NR; next} !/^#/ {print a[$1],$0}' vcf_new_order.txt 00-common_all.vcf | sort -k1,1n | cut -d' ' -f2- > sorted_body.vcf
    cat header.vcf sorted_body.vcf > sorted_vcf_file.vcf
    bgzip sorted_vcf_file.vcf
else
    find . -type f -name 'sorted_vcf_file.vcf.gz' ## already done; print the sorted VCF
fi 

BAM files.

There are 119 ovarian cancer samples in total; 98 are paired normal-tumour samples, while the remaining 21 have only a tumour sample available. Because FACETS requires a normal-tumour pair, we use the normal of one sample, "X991N", as the control for these 21 tumour-only samples and, in the FACETS analysis, specify that the pair is not matched.
For snp-pileup, we build two lists of BAM file names (normal and tumour, in matching order) for the executable to read; see the BAM file lists section below.
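When running FACETS directly rather than through the wrapper, the unmatched pairing is declared via the unmatched argument of facets::preProcSample. A minimal sketch, assuming a snp-pileup output X123.csv for a hypothetical tumour-only sample:

library(facets)

rcmat <- readSnpMatrix("X123.csv")           ## snp-pileup counts (hypothetical file name)
xx <- preProcSample(rcmat, unmatched = TRUE) ## the "normal" is not from the same patient
oo <- procSample(xx)
fit <- emcncf(oo)                            ## purity, ploidy, allele-specific copy numbers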

Sequencing quality control.

SAMtools (v.1.14.0) and GATK were used to acquire sequence quality statistics.
Tutorial

Depth of Sequencing Coverage
1. Sequencing coverage: the percentage (or total number) of target bases that have been sequenced, relative to a reference set of bases. In WES, different platforms have different bait sets that capture different regions of the genome, and therefore have different reference sets. Providing a reference set of sequenced regions not only speeds up computational methods for determining sequencing depth, it also enables accurate quantification of that depth. WGS, by contrast, covers essentially all bases in the genome.
2. Sequencing depth: how many times each covered base is sequenced on average. Sequencing depth is an important consideration in any next-generation sequencing (NGS) analysis because it plays a critical role in our ability to call germline and somatic mutations. The higher (deeper) the sequencing depth, the more statistical power we have to identify true alterations and differentiate them from artifacts (i.e., false positives).

For example, when calling germline alterations we expect a variant allele fraction (VAF) of 50% (i.e., the alteration sits on one copy of the chromosome at that locus). One way to quickly flag a false-positive germline call is a binomial test with an expected probability of success of 0.5, e.g. binom.test(alt_reads, depth, p = 0.5). If the proportion of alternate reads, given the sequencing depth, significantly deviates from the expected germline VAF of 50%, the call is likely an artifact. The higher the sequencing depth, the more confident we are that the observed VAF of a true germline alteration will be close to 50%; conversely, at a sequencing depth of 10 we cannot confidently say that an observed VAF of 30% deviates significantly from the expected 50%. In practice, one should also inspect IGV snapshots of the germline alterations when filtering artifacts (if feasible).

For somatic mutation calling, sequencing depth primarily affects our ability to call subclonal mutations (i.e., mutations present in only a fraction of tumor cells). The lower the cancer cell fraction (CCF), the higher the sequencing depth required to confidently call the mutation. The ABSOLUTE paper provides an excellent explanation of how tumor purity, local copy number, and sequencing depth affect the power to call mutations.
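To make the binomial check concrete, here is how depth changes the conclusion for an observed VAF of 30%:

## VAF of 30% at depth 10: cannot reject the expected germline VAF of 50%
binom.test(3, 10, p = 0.5)$p.value   # ~0.34, consistent with a true germline call

## the same VAF at depth 100: strongly deviates from 50%
binom.test(30, 100, p = 0.5)$p.value # well below 0.05, likely an artifact (or somatic)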

DepthOfCoverage (GATK3.7)
The descriptions of the inputs are as follows:

sample_ids_NT <- gsub("_sorted_nodup.bam", "", file_list) ## file_list is built in the BAM file lists section below
write.table(sample_ids_NT, "~/MRes_project_1/Codes/3_therapy_resp/sample_ids_NT.txt",
            row.names = FALSE, col.names = FALSE, quote = FALSE) ## col.names = FALSE so no "x" header pollutes the bash loops

Reference FASTA file
-R Reference FASTA file for the genome reference used (e.g. hg19 or hg38)

/rds/general/user/ph323/home/MRes_project_1/docs/HH_ova/sure_select/hg38.fa
/rds/general/user/ph323/home/MRes_project_1/docs/HH_ova/sure_select/hg38.fa.fai

The genomic intervals (regions)
-L The genomic intervals (regions) that we want coverage information on. For whole-exomes these would be the regions captured by the bait kits. For genomes this can be left empty.
/rds/general/user/ph323/home/MRes_project_1/docs/HH_ova/sure_select/S07604514_Covered.bed
BedToIntervalList (Picard)

#!/bin/bash

#PBS -l select=1:ncpus=1:mem=100gb
#PBS -l walltime=2:00:00
#PBS -N interval_list

## change to correct directory 
cd /rds/general/user/ph323/home/MRes_project_1/docs/HH_ova/sure_select

## download the fasta file and make fasta.fai file as well 
if [ ! -f hg38.fa.fai ]; then
    wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/latest/hg38.fa.gz ## download
    gunzip hg38.fa.gz ## unzip
    samtools faidx hg38.fa ## create hg38.fa.fai
    samtools dict hg38.fa > hg38.dict ## create the sequence dictionary that BedToIntervalList needs
else
    find . -type f -name 'hg38.fa.fai'
fi

## create interval list
if [ ! -f S07604514_Covered.interval_list ]; then
    ## use Picard's BedToIntervalList to create the interval_list
    java -jar /rds/general/user/ph323/home/anaconda3/envs/Renv/share/picard-2.18.23-0/picard.jar BedToIntervalList \
        I=S07604514_Covered.bed \
        O=S07604514_Covered.interval_list \
        SD=hg38.dict
else
    find . -type f -name 'S07604514_Covered.interval_list' ## if exist, print S07604514_Covered.interval_list
fi

Min Base Quality
minBaseQuality The minimum base quality, determined by the sequencer, to be counted in the coverage metrics
Phred-scaled quality scores
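For reference, a Phred-scaled score Q corresponds to an error probability of 10^(-Q/10), so the thresholds used in this notebook translate directly into error rates:

## error probability for a Phred-scaled quality score Q
phred_to_error <- function(Q) 10^(-Q / 10)
phred_to_error(c(20, 30)) # 0.01 and 0.001, i.e. 1% and 0.1% error rates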

Finding the minMappingQuality
--minMappingQuality The minimum mapping quality, determined by the alignment algorithm (e.g. BWA), to be counted in the coverage metrics. A threshold that's too high may exclude too many reads, while a threshold that's too low may include low-quality alignments.

#!/bin/bash

#PBS -l select=1:ncpus=1:mem=100gb
#PBS -l walltime=2:00:00
#PBS -N mapping_quality

# Directory containing BAM files
BAM_DIR="/rds/general/user/ph323/ephemeral/HH_ova/Alignment_ov"
# Output directory for the mapping quality files
OUTPUT_DIR="$HOME/MRes_project_1/docs/HH_ova/facets/mapping_quality"


cd ~/MRes_project_1/Codes/3_therapy_resp

# Loop through each line in sample_ids.txt
while IFS= read -r sample_ids_NT; do
    # Construct the full path to the BAM file
    bam_file="${BAM_DIR}/${sample_ids_NT}_sorted_nodup.bam"

    # Construct the full path for the output file
    output_file="${OUTPUT_DIR}/${sample_ids_NT}_mapping_quality.txt"

    # Check if the BAM file exists
    if [ -f "$bam_file" ]; then
        # Run samtools and store the output in the specified output file
        samtools view "$bam_file" | awk '{print $5}' | sort | uniq -c > "$output_file"
        echo "Processed $bam_file and output to $output_file"
    else
        echo "BAM file not found: $bam_file"
    fi
done < "sample_ids_NT.txt"

Add Read Group Information

cd /rds/general/user/ph323/ephemeral/HH_ova/Alignment_ov
java -jar /rds/general/user/ph323/home/anaconda3/envs/Renv/share/picard-2.18.23-0/picard.jar AddOrReplaceReadGroups \
      I=X991N_sorted_nodup.bam \
      O=X991N_sorted_nodup_rg.bam \
      RGID=X991N_2024-01-19_1 \
      RGLB=SureSelectV6_r2 \
      RGPL=ILLUMINA \
      RGPU=unit1 \
      RGSM=X991
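The depth-of-coverage loop below assumes every BAM carries read groups, not just X991N; a sketch that applies the same Picard command across all samples from R (RG fields as in the example above, RGID simplified):

sample_ids_NT <- readLines("~/MRes_project_1/Codes/3_therapy_resp/sample_ids_NT.txt")
picard <- "/rds/general/user/ph323/home/anaconda3/envs/Renv/share/picard-2.18.23-0/picard.jar"
bam_dir <- "/rds/general/user/ph323/ephemeral/HH_ova/Alignment_ov"

for (id in sample_ids_NT) {
  cmd <- sprintf(paste(
    "java -jar %s AddOrReplaceReadGroups",
    "I=%s/%s_sorted_nodup.bam O=%s/%s_sorted_nodup_rg.bam",
    "RGID=%s RGLB=SureSelectV6_r2 RGPL=ILLUMINA RGPU=unit1 RGSM=%s"),
    picard, bam_dir, id, bam_dir, id, id, sub("[NT]$", "", id))
  system(cmd) ## shell out to Picard for each sample
}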

Performing GATK Depth of Coverage Analysis

#!/bin/bash

#PBS -l select=1:ncpus=1:mem=100gb
#PBS -l walltime=2:00:00
#PBS -N gatk_score

cd /rds/general/user/ph323/home/MRes_project_1/docs/HH_ova/sure_select/

# Directory containing BAM files
BAM_DIR="/rds/general/user/ph323/ephemeral/HH_ova/Alignment_ov"
# Output directory for the depth-of-coverage files
OUTPUT_DIR="$HOME/MRes_project_1/docs/HH_ova/facets/base_quality"
mkdir -p "$OUTPUT_DIR" ## ensure the output directory exists

minMappingQuality=20
minBaseQuality=30
refFasta="hg38.fa" ## the FASTA downloaded above
intervalList="S07604514_Regions.bed"

while IFS= read -r sample_id; do
    ## use each sample's read-grouped BAM (read groups added as shown above)
    bam_file="${BAM_DIR}/${sample_id}_sorted_nodup_rg.bam"
    output_file="${OUTPUT_DIR}/${sample_id}_depthOfCoverage"

    if [ -f "$bam_file" ]; then
        java -Xmx15g -jar /rds/general/user/ph323/home/anaconda3/envs/Renv/opt/gatk-3.8/GenomeAnalysisTK.jar \
            -R ${refFasta} \
            -T DepthOfCoverage \
            -o ${output_file} \
            -omitBaseOutput \
            -I ${bam_file} \
            -L ${intervalList} \
            --minBaseQuality ${minBaseQuality} \
            --minMappingQuality ${minMappingQuality}
        echo "Processed $bam_file and output to $output_file"
    else
        echo "BAM file not found: $bam_file"
    fi
done < "$HOME/MRes_project_1/Codes/3_therapy_resp/sample_ids_NT.txt" ## sample_ids_NT.txt lives in the Codes directory, not here
BAM file lists

## extract the files that are not paired
# Assuming you're working in the directory containing the files:
file_list <- list.files(path = "/rds/general/user/ph323/ephemeral/HH_ova/Alignment_ov", pattern = "_sorted_nodup.bam$")  
sample_ids <- sub("(X[0-9]+)[NT]_sorted_nodup.bam", "\\1", file_list) ## extract sample ids 
table_ids <- table(sample_ids)

## unpaired files 
unpaired_ova_bam <- file_list[sample_ids %in% names(table_ids[table(sample_ids) == 1])]
## paired files 
paired_ova_bam <- file_list[sample_ids %in% names(table_ids[table(sample_ids) == 2])]

## Process the bam files 
paired_tumour <- paired_ova_bam[grep("T_sorted_nodup.bam$", paired_ova_bam)] %>% sort ## get all the tumour samples 
paired_normal <- paired_ova_bam[grep("N_sorted_nodup.bam$", paired_ova_bam)] %>% sort ## get all the normal samples 
unpaired_tumour <- unpaired_ova_bam[grep("sorted_nodup.bam", unpaired_ova_bam)]
unpaired_normal <- rep("X991N_sorted_nodup.bam", length(unpaired_tumour))

## combine the files 
normal <- c(unpaired_normal, paired_normal) ## the first 21 samples will not be matched 
tumour <- c(unpaired_tumour, paired_tumour) 

## check that the BAM files are paired up correctly by randomly sampling a few:
##   for samples 22 to 119, the normal and tumour sample ids should be identical
##   for samples 1 to 21, the normal sample should always be X991N
set.seed(1234)
random <- sample(1:119, 12, replace = FALSE)
for(i in random){
  print(c(i, normal[i], tumour[i]))
}

## write the lists; col.names = FALSE keeps write.table from adding an "x" header line,
## which the bash loops below would otherwise read as a sample
write.table(tumour, "~/MRes_project_1/Codes/3_therapy_resp/tumour_order.txt",
            row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(normal, "~/MRes_project_1/Codes/3_therapy_resp/normal_order.txt",
            row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(unique(sample_ids), "~/MRes_project_1/Codes/3_therapy_resp/sample_ids.txt",
            row.names = FALSE, col.names = FALSE, quote = FALSE)

snp-pileup:

The input VCF file should contain polymorphic SNPs, so that FACETS can infer changes in allelic configuration at genomic loci from changes in allele ratios. dbSNP is a good source for this. By default, snp-pileup also estimates the read depth in the input BAM files every 50th base.

#!/bin/bash

#PBS -l select=1:ncpus=1:mem=100gb
#PBS -l walltime=24:00:00
#PBS -N snp_pileup

## first switch to result directory to check if this step has been done for all samples 
cd ~/MRes_project_1/docs/HH_ova/facets/facet_input/
file_count=$(find . -type f -name '*_ordered.csv' | wc -l) ## count the number of processed files 
N=119 ## number of samples 

# Compare file count
if [ $file_count -lt $N ]; then 
    
    ## change directory to where the bam files were stored 
    cd /rds/general/user/ph323/ephemeral/HH_ova/Alignment_ov
    
    ## fixed parameters:
    snp_pileup_path="/rds/general/user/ph323/home/MRes_project_1/Codes/3_therapy_resp/pileup_exe/snp-pileup"
    params="-q15 -Q20 -P100 -r25,0"
    vcf="/rds/general/user/ph323/home/MRes_project_1/docs/snp_pileup/sorted_vcf_file.vcf.gz"

    ## per-sample inputs:
    normal_bams=($(cat /rds/general/user/ph323/home/MRes_project_1/Codes/3_therapy_resp/normal_order.txt))
    tumour_bams=($(cat /rds/general/user/ph323/home/MRes_project_1/Codes/3_therapy_resp/tumour_order.txt))

    # Loop over the BAM files
    for ((i=1; i<=$N; i++)); do
        ## Adjust index for zero-based array
        adjusted_index=$((i - 1))

        ## Extract normal and tumor bam file paths
        normal_bam="${normal_bams[$adjusted_index]}"
        tumor_bam="${tumour_bams[$adjusted_index]}"
        
        ## Construct the output file path (the same directory the guard above checks)
        output_file_name=$(basename "${tumor_bam}" "T_sorted_nodup.bam")
        output_file="$HOME/MRes_project_1/docs/HH_ova/facets/facet_input/${output_file_name}.csv"

        ## Construct the command
        cmd="$snp_pileup_path $params $vcf $output_file $normal_bam $tumor_bam"
        
        # Run the command
        echo "Running: $cmd"
        eval $cmd
        
        echo "Finished piling ${output_file_name}"
    done
else
    find . -type f -name '*_ordered.csv' | wc -l ## all samples already piled up; print the count
fi

Reorder the pileup files and gzip them for the subsequent run-facets wrapper

## reorder the snp-pileup outputs and gzip them before executing facets
setwd("~/MRes_project_1/docs/HH_ova/facets/facet_input")
for (cancer in sample_ids) {
  output <- paste0(cancer, "_ordered.csv")
  gzip_output <- paste0(cancer, "_ordered.csv.gz")

  if (!file.exists(output)) {
    ## reorder the snp-pileup file to 1, 2, ..., 22, X, Y
    temp <- read.csv(paste0(cancer, ".csv"))
    chrom_order <- c(as.character(1:22), "X", "Y") ## target chromosome order
    temp <- temp[order(match(temp$Chromosome, chrom_order), temp$Position), ]
    write.csv(temp, output, quote = FALSE, row.names = FALSE) ## write.csv always writes the header row
  }
  if (!file.exists(gzip_output)) {
    ## gzip the ordered file for the run-facets wrapper
    temp <- read.csv(output)
    write.csv(temp, gzfile(gzip_output), row.names = FALSE)
  }
}

Facets

facetsSuite is an R package with functions to run FACETS, an allele-specific copy-number caller for paired tumor-normal DNA-sequencing data from genome-wide and targeted assays. facetsSuite both wraps the code that executes the FACETS algorithm itself and performs post-hoc analyses on the resulting data. This package was developed by members of the Taylor lab and the Computational Sciences group within the Center for Molecular Oncology at Memorial Sloan Kettering Cancer Center.

This wrapper takes the SNP "pileup" above as input and executes the FACETS algorithm. The outputs are Rdata objects, TXT files, and PNGs of each sample's overall copy-number profile. The wrapper allows running FACETS in a two-pass mode, where first a "purity" run estimates the overall segmentation profile, sample purity, and ploidy, and the dipLogR value from this run then seeds a "high-sensitivity" run which may detect more focal events. To run in two-pass mode, specify the input arguments prefixed by purity. The cval parameters (--purity-cval and --cval) tune the segmentation coarseness.

#!/bin/bash

#PBS -l select=1:ncpus=1:mem=100gb
#PBS -l walltime=24:00:00
#PBS -N run_facet

# Set Job Home Directory
JOB_HOME=/rds/general/user/ph323/home/MRes_project_1/Codes/3_therapy_resp
cd $JOB_HOME

# Assuming sample_ids.txt is in the JOB_HOME directory
# Use full path if the file is somewhere else
SAMPLE_IDS_PATH="$JOB_HOME/sample_ids.txt"

if [ -f "$SAMPLE_IDS_PATH" ]; then
    mapfile -t cancer_list < "$SAMPLE_IDS_PATH"

    # Loop through each cancer type
    for cancer in "${cancer_list[@]}"; do
        counts_file="/rds/general/user/ph323/home/MRes_project_1/docs/HH_ova/facets/facet_input/${cancer}_ordered.csv.gz"
        directory="/rds/general/user/ph323/home/MRes_project_1/docs/HH_ova/facets/facet_output/${cancer}"
        mkdir -p "$directory" 

        $JOB_HOME/facets-suite/run-facets-wrapper.R \
            --counts-file "$counts_file" \
            --sample-id ${cancer} \
            --directory "$directory" \
            --everything \
            --genome 'hg38' \
            --cval 25 \
            --purity-cval 50 \
            --facets-lib-path "/rds/general/user/ph323/home/anaconda3/envs/Renv/lib/R/library"
        echo "Processed ${cancer}"
    done
else
    echo "Error: sample_ids.txt not found in $SAMPLE_IDS_PATH"
    exit 1
fi

The above command runs FACETS in the two-pass mode, first at cval 50, then at cval 25 based on the sample-specific baseline found at the higher cval. The full suite of analysis and QC is run with the --everything flag. If no output directory is specified, a directory named after sample-id is created.
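The same two-pass logic can also be reproduced from R with facetsSuite's run_facets; a minimal sketch for a single sample, using one of the pileup files generated above:

library(facetsSuite)

counts <- read_snp_matrix("X991_ordered.csv.gz") ## pileup file from the snp-pileup step

## pass 1: coarse "purity" run to establish the diploid baseline
purity_fit <- run_facets(counts, cval = 50, genome = "hg38")

## pass 2: high-sensitivity run seeded with the purity run's dipLogR
hisens_fit <- run_facets(counts, cval = 25,
                         dipLogR = purity_fit$dipLogR,
                         genome = "hg38")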

Create segmentation files.

facet_process <- function(sample_ids, seg_file_type){
  ## sample_id from sample_ids 
  ## seg_file_type: _hisens_diplogR.adjusted.seg, _hisens_diplogR.unadjusted.seg, 
  ##                _purity_diplogR.adjusted.seg, _purity_diplogR.unadjusted.seg
  
  ## build an empty dataframe 
  temp <- matrix(NA, nrow = 1, ncol = 6) %>% as.data.frame
  colnames(temp) <- c("ID", "chrom", "loc.start", "loc.end", "num.mark", "seg.mean")
  
  
  ## loop through each directory and get the dataframe 
  for(sample_id in sample_ids){
    ## set directory and switch to directory 
    directory <- paste0("/rds/general/user/ph323/home/MRes_project_1/docs/HH_ova/facets/facet_output/", sample_id)
    setwd(directory)
    
    ## import hisens_adj segmented file 
    segmented_file <- read.table(paste0(sample_id, seg_file_type), header = TRUE)
    temp <- rbind(temp, segmented_file)
  }
  ## remove the initial NA placeholder row
  temp <- temp[-1, ]

  ## print the result 
  print(gsub("_|\\.", " ", seg_file_type))
  print(summary(temp))
  
  ## write the file 
  setwd("/rds/general/user/ph323/home/MRes_project_1/docs/HH_ova/facets/segmentation_file")
  write.table(temp, seg_file_type, quote = FALSE, row.names = FALSE, col.names = FALSE, sep = "\t")
  
  ## return the file 
  return(temp)
}

hisens_adj <- facet_process(sample_ids, "_hisens_diplogR.adjusted.seg")
hisens_unadj <- facet_process(sample_ids, "_hisens_diplogR.unadjusted.seg")
purity_adj <- facet_process(sample_ids, "_purity_diplogR.adjusted.seg")
purity_unadj <- facet_process(sample_ids, "_purity_diplogR.unadjusted.seg")

Visualise our segmentation file
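One option, since GenVisR is loaded above, is its cnSpec plot; a minimal sketch, assuming the hisens_adj data frame returned by facet_process (the column mapping below is an assumption worth checking against your GenVisR version):

library(dplyr)
library(GenVisR)

## cnSpec expects the columns: chromosome, start, end, segmean, sample
cn <- hisens_adj %>%
  transmute(chromosome = chrom,
            start = as.numeric(loc.start),
            end = as.numeric(loc.end),
            segmean = as.numeric(seg.mean),
            sample = ID)

cnSpec(cn, genome = "hg38") ## genome-wide copy-number heatmap across all samples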