1. Allele-specific copy number analysis is preferable to conventional total copy number analysis because it (see the sketch below):
a) identifies more copy number aberrations;
b) precisely pinpoints homozygous/heterozygous deletions, copy-neutral LOH, and allele-specific gains and amplifications;
c) gives more accurate estimates of tumour purity and ploidy.
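To illustrate point (b), the minimal R sketch below (hypothetical, not part of the original pipeline) shows how a segment's total copy number (tcn) and minor-allele copy number (lcn), as reported by allele-specific callers such as FACETS, distinguish events that total copy number alone cannot.
## hypothetical sketch: classify a segment from allele-specific copy number
## tcn = total copy number, lcn = minor-allele copy number (FACETS: tcn.em / lcn.em)
classify_segment <- function(tcn, lcn) {
  if (tcn == 0) "homozygous deletion"
  else if (tcn == 1) "hemizygous deletion"
  else if (tcn == 2 && lcn == 0) "copy-neutral LOH"
  else if (tcn > 2 && lcn == 0) "gain with LOH"
  else if (tcn > 2) "allele-specific gain/amplification"
  else "balanced diploid"
}
classify_segment(2, 0) ## "copy-neutral LOH": total CN is 2, so total-CN analysis sees nothing
classify_segment(2, 1) ## "balanced diploid"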
Estimation of clone architecture
Clone numbers were estimated from the VAFs of somatic point mutations using PyClone-VI (version 0.1.1) with the parameters -c 40 -d beta-binomial -r 10. The input files included read-count information for somatic mutations from the Mutect2 results, together with copy numbers and cellularity calculated from the tumor and normal WGS datasets using FACETS (version 0.6.2). From the BAM files, read-count files for dbSNP build 151 (https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/common_all_20180418.vcf.gz) were generated by snp-pileup (version 434b5ce) with the parameters “-g -q15 -Q20 -P100 -r25,0”, following the developer’s tutorial. Copy numbers were then calculated using FACETS with cval = 400 for the procSample function. The clonal evolution of each case was inferred and visualized with ClonEvol (version 0.99.11).
https://github.com/asuzuki-asuzuki/Early-Ad_2023
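A minimal sketch of how such a PyClone-VI input table might be assembled in R is shown below. The column layout follows the PyClone-VI README (mutation_id, sample_id, ref_counts, alt_counts, normal_cn, major_cn, minor_cn, and the optional tumour_content); all names and values here are hypothetical, not taken from the original analysis.
## hypothetical sketch: one row per mutation per sample, as PyClone-VI expects;
## ref/alt counts would come from Mutect2, major/minor CN and purity from FACETS
pyclone_input <- data.frame(
  mutation_id    = c("chr1:12345:C:T", "chr2:67890:G:A"),
  sample_id      = "case01_tumor",
  ref_counts     = c(55L, 40L),
  alt_counts     = c(25L, 38L),
  normal_cn      = 2L,             ## autosomes
  major_cn       = c(2L, 3L),      ## FACETS tcn - lcn
  minor_cn       = c(1L, 0L),      ## FACETS lcn
  tumour_content = 0.65            ## FACETS cellularity estimate
)
write.table(pyclone_input, "case01_pyclone_input.tsv",
            sep = "\t", quote = FALSE, row.names = FALSE)
## then, on the command line (per the PyClone-VI README):
##   pyclone-vi fit -i case01_pyclone_input.tsv -o case01.h5 -c 40 -d beta-binomial -r 10
##   pyclone-vi write-results-file -i case01.h5 -o case01_clusters.tsv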
Somatic variant calling
GATK Mutect2 was used to identify somatic mutations. The VCF files were annotated using ANNOVAR (v2015-06-17), and the variants and annotation results were exported to Excel spreadsheets. A panel of normals (PoN) was used to screen out expected germline variants and artifacts to improve specificity. Each alteration identified by the pipeline was manually reviewed to confirm that no false-positive variants were reported. SAMtools (v2.6.2) and GATK were used to acquire the sequencing quality statistics. The FACETS algorithm (v0.16.0) was used to detect gene-level amplifications and deletions.
WES
Copy-number analyses
For WES data, FACETS (v0.5.0) was applied to call copy number alterations [37]. The sorted BAM files from paired tumor and normal tissue were used, at a given set of SNP locations, to count the reference nucleotide, alternative nucleotide, errors, and deletions at each SNP. The resulting count files were then used as FACETS input to estimate cellular fraction and copy numbers.
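For orientation, the core FACETS calls for a single tumour-normal pair look like this (a minimal sketch using the documented facets R API; the file name is a placeholder):
library(facets)
set.seed(1234)                        ## SNP selection involves random sampling; fix the seed
rcmat <- readSnpMatrix("pileup.csv")  ## snp-pileup output for one tumour-normal pair
xx <- preProcSample(rcmat)            ## GC correction and identification of heterozygous SNPs
oo <- procSample(xx, cval = 150)      ## joint segmentation; larger cval gives coarser segments
fit <- emcncf(oo)                     ## EM fit of allele-specific copy number, purity, and ploidy
fit$purity
fit$ploidy
head(fit$cncf)                        ## per-segment total (tcn.em) and minor (lcn.em) copy number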
knitr::opts_chunk$set(eval = FALSE)
## install ABSOLUTE from the tarball shipped with DoAbsolute, if missing
if (requireNamespace("ABSOLUTE", quietly = TRUE)) {
  print("Package ABSOLUTE exists.")
} else {
  library(numDeriv)  ## ABSOLUTE dependency
  devtools::install_github("ShixiangWang/DoAbsolute")
  path_to_file <- system.file("extdata", "ABSOLUTE_1.0.6.tar.gz",
                              package = "DoAbsolute", mustWork = TRUE,
                              lib.loc = "/rds/general/user/ph323/home/anaconda3/lib/R/library")
  install.packages(path_to_file, repos = NULL, type = "source")
}
## install facetsSuite (the MSKCC wrapper around FACETS), if missing
if (requireNamespace("facetsSuite", quietly = TRUE)) {
  print("Package facetsSuite exists.")
} else {
  devtools::install_github("mskcc/facets-suite")
}
## install facetsPreview, if missing
if (requireNamespace("facetsPreview", quietly = TRUE)) {
  print("Package facetsPreview exists.")
} else {
  devtools::install_github("taylor-lab/facets-preview", ref = "master")
}
# Install GenVisR if you haven't already
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("GenVisR")
library(dplyr)
library(magrittr)
library(httr)
library(Rsamtools)
library(facets)
library(ASCAT)
library(data.tree)
library(ABSOLUTE)
library(caroline)
library(DoAbsolute)
library(facetsSuite)
library(argparse)
library(ggplot2)
library(egg)
library(purrr)
library(tibble)
library(GenVisR)
library(Gviz)
PRINT <- TRUE
EXECUTE <- FALSE
DATA <- "/rds/general/user/ph323/ephemeral/HH_ova/Alignment_ov" ## BAM directory
RESULT <- "~/MRes_project_1/docs/HH_ova/facet_input" ## snp-pileup output directory
EXAMPLE <- "~/MRes_project_1/Codes/3_therapy_resp/example"
GATK <- "~/MRes_project_1/otherCodes/gatk-4.2.5.0/gatk" ## path to the GATK executable
SNP_PILEUP <- "~/MRes_project_1/Codes/3_therapy_resp/pileup_exe/snp-pileup" ## path to the snp-pileup executable
VCF_FILE <- "~/MRes_project_1/Codes/3_therapy_resp/vcf_file" ## directory holding the dbSNP VCF
Required files:
1. the snp-pileup executable, with the parameters specified below;
2. a VCF file of SNP locations;
3. an output file name;
4. the normal-sample BAM files;
5. the tumour-sample BAM files.
Given a VCF file containing SNP locations, this application outputs, for each SNP, the counts of the reference nucleotide, alternative nucleotide, errors, and deletions. These counts can then be used in FACETS. snp-pileup was developed on a Linux machine and tested with HTSlib v1.3.1. HTSlib can also be installed with conda:
conda install bioconda::htslib
Installation
First, HTSlib must be installed on your system. To do that, download it
from http://www.htslib.org/download/ and follow the “Building
and installing” instructions on that page. If installed systemwide (in
/usr/local/lib) using “make install” ensure that the libraries are
available with the command “sudo ldconfig” (only needs to be run
once).
This code can be compiled using
g++ -std=c++11 snp-pileup.cpp -lhts -o snp-pileup when
htslib is available systemwide, or
g++ -std=c++11 -I/path/htslib/include snp-pileup.cpp -L/path/htslib/lib -lhts -Wl,-rpath=/path/htslib/lib -o snp-pileup
when it is installed locally and path is the location where it is
available.
#!/bin/bash
#PBS -l select=1:ncpus=1:mem=10gb
#PBS -l walltime=2:00:00
#PBS -N install_snp_pileup
## switch to facet directory snp folder
cd ~/MRes_project_1/Codes/3_therapy_resp/pileup_exe/
## if snp-pileup does not exist, compile it
if [ ! -f snp-pileup ]; then
g++ -std=c++11 snp-pileup.cpp -lhts -o snp-pileup ## compile
else
find . -type f -name 'snp-pileup' ## already compiled; print its path
fi
Usage
snp-pileup <vcf file> <output file> <sequence files...>
snp-pileup requires a VCF file and one or more sequence files containing DNA. The sequence files should be in BAM format, and both the VCF and all sequence files must be sorted. A suitable VCF is the common-SNP VCF from NCBI dbSNP (warning: the NCBI download link points to the VCF for the current version of the genome build).
Use a VCF file that is consistent with the genome build used for aligning the sequencing data. Available dbSNP/genome-build versions are in the directories human_9606_b149_GRCh37p13, human_9606_b150_GRCh37p13, human_9606_b149_GRCh38p7, and human_9606_b150_GRCh38p7.
Compare the chromosome orders of the BAM files and the VCF file
#!/bin/bash
#PBS -l select=1:ncpus=1:mem=100gb
#PBS -l walltime=2:00:00
#PBS -N get_order_files
## change to vcf directory
cd ~/MRes_project_1/Codes/3_therapy_resp/vcf_file
if [ ! -f 00-common_all.vcf.gz ] && [ ! -f vcf_order.txt ]; then
## clear environment
rm -f ./*
## download the common dbSNP VCF (hg38) from NCBI
wget https://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/00-common_all.vcf.gz
## extract the chromosome order
bcftools query -f '%CHROM\n' 00-common_all.vcf.gz | uniq > vcf_order.txt
else
find . -type f -name 'vcf_order.txt' ## already generated; print its path
fi
## change to example file directory
cd ~/MRes_project_1/Codes/3_therapy_resp/example
## check the alignment of the bam files (normal-tumour paired) and vcf files
samtools idxstats X991T_sorted_nodup.bam | cut -f 1 | uniq > X991T_order.txt
samtools idxstats X991N_sorted_nodup.bam | cut -f 1 | uniq > X991N_order.txt
samtools idxstats X970T_sorted_nodup.bam | cut -f 1 | uniq > X970T_order.txt
samtools idxstats X970N_sorted_nodup.bam | cut -f 1 | uniq > X970N_order.txt
ls -l
## Check: is the chromosome order the same across all tumour and normal samples?
setwd("~/MRes_project_1/Codes/3_therapy_resp/example")
files <- list.files(pattern = "_order\\.txt$") ## list the ordering files
order <- lapply(files, read.table) %>% as.data.frame()
colnames(order) <- gsub("_order.txt", "", files)
order <- order[1:194, ] ## keep the first 194 contigs (drops the trailing "*" row from idxstats)
ANSWER <- lapply(order, identical, order$X970N) %>% unlist ## TRUE where a sample matches X970N's order
print(paste("Is the chromosome order the same across all tumour and normal samples?", ANSWER))
## Check: is the VCF chromosome order the same as the BAM order?
vcf_order <- read.table("~/MRes_project_1/Codes/3_therapy_resp/vcf_file/vcf_order.txt") %>% unlist
## compare against one representative BAM order
ANSWER <- identical(as.character(order$X970N), as.character(unname(vcf_order)))
print(paste("Is the VCF order the same as the BAM order?", ANSWER))
print(vcf_order)
## build the new VCF chromosome order from the BAM order:
## keep entries 1:22 and 24:25 (chr1-22, X, Y); entry 23 (presumably chrM) is dropped
vcf_new_order <- order$X970N[c(1:22, 24:25)]
write.table(vcf_new_order, file = "~/MRes_project_1/Codes/3_therapy_resp/vcf_file/vcf_new_order.txt",
row.names = FALSE, col.names = FALSE, quote = FALSE) ## col.names = FALSE avoids a spurious "x" header line
Re-sort the VCF file according to the new chromosome order.
#!/bin/bash
#PBS -l select=1:ncpus=1:mem=100gb
#PBS -l walltime=2:00:00
#PBS -N resort_vcf_file
## reorder vcf file
cd ~/MRes_project_1/Codes/3_therapy_resp/vcf_file/
if [ ! -f sorted_vcf_file.vcf.gz ] && [ ! -f sorted_vcf_file.vcf ]; then
gunzip 00-common_all.vcf.gz
grep '^#' 00-common_all.vcf > header.vcf
## tag each body line with the rank of its chromosome, sort by rank, drop the tag;
## lines on chromosomes absent from vcf_new_order.txt (e.g. MT) are discarded
awk 'NR==FNR{a[$1]=NR; next} !/^#/ && ($1 in a) {print a[$1],$0}' vcf_new_order.txt 00-common_all.vcf | sort -k1,1n | cut -d' ' -f2- > sorted_body.vcf
cat header.vcf sorted_body.vcf > sorted_vcf_file.vcf
bgzip sorted_vcf_file.vcf
else
find . -type f -name 'sorted_vcf_file.vcf.gz' ## already sorted; print its path
fi
There are 119 ovarian cancer samples in total; 98 are paired normal-tumour samples, while the remaining 21 have only a tumour sample available. Because FACETS requires a tumour-normal pair, we use one normal sample, "X991N", as the control for the 21 tumour-only samples, and in the FACETS analysis we specify that these pairs are not matched (see the sketch below).
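At the R level, the facets package exposes this case directly: preProcSample() takes an unmatched argument for tumours paired with an unrelated normal. A minimal sketch (the file name is a placeholder):
library(facets)
rcmat <- readSnpMatrix("X123T_vs_X991N.csv") ## tumour piled up against the stand-in normal X991N
## unmatched = TRUE tells FACETS the normal is not from the same individual,
## which changes how heterozygous SNPs are identified
xx <- preProcSample(rcmat, unmatched = TRUE)
oo <- procSample(xx, cval = 150)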
For snp-pileup we build two lists of BAM file names (tumour and normal) for the snp-pileup executable to iterate over.
SAMtools (v.1.14.0) and GATK were used to acquire sequence quality statistics.
Tutorial
Depth of Sequencing Coverage
1. Sequencing coverage: the percentage (or total number) of target bases that have been sequenced, relative to a reference set of bases. In WES, different platforms have different bait sets that capture different regions of the genome and therefore have different reference sets. Providing a reference set of sequenced regions will therefore not only speed up computational methods for determining sequencing depth, but also enable accurate quantification of it. WGS, by contrast, covers all bases in the genome.
2. Sequencing depth: how many times each covered base is sequenced on average. Sequencing depth is an important consideration in any next-generation sequencing (NGS) analysis, because it plays a critical role in our ability to call germline and somatic mutations. The higher (deeper) the sequencing depth, the more statistical power we have to identify true alterations and differentiate them from artifacts (i.e., false positives). For example, when calling germline alterations we expect a variant allele fraction (VAF) of 50% (i.e., the alteration sits on one of the two chromosome copies). One way to quickly flag a potential false positive is a binomial test with an expected probability of success of 0.5, e.g. binom.test(alt_reads, depth, p = 0.5). If the proportion of alternate reads, given the sequencing depth, deviates significantly from the expected germline VAF of 50%, then the germline alteration call is likely an artifact. The higher the sequencing depth, the more confident we are that the observed VAF of a true germline alteration will be close to 50%; conversely, at a sequencing depth of 10 we cannot confidently say that an observed VAF of 30% deviates significantly from the expected 50%. In practice, one should also inspect IGV snapshots of the germline alterations when performing artifact filtering (if feasible). For somatic mutation calling, sequencing depth primarily affects our ability to call subclonal mutations (i.e., mutations present in only a fraction of tumor cells): the lower the cancer cell fraction (CCF), the higher the sequencing depth required to confidently call the mutation. The ABSOLUTE paper provides an excellent explanation of how tumor purity, local copy number, and sequencing depth affect the power to call mutations.
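The effect of depth on this test is easy to demonstrate in R (a minimal sketch with made-up read counts; binom.test is base R):
## same observed VAF (30%) at two sequencing depths; expected heterozygous-germline VAF is 50%
binom.test(3, 10, p = 0.5)$p.value   ## ~0.34: at depth 10, a 30% VAF is not distinguishable from 50%
binom.test(30, 100, p = 0.5)$p.value ## < 1e-04: at depth 100 the same VAF clearly deviates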
DepthOfCoverage (GATK3.7)
The descriptions of the inputs are as follows:
sample_ids_NT <- gsub("_sorted_nodup.bam", "", file_list) ## file_list (the BAM file names) is built in the pairing chunk below
write.table(sample_ids_NT, "~/MRes_project_1/Codes/3_therapy_resp/sample_ids_NT.txt",
row.names = FALSE, col.names = FALSE, quote = FALSE) ## col.names = FALSE avoids a spurious "x" header line
Reference FASTA file
-R Reference FASTA file for the genome reference used (e.g. hg19 or hg38)
/rds/general/user/ph323/home/MRes_project_1/docs/HH_ova/sure_select/hg38.fa
/rds/general/user/ph323/home/MRes_project_1/docs/HH_ova/sure_select/hg38.fa.fai
The genomic intervals (regions)
-L The genomic intervals (regions) that we want coverage information on.
For whole-exomes these would be the regions captured by the bait
kits. For genomes this can be left empty.
/rds/general/user/ph323/home/MRes_project_1/docs/HH_ova/sure_select/S07604514_Covered.bed
BedToIntervalList (Picard)
#!/bin/bash
#PBS -l select=1:ncpus=1:mem=100gb
#PBS -l walltime=2:00:00
#PBS -N interval_list
## change to correct directory
cd /rds/general/user/ph323/home/MRes_project_1/docs/HH_ova/sure_select
## download the fasta file and make fasta.fai file as well
if [ ! -f hg38.fa.fai ]; then
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/latest/hg38.fa.gz ## download
gunzip hg38.fa.gz ## unzip
samtools faidx hg38.fa ## create fasta.fai
else
find . -type f -name 'hg38.fa.fai'
fi
## create the interval list
if [ ! -f S07604514_Covered.interval_list ]; then
## use Picard's BedToIntervalList to create the interval_list; this assumes the
## sequence dictionary hg38.dict exists (it can be made with Picard CreateSequenceDictionary)
java -jar /rds/general/user/ph323/home/anaconda3/envs/Renv/share/picard-2.18.23-0/picard.jar BedToIntervalList \
I=S07604514_Covered.bed \
O=S07604514_Covered.interval_list \
SD=hg38.dict
else
find . -type f -name 'S07604514_Covered.interval_list' ## if exist, print S07604514_Covered.interval_list
fi
Min Base Quality
minBaseQuality The minimum base quality, determined by the sequencer, for a base to be counted in the coverage metrics. Base qualities are Phred-scaled quality scores: Q = -10 * log10(P(error)), so Q20 corresponds to a 1% error probability and Q30 to 0.1%.
Finding the minMappingQuality
--minMappingQuality The minimum mapping quality, determined by the alignment algorithm (e.g. BWA), for a read to be counted in the coverage metrics. A threshold that is too high may exclude too many reads, while one that is too low may include low-quality alignments.
#!/bin/bash
#PBS -l select=1:ncpus=1:mem=100gb
#PBS -l walltime=2:00:00
#PBS -N mapping_quality
# Directory containing BAM files
BAM_DIR="/rds/general/user/ph323/ephemeral/HH_ova/Alignment_ov"
# Output directory for the mapping quality files
OUTPUT_DIR="$HOME/MRes_project_1/docs/HH_ova/facets/mapping_quality"
cd ~/MRes_project_1/Codes/3_therapy_resp
# Loop through each line in sample_ids.txt
while IFS= read -r sample_ids_NT; do
# Construct the full path to the BAM file
bam_file="${BAM_DIR}/${sample_ids_NT}_sorted_nodup.bam"
# Construct the full path for the output file
output_file="${OUTPUT_DIR}/${sample_ids_NT}_mapping_quality.txt"
# Check if the BAM file exists
if [ -f "$bam_file" ]; then
# Run samtools and store the output in the specified output file
samtools view "$bam_file" | awk '{print $5}' | sort | uniq -c > "$output_file"
echo "Processed $bam_file and output to $output_file"
else
echo "BAM file not found: $bam_file"
fi
done < "sample_ids_NT.txt"
Add Read Group Information
cd /rds/general/user/ph323/ephemeral/HH_ova/Alignment_ov
java -jar /rds/general/user/ph323/home/anaconda3/envs/Renv/share/picard-2.18.23-0/picard.jar AddOrReplaceReadGroups \
I=X991N_sorted_nodup.bam \
O=X991N_sorted_nodup_rg.bam \
RGID=X991N_2024-01-19_1 \
RGLB=SureSelectV6_r2 \
RGPL=ILLUMINA \
RGPU=unit1 \
RGSM=X991
Performing GATK Depth of Coverage Analysis
#!/bin/bash
#PBS -l select=1:ncpus=1:mem=100gb
#PBS -l walltime=2:00:00
#PBS -N gatk_score
cd /rds/general/user/ph323/home/MRes_project_1/docs/HH_ova/sure_select/
# Directory containing BAM files
BAM_DIR="/rds/general/user/ph323/ephemeral/HH_ova/Alignment_ov"
# Output directory for the mapping quality files
OUTPUT_DIR="$HOME/MRes_project_1/docs/HH_ova/facets/base_quality"
minMappingQuality=20
minBaseQuality=30
refFasta="hg38.fasta"
intervalList="S07604514_Regions.bed"
while IFS= read -r sample_id; do
bam_file="${BAM_DIR}/X991N_sorted_nodup_rg.bam"
output_file="${OUTPUT_DIR}/${sample_id}_depthOfCoverage"
if [ -f "$bam_file" ]; then
java -Xmx15g -jar /rds/general/user/ph323/home/anaconda3/envs/Renv/opt/gatk-3.8/GenomeAnalysisTK.jar \
-R ${refFasta} \
-T DepthOfCoverage \
-o ${output_file} \
-omitBaseOutput \
-I ${bam_file} \
-L ${intervalList} \
--minBaseQuality ${minBaseQuality} \
--minMappingQuality ${minMappingQuality}
echo "Processed $bam_file and output to $output_file"
else
echo "BAM file not found: $bam_file"
fi
done < "sample_ids_NT.txt"
## identify the BAM files that are not paired
# Assuming you're working in the directory containing the files:
file_list <- list.files(path = "/rds/general/user/ph323/ephemeral/HH_ova/Alignment_ov", pattern = "_sorted_nodup.bam$")
sample_ids <- sub("(X[0-9]+)[NT]_sorted_nodup.bam", "\\1", file_list) ## extract sample ids
table_ids <- table(sample_ids)
## unpaired files (sample id appears once)
unpaired_ova_bam <- file_list[sample_ids %in% names(table_ids[table_ids == 1])]
## paired files (sample id appears twice)
paired_ova_bam <- file_list[sample_ids %in% names(table_ids[table_ids == 2])]
## Process the bam files
paired_tumour <- paired_ova_bam[grep("T_sorted_nodup.bam$", paired_ova_bam)] %>% sort ## get all the tumour samples
paired_normal <- paired_ova_bam[grep("N_sorted_nodup.bam$", paired_ova_bam)] %>% sort ## get all the normal samples
unpaired_tumour <- unpaired_ova_bam[grep("sorted_nodup.bam", unpaired_ova_bam)]
unpaired_normal <- rep("X991N_sorted_nodup.bam", length(unpaired_tumour))
## combine the files
normal <- c(unpaired_normal, paired_normal) ## the first 21 samples will not be matched
tumour <- c(unpaired_tumour, paired_tumour)
## check that the paired BAM files are matched up by randomly sampling a few
## for samples from 22 to 119, the sample_ids should be identical.
## for samples from 1 to 21, the normal sample should all be X991N.
set.seed(1234)
random <- sample(1:119, 12, replace = FALSE)
for(i in random){
print(c(i, normal[i], tumour[i]))
}
## write the ordered lists; col.names = FALSE prevents the default "x" header line,
## which the bash loops below would otherwise read as a file/sample name
write.table(tumour, "~/MRes_project_1/Codes/3_therapy_resp/tumour_order.txt",
row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(normal, "~/MRes_project_1/Codes/3_therapy_resp/normal_order.txt",
row.names = FALSE, col.names = FALSE, quote = FALSE)
write.table(unique(sample_ids), "~/MRes_project_1/Codes/3_therapy_resp/sample_ids.txt",
row.names = FALSE, col.names = FALSE, quote = FALSE)
The input VCF file should contain polymorphic SNPs, so that FACETS can infer changes in allelic configuration at genomic loci from changes in allele ratios. dbSNP is a good source for this. By default, snp-pileup also estimates the read depth in the input BAM files every 50th base.
#!/bin/bash
#PBS -l select=1:ncpus=1:mem=100gb
#PBS -l walltime=24:00:00
#PBS -N snp_pileup
## first switch to result directory to check if this step has been done for all samples
cd ~/MRes_project_1/docs/HH_ova/facets/facet_input/
file_count=$(find . -type f -name '*_ordered.csv' | wc -l) ## count the number of processed files
N=119 ## number of samples
# Compare file count
if [ $file_count -lt $N ]; then
## change directory to where the bam files were stored
cd /rds/general/user/ph323/ephemeral/HH_ova/Alignment_ov
## fixed parameters:
snp_pileup_path="/rds/general/user/ph323/home/MRes_project_1/Codes/3_therapy_resp/pileup_exe/snp-pileup"
params="-q15 -Q20 -P100 -r25,0"
vcf="/rds/general/user/ph323/home/MRes_project_1/docs/snp_pileup/sorted_vcf_file.vcf.gz"
## per-sample parameters:
normal_bams=($(cat /rds/general/user/ph323/home/MRes_project_1/Codes/3_therapy_resp/normal_order.txt))
tumour_bams=($(cat /rds/general/user/ph323/home/MRes_project_1/Codes/3_therapy_resp/tumour_order.txt))
# Loop over the BAM files
for ((i=1; i<=$N; i++)); do
## Adjust index for zero-based array
adjusted_index=$((i - 1))
## Extract normal and tumor bam file paths
normal_bam="${normal_bams[$adjusted_index]}"
tumor_bam="${tumour_bams[$adjusted_index]}"
## Construct output file path
output_file_name=$(basename "${tumor_bam}" "T_sorted_nodup.bam")
output_file="~/MRes_project_1/docs/HH_ova/facet_input/${output_file_name}.csv"
## Construct the command
cmd="$snp_pileup_path $params $vcf $output_file $normal_bam $tumor_bam"
# Run the command
echo "Running: $cmd"
eval $cmd
echo "Finished piling ${output_file_name}"
done
else
find . -type f -name '*_ordered.csv' | wc -l
fi
Reorder the pileup files and gzip them for the subsequent run-facets wrapper.
## reorder the snp-pileup outputs before running FACETS
for(cancer in sample_ids){
  setwd("~/MRes_project_1/docs/HH_ova/facets/facet_input")
  output <- paste0(cancer, "_ordered.csv")
  gzip_output <- paste0(cancer, "_ordered.csv.gz")
  if (!file.exists(output)) {
    ## reorder the snp-pileup rows into 1, 2, ..., 22, X, Y chromosome order
    temp <- read.csv(paste0("~/MRes_project_1/docs/HH_ova/facets/facet_input/", cancer, ".csv"))
    chrom_order <- c(as.character(1:22), "X", "Y")
    temp <- temp[order(match(temp$Chromosome, chrom_order), temp$Position), ]
    write.csv(temp, paste0("~/MRes_project_1/docs/HH_ova/facets/facet_input/", cancer, "_ordered.csv"),
              quote = FALSE, row.names = FALSE) ## write.csv always writes the header itself
  } else if (!file.exists(gzip_output)) {
    ## gzip the ordered file on a subsequent pass
    temp <- read.csv(paste0("~/MRes_project_1/docs/HH_ova/facets/facet_input/", cancer, "_ordered.csv"))
    write.csv(temp, gzfile(paste0("~/MRes_project_1/docs/HH_ova/facets/facet_input/", cancer, "_ordered.csv.gz")),
              row.names = FALSE)
  }
}
facetsSuite is an R package with functions to run FACETS, an allele-specific copy-number caller for paired tumor-normal DNA-sequencing data from genome-wide and targeted assays. facetsSuite both wraps the code that executes the FACETS algorithm itself and performs post-hoc analyses on the resulting data. The package was developed by members of the Taylor lab and the Computational Sciences group within the Center for Molecular Oncology at Memorial Sloan Kettering Cancer Center.
This wrapper takes the SNP “pileup” above as input and executes the FACETS algorithm. The outputs are Rdata objects, TXT files, and PNGs of each sample’s overall copy-number profile. The wrapper can run FACETS in a two-pass mode, in which a first “purity” run estimates the overall segmentation profile, sample purity, and ploidy, and the dipLogR value from this run then seeds a “high-sensitivity” run that may detect more focal events. To run in two-pass mode, specify the input arguments prefixed with purity. The cval parameters (--purity-cval and --cval) tune the segmentation coarseness.
#!/bin/bash
#PBS -l select=1:ncpus=1:mem=100gb
#PBS -l walltime=24:00:00
#PBS -N run_facet
# Set Job Home Directory
JOB_HOME=/rds/general/user/ph323/home/MRes_project_1/Codes/3_therapy_resp
cd $JOB_HOME
# Assuming sample_ids.txt is in the JOB_HOME directory
# Use full path if the file is somewhere else
SAMPLE_IDS_PATH="$JOB_HOME/sample_ids.txt"
if [ -f "$SAMPLE_IDS_PATH" ]; then
mapfile -t cancer_list < "$SAMPLE_IDS_PATH"
# Loop through each cancer type
for cancer in "${cancer_list[@]}"; do
counts_file="/rds/general/user/ph323/home/MRes_project_1/docs/HH_ova/facets/facet_input/${cancer}_ordered.csv.gz"
directory="/rds/general/user/ph323/home/MRes_project_1/docs/HH_ova/facets/facet_output/${cancer}"
mkdir -p "$directory"
$JOB_HOME/facets-suite/run-facets-wrapper.R \
--counts-file "$counts_file" \
--sample-id ${cancer} \
--directory "$directory" \
--everything \
--genome 'hg38' \
--cval 25 \
--purity-cval 50 \
--facets-lib-path "/rds/general/user/ph323/home/anaconda3/envs/Renv/lib/R/library"
echo "Processed ${cancer}"
done
else
echo "Error: sample_ids.txt not found in $SAMPLE_IDS_PATH"
exit 1
fi
The above command runs FACETS in the two-pass mode: first at cval 50, then at cval 25 seeded with the sample-specific dipLogR baseline found at the higher cval. The full suite of analysis and QC is run with the --everything flag. If no output directory is specified, a directory named after the sample ID is created.
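In terms of the underlying facets R API, the two-pass logic corresponds roughly to the following sketch (placeholder input file; cvals as above):
library(facets)
set.seed(1234)
rcmat <- readSnpMatrix("X991_ordered.csv.gz")
xx <- preProcSample(rcmat, gbuild = "hg38")
## pass 1 ("purity" run): coarse segmentation fixes the diploid log-ratio baseline
oo_purity <- procSample(xx, cval = 50)
dlr <- oo_purity$dipLogR
## pass 2 ("hisens" run): finer segmentation seeded with that baseline
oo_hisens <- procSample(xx, cval = 25, dipLogR = dlr)
fit <- emcncf(oo_hisens)
c(purity = fit$purity, ploidy = fit$ploidy)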
Create segmentation files.
facet_process <- function(sample_ids, seg_file_type){
## sample_id from sample_ids
## seg_file_type: _hisens_diplogR.adjusted.seg, _hisens_diplogR.unadjusted.seg,
## _purity_diplogR.adjusted.seg, _purity_diplogR.unadjusted.seg
## build an empty dataframe
temp <- matrix(NA, nrow = 1, ncol = 6) %>% as.data.frame
colnames(temp) <- c("ID", "chrom", "loc.start", "loc.end", "num.mark", "seg.mean")
## loop through each directory and get the dataframe
for(sample_id in sample_ids){
## set directory and switch to directory
directory <- paste0("/rds/general/user/ph323/home/MRes_project_1/docs/HH_ova/facets/facet_output/", sample_id)
setwd(directory)
## import the segmentation file of the requested type
segmented_file <- read.table(paste0(sample_id, seg_file_type), header = TRUE)
temp <- rbind(temp, segmented_file)
}
## remove the first NA row
temp <- temp[2:nrow(temp), ]
## print the result
print(gsub("_|\\.", " ", seg_file_type))
print(summary(temp))
## write the file
setwd("/rds/general/user/ph323/home/MRes_project_1/docs/HH_ova/facets/segmentation_file")
write.table(temp, seg_file_type, quote = FALSE, row.names = FALSE, col.names = FALSE, sep = "\t")
## return the file
return(temp)
}
hisens_adj <- facet_process(sample_ids, "_hisens_diplogR.adjusted.seg")
hisens_unadj <- facet_process(sample_ids, "_hisens_diplogR.unadjusted.seg")
purity_adj <- facet_process(sample_ids, "_purity_diplogR.adjusted.seg")
purity_unadj <- facet_process(sample_ids, "_purity_diplogR.unadjusted.seg")
Visualise our segmentation file
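For example, the combined table can be drawn as a cohort copy-number heatmap with GenVisR's cnSpec function, which expects the columns chromosome, start, end, segmean, and sample; a minimal sketch under that assumption:
library(dplyr)
library(GenVisR)
## map the FACETS .seg columns onto the layout cnSpec expects;
## note FACETS encodes chromosomes numerically (23 = X), so recoding may be needed
cn_input <- hisens_adj %>%
  transmute(chromosome = chrom, start = loc.start, end = loc.end,
            segmean = seg.mean, sample = ID)
cnSpec(cn_input, genome = "hg38")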