First, prepare a list of somatic mutations in the tidy format. For instance, it may look like:
sample chr posi ReferenceAllele MutantAllele VariantType Effect VAF
SCMh0011-Spe001 10 135165604 GG AA DNV altering 0.0704295393796233
SCMh0011-Spe001 10 17024500 G T SNV altering 0.00812557710064635
SCMh0011-Spe001 10 25886955 G A SNV non-aa-change 0.00308150037190522
SCMh0011-Spe001 10 25887075 GG AA DNV altering 0.0042849387311928
SCMh0011-Spe001 10 27702997 C T SNV non-aa-change 0.00692584726198172
SCMh0011-Spe001 10 49939423 C T SNV non-aa-change 0.00189147362990985
SCMh0011-Spe001 10 68857458 G A SNV non-aa-change 0.00600043168573279
SCMh0011-Spe001 1 103379969 G A SNV non-aa-change 0.00703169432811127
SCMh0011-Spe001 1 111216246 G C SNV altering 0.00281951532888397
. . .
The somatic mutation file can have as many columns as you like and may or may not have a header line, but the first 5 columns must have the following variables in the same order as shown below:
1. SampleID
2. Chromosome (1,2,...,X,Y)
3. Genomic coordinate
4. Reference Allele
5. Mutant Allele
For example, the following file also works fine:
SAMPLE_ID CHROM POS REF ALT Protein_affecting TYPE
P19_0024_BTR_01 1 26729450 T C non_protein_affecting SNV
P19_0014_BDO_01 1 26729474 TAG T non_protein_affecting DELETION
P19_0023_BDO_01 1 26729479 TTC T non_protein_affecting DELETION
P19_0023_BDO_01 1 26729563 G C non_protein_affecting SNV
P19_0040_BTR_01 1 26729566 TA T non_protein_affecting DELETION
. . .
Download an R script getFlank.R into your project directory in the vortex cluster. Depending on the reference genome you use, you must modify the reference gene in the script:
prefix="/projects/rpci/songliu/mkorobkin/projects/Lei/motif/"
#rfile=paste0(prefix, "ref/", chr, ".fa") ## human
#rfile=paste0(prefix, "ref2/", chr, ".fa") ## mouse
rfile=paste0(prefix, "ref3/chr", chr, ".fa") ## human(GRCh38)
Reference gene files in the above directories are split into chromosomes, since both evaluation of motif density in the reference genome and normalization of motif counts are done chromosome by chromosome. If you want to generate reference gene files in your project directory, then you can use such bash script as follows:
#rfile=/projects/rpci/shared/reference/human_hs37d5/hs37d5.fa ## hg19
rfile=/projects/rpci/shared/reference/GRCh38/GRCh38.primary_assembly.genome.fa ##hg38
csplit -s -z $rfile '/>/' '{*}'
for i in xx* ; do
n=$(sed 's/>// ; s/ .*// ; 1q' "$i") ;
mv "$i" "$n.fa" ;
done
Next, download a slurm batch file getFlank.slurm into your project directory. Modify it to suit your job, and submit by typing:
sbatch ./getFlank.slurm
Then, output files will be written in the OUTDIR
variable you specify in the slurm batch file. They’ll have everything in
the input file plus a flanking sequence of +/-6 bases around the
SNV/DNV/TNV positions added in the last field, e.g.,
SCMh0031-Spn006 1 7724254 C T SNV non-aa-change 0.0023117999479031 GTACTCCTCCTCC
SCMh0034-Spn003 1 7724256 C T SNV altering 0.00121970343782126 ACTCCTCCTCCGC
SCMh0039-Spe005 1 7724259 C T SNV altering 0.00521587945523037 CCTCCTCCGCGGC
SCMh0047-Spn007 1 7724262 C T SNV altering 0.00131621187800963 CCTCCGCGGCGGC
SCMh0013-Spe004 1 7724283 C T SNV altering 0.00131842828492906 CCAGCTCCCTCAC
SCMh0013-Spe004 1 7724284 C T SNV non-aa-change 0.00136161005726223 CAGCTCCCTCACC
SCMh0016-Spe002 1 7724284 C T SNV non-aa-change 0.00604823802552386 CAGCTCCCTCACC
SCMh0044-Spe005 1 7724284 C T SNV non-aa-change 0.0199387461716357 CAGCTCCCTCACC
SCMh0013-Spe005 1 7724289 C T SNV altering 0.00135843484680694 CCCTCACCCTGAC
SCMh0034-Spe003 1 7724299 C T SNV non-aa-change 0.00119746487923129 GACCGCCGGCTCC
. . .
Next, download an R script count7mer.R into your
project directory in the vortex cluster. Then, modify the reference gene
file (i.e., rfile variable) in the R script, so it is
consistent to the last section.
prefix="/projects/rpci/songliu/mkorobkin/projects/Lei/motif/"
#rfile=paste0(prefix, "ref/", chr, ".fa") ## human
#rfile=paste0(prefix, "ref2/", chr, ".fa") ## mouse
rfile=paste0(prefix, "ref3/chr", chr, ".fa") ## human(GRCh38)
Download a slurm batch file count7mer.slurm. In this
batch script, you need to specify a BED file (i.e., rgfile
variable) which shows the target regions of your study.
#rgfile=/projects/rpci/songliu/mkorobkin/projects/RQ022869-Paragh/PGD770.coveredRegion_HumanWide_hg19_clean.bed ## human
#rgfile=/projects/rpci/songliu/mkorobkin/projects/RQ024165-Paragh_1/PGD843.coveredRegion_Mouse_clean.bed ## mouse
rgfile=/projects/rpci/songliu/mkorobkin/projects/Lei/motif/data/Calvet2025/Calvet2025_target_region.bed ## Calvet2025
For example, a bed file may look like:
1 26729650 26729863 NM_006015.6_cds_1_0_chr1_26729651_f
1 26731151 26731604 NM_006015.6_cds_2_0_chr1_26731152_f
1 26732675 26732792 NM_006015.6_cds_3_0_chr1_26732676_f
1 26760855 26761096 NM_006015.6_cds_4_0_chr1_26760856_f
1 26761383 26761473 NM_006015.6_cds_5_0_chr1_26761384_f
. . .
Note that the 4th field of a BED file (region names) can be empty.
The count7mer.R script counts the occurrance of 7-base
motifs in each target region chromosome by chromosome in both forward
and backward directions. Submit your job by typing:
sbatch ./count7mer.slurm
Output files will be written in the OUTDIR variable you
specify in the slurm batch file. There will be output for each
chrommosome (e.g., 7mer_[projectID].[chr].dat):
aaaaaaa 0 0
aaaaaac 0 0
aaaaaag 0 1
aaaaaat 1 0
aaaaaca 0 2
. . .
In this third step, motifs in the flanking sequence (which was found
in the first step) are counted with repetition in each chromosome, both
in the forward and backward directions combined.
Download both an R script: statMotif.R and a slurm
batch script: statMotif.slurm. Modify
them, and submit job as follows:
sbatch ./statMotif.slurm
Output file will have 3 columns: motif, repeated counts (\(=n_i\)), and number of occurrance in target regions (\(=N_i\)).
gatttcc 109 2
ttccgct 107 3
tttccgc 107 2
atttccg 106 3
ccgcttt 105 3
tccgctt 105 3
ccggaag 99 3
ggatttc 98 2
...
Here, motifs are sorted in the descending order of repeated counts (the 2nd field).
Download a slurm batch script statall.slurm, modify, and submit:
sbatch ./statall.slurm
This batch script is a single-CPU job and sums up the repeated counts from all chromosomes with a weight; that is, \(\sum_i(n_i/N_i)\). Note that weight for a motif will be different for each chromosome, as we counted \(N_i\) for each chromosome separately (in the second step).
cggaagg 1249.08 2497
ccttccg 1249.08 2497
cttccgg 578.983 1409
ccggaag 578.983 1409
ggaaggc 467.071 497
gccttcc 467.071 497
...
The 2nd field is the weighted sum, and the 3rd field is the non-weighted sum (\(=\sum_i n_i\)) of repeated counts.
In the last step, each somatic mutation is matched with only one motif in its flanking sequence that had the largest repeated (and weighted) counts in the third step.
Download both an R script: whichMotif.R and a slurm batch script: whichMotif.slurm. Modify them, and submit job as follows:
sbatch ./whichMotif.slurm
Output file
([ProjectID]_mutations_non-aa-change_motifs+weight.[chr].dat)
will have \(x+4\) columns, where \(x\) is the number of fields in the input
file (i.e., the somatic mutation data in the first step).
1-x. The same as input file (x is the number of fields)
x+1. Flanking sequence in the forward direction
x+2. Flanking sequence in the backward direction
x+3. Motif information (e.g., CCTTCCG:+:5:C>T)
x+4: Weight
The motif information contains motif, direction of motif-containing strand (+ or -), the location of mutation [1-7], and mutation type, separated by colons (:). The weight is \(\sum_i (n_i/N_i)\) found in the third step and different for each chromosome.
Finally, to obtain the list of motifs in the order of non-repeated counts from this output file: download a slurm batch script: statall2.slurm. Modify them, and submit job as follows:
sbatch ./statall2.slurm
Output file ([ProjectID]_statall2_weighted.out) will
have the non-repeated (and weighted) counts of motifs from all
chromosomes: e.g.,
CCTTCCG CGGAAGG 1250.08 2498
CTTCCGG CCGGAAG 199.2 545
TTTCCGC GCGGAAA 112.5 167
GTTTCCG CGGAAAC 104 104
GTTTTCC GGAAAAC 76.9286 136
CTCTATA TATAGAG 70 70
TTTTCCT AGGAAAA 53.05 202
CTTCCGC GCGGAAG 52.1667 108
CTCCGTC GACGGAG 46 56
ATATCCT AGGATAT 45.5 46
...
From the first to the 4th column, they are 1) the motif, 2) its reverse complement, 3) non-repeated and weighted counts, and 4) non-repeated and un-weighted counts.