1 Find Flanking Sequance

First, prepare a list of somatic mutations in the tidy format. For instance, it may look like:

sample  chr posi    ReferenceAllele MutantAllele    VariantType Effect  VAF
SCMh0011-Spe001 10  135165604   GG  AA  DNV altering    0.0704295393796233
SCMh0011-Spe001 10  17024500    G   T   SNV altering    0.00812557710064635
SCMh0011-Spe001 10  25886955    G   A   SNV non-aa-change   0.00308150037190522
SCMh0011-Spe001 10  25887075    GG  AA  DNV altering    0.0042849387311928
SCMh0011-Spe001 10  27702997    C   T   SNV non-aa-change   0.00692584726198172
SCMh0011-Spe001 10  49939423    C   T   SNV non-aa-change   0.00189147362990985
SCMh0011-Spe001 10  68857458    G   A   SNV non-aa-change   0.00600043168573279
SCMh0011-Spe001 1   103379969   G   A   SNV non-aa-change   0.00703169432811127
SCMh0011-Spe001 1   111216246   G   C   SNV altering    0.00281951532888397
. . .

The somatic mutation file can have as many columns as you like and may or may not have a header line, but the first 5 columns must have the following variables in the same order as shown below:

1. SampleID  
2. Chromosome (1,2,...,X,Y)
3. Genomic coordinate 
4. Reference Allele
5. Mutant Allele

For example, the following file also works fine:

SAMPLE_ID   CHROM   POS REF ALT Protein_affecting   TYPE
P19_0024_BTR_01 1   26729450    T   C   non_protein_affecting   SNV
P19_0014_BDO_01 1   26729474    TAG T   non_protein_affecting   DELETION
P19_0023_BDO_01 1   26729479    TTC T   non_protein_affecting   DELETION
P19_0023_BDO_01 1   26729563    G   C   non_protein_affecting   SNV
P19_0040_BTR_01 1   26729566    TA  T   non_protein_affecting   DELETION
. . .

Download an R script getFlank.R into your project directory in the vortex cluster. Depending on the reference genome you use, you must modify the reference gene in the script:

prefix="/projects/rpci/songliu/mkorobkin/projects/Lei/motif/"
#rfile=paste0(prefix, "ref/", chr, ".fa")  ## human
#rfile=paste0(prefix, "ref2/", chr, ".fa")  ## mouse
rfile=paste0(prefix, "ref3/chr", chr, ".fa")  ## human(GRCh38)

Reference gene files in the above directories are split into chromosomes, since both evaluation of motif density in the reference genome and normalization of motif counts are done chromosome by chromosome. If you want to generate reference gene files in your project directory, then you can use such bash script as follows:

#rfile=/projects/rpci/shared/reference/human_hs37d5/hs37d5.fa  ## hg19
rfile=/projects/rpci/shared/reference/GRCh38/GRCh38.primary_assembly.genome.fa  ##hg38
csplit -s -z $rfile '/>/' '{*}'

for i in xx* ; do 
  n=$(sed 's/>// ; s/ .*// ; 1q' "$i") ;
  mv "$i" "$n.fa" ; 
done

Next, download a slurm batch file getFlank.slurm into your project directory. Modify it to suit your job, and submit by typing:

sbatch ./getFlank.slurm

Then, output files will be written in the OUTDIR variable you specify in the slurm batch file. They’ll have everything in the input file plus a flanking sequence of +/-6 bases around the SNV/DNV/TNV positions added in the last field, e.g.,

SCMh0031-Spn006 1   7724254 C   T   SNV non-aa-change   0.0023117999479031  GTACTCCTCCTCC
SCMh0034-Spn003 1   7724256 C   T   SNV altering    0.00121970343782126 ACTCCTCCTCCGC
SCMh0039-Spe005 1   7724259 C   T   SNV altering    0.00521587945523037 CCTCCTCCGCGGC
SCMh0047-Spn007 1   7724262 C   T   SNV altering    0.00131621187800963 CCTCCGCGGCGGC
SCMh0013-Spe004 1   7724283 C   T   SNV altering    0.00131842828492906 CCAGCTCCCTCAC
SCMh0013-Spe004 1   7724284 C   T   SNV non-aa-change   0.00136161005726223 CAGCTCCCTCACC
SCMh0016-Spe002 1   7724284 C   T   SNV non-aa-change   0.00604823802552386 CAGCTCCCTCACC
SCMh0044-Spe005 1   7724284 C   T   SNV non-aa-change   0.0199387461716357  CAGCTCCCTCACC
SCMh0013-Spe005 1   7724289 C   T   SNV altering    0.00135843484680694 CCCTCACCCTGAC
SCMh0034-Spe003 1   7724299 C   T   SNV non-aa-change   0.00119746487923129 GACCGCCGGCTCC
. . .

2 Count 7-base Motifs in Reference Gene

Next, download an R script count7mer.R into your project directory in the vortex cluster. Then, modify the reference gene file (i.e., rfile variable) in the R script, so it is consistent to the last section.

prefix="/projects/rpci/songliu/mkorobkin/projects/Lei/motif/"
#rfile=paste0(prefix, "ref/", chr, ".fa")  ## human
#rfile=paste0(prefix, "ref2/", chr, ".fa")  ## mouse
rfile=paste0(prefix, "ref3/chr", chr, ".fa")  ## human(GRCh38)

Download a slurm batch file count7mer.slurm. In this batch script, you need to specify a BED file (i.e., rgfile variable) which shows the target regions of your study.

#rgfile=/projects/rpci/songliu/mkorobkin/projects/RQ022869-Paragh/PGD770.coveredRegion_HumanWide_hg19_clean.bed   ## human
#rgfile=/projects/rpci/songliu/mkorobkin/projects/RQ024165-Paragh_1/PGD843.coveredRegion_Mouse_clean.bed    ## mouse
rgfile=/projects/rpci/songliu/mkorobkin/projects/Lei/motif/data/Calvet2025/Calvet2025_target_region.bed  ## Calvet2025

For example, a bed file may look like:

1   26729650    26729863    NM_006015.6_cds_1_0_chr1_26729651_f
1   26731151    26731604    NM_006015.6_cds_2_0_chr1_26731152_f
1   26732675    26732792    NM_006015.6_cds_3_0_chr1_26732676_f
1   26760855    26761096    NM_006015.6_cds_4_0_chr1_26760856_f
1   26761383    26761473    NM_006015.6_cds_5_0_chr1_26761384_f
. . .

Note that the 4th field of a BED file (region names) can be empty. The count7mer.R script counts the occurrance of 7-base motifs in each target region chromosome by chromosome in both forward and backward directions. Submit your job by typing:

sbatch ./count7mer.slurm

Output files will be written in the OUTDIR variable you specify in the slurm batch file. There will be output for each chrommosome (e.g., 7mer_[projectID].[chr].dat):

aaaaaaa 0   0
aaaaaac 0   0
aaaaaag 0   1
aaaaaat 1   0
aaaaaca 0   2
. . .

3 Count 7-base Motifs in SNV data

In this third step, motifs in the flanking sequence (which was found in the first step) are counted with repetition in each chromosome, both in the forward and backward directions combined.
Download both an R script: statMotif.R and a slurm batch script: statMotif.slurm. Modify them, and submit job as follows:

sbatch ./statMotif.slurm

Output file will have 3 columns: motif, repeated counts (\(=n_i\)), and number of occurrance in target regions (\(=N_i\)).

gatttcc 109 2
ttccgct 107 3
tttccgc 107 2
atttccg 106 3
ccgcttt 105 3
tccgctt 105 3
ccggaag 99  3
ggatttc 98  2
...           

Here, motifs are sorted in the descending order of repeated counts (the 2nd field).

Download a slurm batch script statall.slurm, modify, and submit:

sbatch ./statall.slurm

This batch script is a single-CPU job and sums up the repeated counts from all chromosomes with a weight; that is, \(\sum_i(n_i/N_i)\). Note that weight for a motif will be different for each chromosome, as we counted \(N_i\) for each chromosome separately (in the second step).

cggaagg 1249.08 2497
ccttccg 1249.08 2497
cttccgg 578.983 1409
ccggaag 578.983 1409
ggaaggc 467.071 497
gccttcc 467.071 497
...

The 2nd field is the weighted sum, and the 3rd field is the non-weighted sum (\(=\sum_i n_i\)) of repeated counts.

4 Assign a Motif per SNV

In the last step, each somatic mutation is matched with only one motif in its flanking sequence that had the largest repeated (and weighted) counts in the third step.

Download both an R script: whichMotif.R and a slurm batch script: whichMotif.slurm. Modify them, and submit job as follows:

sbatch ./whichMotif.slurm

Output file ([ProjectID]_mutations_non-aa-change_motifs+weight.[chr].dat) will have \(x+4\) columns, where \(x\) is the number of fields in the input file (i.e., the somatic mutation data in the first step).

1-x. The same as input file (x is the number of fields)
x+1. Flanking sequence in the forward direction
x+2. Flanking sequence in the backward direction
x+3. Motif information (e.g., CCTTCCG:+:5:C>T)
x+4: Weight 

The motif information contains motif, direction of motif-containing strand (+ or -), the location of mutation [1-7], and mutation type, separated by colons (:). The weight is \(\sum_i (n_i/N_i)\) found in the third step and different for each chromosome.

Finally, to obtain the list of motifs in the order of non-repeated counts from this output file: download a slurm batch script: statall2.slurm. Modify them, and submit job as follows:

sbatch ./statall2.slurm

Output file ([ProjectID]_statall2_weighted.out) will have the non-repeated (and weighted) counts of motifs from all chromosomes: e.g.,

CCTTCCG CGGAAGG 1250.08 2498
CTTCCGG CCGGAAG 199.2 545
TTTCCGC GCGGAAA 112.5 167
GTTTCCG CGGAAAC 104 104
GTTTTCC GGAAAAC 76.9286 136
CTCTATA TATAGAG 70 70
TTTTCCT AGGAAAA 53.05 202
CTTCCGC GCGGAAG 52.1667 108
CTCCGTC GACGGAG 46 56
ATATCCT AGGATAT 45.5 46
...

From the first to the 4th column, they are 1) the motif, 2) its reverse complement, 3) non-repeated and weighted counts, and 4) non-repeated and un-weighted counts.