Introduction

IgIDivA [Immunoglobulin Intraclonal Diversification Analysis] is a purpose-built tool for the analysis of the intraclonal diversification process using high-throughput sequencing data.
It is written in R with the shiny package, and every step of the analysis can be performed interactively, so no programming skills are required.
It takes as input the output files “clonotypes_computation” and “grouped_alignment_nt” from the tripr package.
Functions for an R command-line use are also available.

Installation

The IgIDivA scripts can be freely downloaded here. IgIDivA requires R [version “4.1”], which can be installed on any operating system [e.g., Linux, Windows, macOS] from CRAN. Installation with Docker will be available in the future.

The following packages need to be installed in the R session:

install.packages("shiny")
install.packages("shinyFiles")
install.packages("fs")
install.packages("pdftools")
install.packages("purrr")
install.packages("DT")
install.packages("bslib")
install.packages("shinyhelper")
install.packages("data.table")
install.packages("stringr")
install.packages("RGenetics")
install.packages("dplyr")
install.packages("ggsci")
install.packages("tidygraph")
install.packages("ggraph")
install.packages("igraph")
install.packages("ggplot2")
install.packages("ggpubr")
install.packages("rstatix")
install.packages("shinyvalidate")
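The calls above reinstall every package, even when it is already present. A minimal sketch of an alternative, assuming all the packages are available from the configured repository, installs only those that are missing. The helper name `missing_packages` is hypothetical and is not part of IgIDivA.

```r
# Packages listed above as IgIDivA dependencies.
igidiva_pkgs <- c("shiny", "shinyFiles", "fs", "pdftools", "purrr", "DT",
                  "bslib", "shinyhelper", "data.table", "stringr", "RGenetics",
                  "dplyr", "ggsci", "tidygraph", "ggraph", "igraph", "ggplot2",
                  "ggpubr", "rstatix", "shinyvalidate")

# Hypothetical helper: return the packages that are not installed yet.
missing_packages <- function(pkgs, installed = rownames(installed.packages())) {
  setdiff(pkgs, installed)
}

# Install only what is missing (uncomment to run; requires internet access).
# to_install <- missing_packages(igidiva_pkgs)
# if (length(to_install) > 0) install.packages(to_install)
```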

All the IgIDivA scripts need to be downloaded into the same folder. The input files should be stored together in a separate folder.

Alternatively, IgIDivA can be installed using a conda environment. We recommend using Miniconda to install all the dependencies, which are listed in .yml format in the IgIDivA GitHub repository. The yml file and the IgIDivA scripts need to be stored in the working folder. After downloading all the files, a terminal should be opened and the following commands entered:

conda env create -f IgIDivA.yml 
conda activate IgIDivA
R
install.packages(c("shinyvalidate", "RGenetics","rstatix"))
q()
Rscript app.R

This will print a URL that can be pasted into a web browser to open the IgIDivA app.

Download an example dataset as Input for IgIDivA

An example dataset to be used as Input for IgIDivA can be found here. The dataset comprises the tripr output files [“highly_sim_all_clonotypes” and “Grouped Alignment_nt”] of 26 chronic lymphocytic leukemia (CLL) samples [19 CLL subset #2 samples and 7 CLL subset #169 samples]. The data were retrieved from ENA under the accession number PRJEB36589, and subsequently processed with IMGT/HighV-QUEST and tripr.
Each sample’s data can be downloaded by pressing the button Download.
Alternatively, to download all the data at the same time, the following commands can be used in the R session:

install.packages("zen4R")
library(zen4R)
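# Store the dataset in an "Input" folder inside the working directory
path <- file.path(getwd(), "Input")
if (!dir.exists(path)) {
  dir.create(path)
}
# Download all files of the Zenodo record into that folder
zen4R::download_zenodo("10.5281/zenodo.6616046", path = path)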

[The variable “path” can be changed to the location where the user wants to store the Input].
Note: warnings might appear in RStudio indicating that the downloaded length of some files != reported length. This means that those files were not downloaded completely [probably due to the Internet speed]. One solution is to increase the download timeout in RStudio with this command:

options(timeout = max(600, getOption("timeout")))

Running IgIDivA as a shiny application

In order to start the shiny app, the script app.R should be opened in RStudio and the button Run App should be pressed.
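Alternatively, the app can be launched from any R console. The sketch below assumes that app.R is located in the folder passed to the function; `launch_igidiva` is a hypothetical wrapper around `shiny::runApp()` and is not an IgIDivA function.

```r
# Hypothetical wrapper: start the app from the folder that contains app.R.
launch_igidiva <- function(app_dir = getwd()) {
  app_file <- file.path(app_dir, "app.R")
  if (!file.exists(app_file)) {
    stop("app.R not found in: ", app_dir)
  }
  # runApp() accepts a directory containing app.R and blocks until the app stops.
  shiny::runApp(app_dir)
}

# Usage (uncomment and adjust the path, which is an assumption):
# launch_igidiva("path/to/IgIDivA")
```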

Import data

In this tab users can create the folders where the results will be stored and import their data.

Import data: Select/Create Output folder

First, the user should specify the Results folder. To do so, the user can go to the folder on their computer where they would like to store the output and press copy address as text:

Then, the copied path should be pasted in the area Enter desired path here, together with a “/” and the name of the Results folder that the user wants to create [e.g. “Results”]. If a folder with this name does not exist, it will be created:

Then, the button Create Results Path should be pressed.

Import data: Choose input directory

The next step consists of selecting the Input folder. Following the same approach as for the output folder, the user enters the path where the Input files are stored. For each sample, the tool takes as input the tripr output files “highly similar clonotype computation” and “grouped alignment nt”, in text format (.txt).
The input folder is selected:

Once the Input folder address has been added, users should verify it by pressing the button Upload. Then users can choose which samples from the Input folder they want to include in the analysis.
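Before entering the path in the app, the contents of the Input folder can be inspected from R. The sketch below only lists the .txt files in a folder; the folder name in the usage line and the helper `list_input_files` are assumptions for illustration.

```r
# Hypothetical helper: list the text files (.txt) stored in the Input folder.
list_input_files <- function(input_dir) {
  if (!dir.exists(input_dir)) stop("Input folder not found: ", input_dir)
  list.files(input_dir, pattern = "\\.txt$", full.names = FALSE)
}

# Usage, assuming the example dataset was downloaded to "Input":
# list_input_files(file.path(getwd(), "Input"))
```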

Users should subsequently verify the selection by pressing the button Verify. Please mind the order of the steps: if the output folder is changed, the Verify button must be pressed again for the selected samples.

Import data: Including groups to compare (optional)

In order to make comparisons between groups of samples, the user needs to create a tab-delimited file with two columns.
The first column should be named “sample_id” and should include the names of the samples.
The second column should include the name of the group that each sample belongs to. By default the name of the column is “group_name”, but it can be modified in the Enter the name chosen for the second column button. The file would look like this:

An example file can be found here as “SampleGroups.txt”; the samples correspond to the data mentioned before.
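A file with this layout can also be written from R. In the sketch below the sample names and group labels are placeholders, not the contents of the example “SampleGroups.txt”; only the column names “sample_id” and “group_name” come from the description above.

```r
# Placeholder sample names and groups (replace with your own).
sample_groups <- data.frame(
  sample_id  = c("sample_A", "sample_B", "sample_C", "sample_D"),
  group_name = c("group_1", "group_1", "group_2", "group_2"),
  stringsAsFactors = FALSE
)

# Write a tab-delimited file with a header row and no quotes or row names.
groups_file <- file.path(tempdir(), "SampleGroups.txt")
write.table(sample_groups, groups_file, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read it back to confirm the layout.
check <- read.delim(groups_file, stringsAsFactors = FALSE)
print(check)
```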

Once created, the file can be uploaded through the Browse button. When it is uploaded, a message “Upload completed” will appear. Then, the tab “Set Parameters” should be opened.

Set Parameters

There are different parameters that can be applied:

  • Enter starting column:
    From the Grouped Alignment file, the user can choose which column corresponds to the beginning of the sequence. If the experimental procedure amplifies the whole immunoglobulin [with leader primers, for example], the starting column should be 5 [the initial 4 columns of the file contain additional information]. If the experimental procedure uses primers that bind further downstream, the starting column should be changed [for primers binding to the FR1 region of the immunoglobulin, for example, the starting column could be 23 or 59, depending on the binding region]. The default is position 5.
  • Enter ending column:
    From the Grouped Alignment file, the user can choose which column corresponds to the end of the sequence [the end of the FR3 region]. The default is position 313.
  • Enter threshold minimum reads for the nodes:
    The user can choose the minimum number of reads that need to be part of a nucleotide variant (node) for it to be considered in the analysis. The default is 10.
  • Enter p-value threshold:
    For the metrics comparison between groups of samples, the user can choose the p-value threshold for a comparison to be considered statistically significant. The default is 0.05.
  • Do you want the p-values to be adjusted?:
    The user can choose between p-value or adjusted p-value. The default is not-adjusted.

  • Clonotypes to be taken into account for the analysis:
    Option for the user to choose the clonotypes to be included in the analysis. One approach would be, for example, to include the first [the most frequent] clonotype. The default is 1.
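To make the column and read thresholds more concrete, the sketch below applies them to toy tables. The table layout, the column names and the helpers `select_region` and `filter_nodes` are assumptions for illustration and do not reproduce IgIDivA's internal code.

```r
# Hypothetical helper: keep only the columns between start and end (inclusive).
select_region <- function(alignment, start_col = 5, end_col = 313) {
  stopifnot(start_col >= 1, end_col <= ncol(alignment), start_col <= end_col)
  alignment[, start_col:end_col, drop = FALSE]
}

# Hypothetical helper: keep nt vars supported by at least `min_reads` reads.
filter_nodes <- function(nodes, min_reads = 10) {
  nodes[nodes$reads >= min_reads, , drop = FALSE]
}

# Toy alignment: 4 information columns followed by 6 sequence positions.
toy <- as.data.frame(matrix("-", nrow = 2, ncol = 10), stringsAsFactors = FALSE)
region <- select_region(toy, start_col = 5, end_col = 10)
ncol(region)  # 6 sequence positions

# Toy nodes: with the default threshold of 10 reads, v3 is dropped.
toy_nodes <- data.frame(nt_var = c("v1", "v2", "v3"), reads = c(250, 10, 4),
                        stringsAsFactors = FALSE)
filter_nodes(toy_nodes)$nt_var  # "v1" "v2"
```

With the default columns (5 to 313), the selected region spans 313 - 5 + 1 = 309 columns.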

Parameters: processing

There are different options for the analysis that can be selected:

  • Summary tables:
    Tables with summary information [regarding nucleotide variants, sequences, mutational level, etc.] will be produced throughout the process.
  • Jumps between non-adjacent nodes:
    If selected, jumps are allowed: nucleotide variants (nt vars) that share somatic hypermutations (SHMs) but differ by two or more SHMs will be included.
  • Separate graphs:
    If selected, the graph network of each sample will be separated into two different graphs: on the left, the main nt var and the nt vars with fewer SHMs than the main nt var [the “less mutations pathway”] and on the right the main nt var and the nt vars with additional mutations. The different levels of mutations are aligned in both graph networks. This parameter affects only the visualization. By default this parameter is “off”.
  • Amino-acid mutations:
    If selected, SHMs will also be analyzed at the amino acid level. Replacement mutations will be shown in the graph, and tables with the replacement mutations will be produced.
  • Size scaling of nodes proportional to reads:
    If selected, the size of the nodes of the graph networks will be proportional to the number of reads of the respective nucleotide variants.
  • Graph metrics:
    For each sample, different graph metrics will be calculated (description of the metrics below).
  • Graph networks:
    For each sample, a graph network representing the intraclonal diversification will be produced.
  • Metrics comparison:
    If the above Graph metrics option is selected, there is the option of performing metrics comparison between different groups of samples. For this option to work, the “SampleGroups.txt” file has to be added in the Import data tab.

Parameters: metrics

There are different metrics [or related calculations] that can be calculated for the description and determination of the intraclonal diversification level:

  • Main variant identity:
    Percentage (%) of identity of the main nucleotide variant with its respective germline.
  • Relative convergence (reads):
    Graph metric “convergence score”. Ratio of the number of sequences of the most relevant pathways to the number of sequences of the main nucleotide variant. It shows the tendency for the BcR IG sequences to accumulate in the main nt var or to acquire additional convergent SHMs.
  • Most relevant pathway score:
    Each block of pathways that leads to a particular end node gets a score: the ratio of the total number of sequences of the nodes forming that block to the total number of sequences of all the nodes of the network with more SHMs than the main nt var. The block with the highest score is the most relevant pathway and is the one used to calculate the relative convergence (convergence score).
  • Most relevant pathway score (nodes):
    Number of nodes of the most relevant pathway.
  • End nodes density:
    Graph metric. Ratio of the number of end nodes to the number of nucleotide variants with additional SHMs. It shows the randomness or specificity of the mutational path.
  • Max path length:
    Graph metric. Number of levels of additional SHMs. It shows the complexity of the mutational pathways.
  • Max mutations path length:
    Graph metric “maximal mutational length”. Maximum level of additional SHMs. It shows the complexity of the mutational pathway, allowing non-consecutive SHMs.
  • Total reads:
    Total number of reads of the sample.
  • Average degree:
    Graph metric. Average total number of connections of each nucleotide variant. It shows the complexity and connectivity of the mutational pathways.
  • Average distance:
    Graph metric. Average number of steps along the shortest pathways between each pair of nucleotide variants.
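Some of the ratios defined above can be followed on a small worked example. The toy network, its read counts and all object names below are invented for illustration. The sketch also assumes that the main nt var is not counted among the nodes of a pathway block, which the definitions above do not state explicitly.

```r
# Toy network: directed edges point from fewer to more SHMs.
edges <- data.frame(
  from = c("main", "main", "v1", "v2", "main"),
  to   = c("v1",   "v2",   "v3", "v3", "v4"),
  stringsAsFactors = FALSE
)
reads <- c(main = 1000, v1 = 50, v2 = 30, v3 = 20, v4 = 10)

# Nodes with additional SHMs: every node except the main nt var.
additional <- setdiff(names(reads), "main")

# End nodes: nodes with additional SHMs and no outgoing edge.
end_nodes <- setdiff(additional, edges$from)

# End nodes density: end nodes / nt vars with additional SHMs.
end_nodes_density <- length(end_nodes) / length(additional)

# Blocks of pathways, one per end node (listed by hand for this toy case).
blocks <- list(v3 = c("v1", "v2", "v3"), v4 = "v4")

# Pathway score: reads in the block / reads in all nodes with additional SHMs.
block_reads   <- sapply(blocks, function(nodes) sum(reads[nodes]))
pathway_score <- block_reads / sum(reads[additional])

# The block with the highest score is the most relevant pathway.
most_relevant <- names(which.max(pathway_score))

# Relative convergence: reads of that block / reads of the main nt var.
relative_convergence <- unname(block_reads[most_relevant] / reads[["main"]])

# Average degree: mean number of connections per node (2 * edges / nodes).
average_degree <- 2 * nrow(edges) / length(reads)
```

In this toy case the block ending in v3 scores 100/110 (about 0.91), the relative convergence is 100/1000 = 0.1, the end nodes density is 2/4 = 0.5 and the average degree is 2.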

Then, it is possible to choose, among the graph metrics, which one(s) to use to perform comparisons between groups of samples.

Once all the parameters have been selected, the button Start must be pressed. A bar will show how much of the analysis has been completed.

The button Reset can be used to start a new analysis, resetting the parameters [the output results will be reset when pressing the Start button].

When the analysis is finished, a notification will appear with the message ‘File conversion in progress…’. This conversion is performed to allow the visualizations to be visible in the Visualize results tab. Once it is ready, the user will be automatically redirected to the Visualize Results tab.

Visualize Results

This tab shows all the different output results and it offers the possibility of selecting them and choosing which sample to visualize. All the output results are saved locally in the user’s previously selected output folder.

Summary Calculations

For each sample, it shows the number of related clonotypes [clonotypes with the same IGV gene and very similar CDR3] considered for the analysis, the number of nucleotide variants included, the total number of sequences, the number of singletons [nucleotide variants constituted by only one sequence], number of expanded nucleotide variants [nucleotide variants constituted by more than 1 sequence], number of sequences belonging to expanded nucleotide variants, and the number of reads of the main nucleotide variant. [Example shown: sample H33].
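The quantities in this table can be illustrated with a few lines of R. The read counts below are invented, and the sketch assumes that the main nucleotide variant is the one with the most reads.

```r
# Toy read counts per nucleotide variant (invented for illustration).
reads_per_nt_var <- c(120, 5, 1, 1, 3, 1)

summary_counts <- list(
  nt_vars            = length(reads_per_nt_var),        # nucleotide variants
  total_sequences    = sum(reads_per_nt_var),           # all sequences
  singletons         = sum(reads_per_nt_var == 1),      # exactly one sequence
  expanded_nt_vars   = sum(reads_per_nt_var > 1),       # more than one sequence
  expanded_sequences = sum(reads_per_nt_var[reads_per_nt_var > 1]),
  main_nt_var_reads  = max(reads_per_nt_var)            # assumed: most reads
)
str(summary_counts)
```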

Extra Mutations Calculations

For each sample, it shows the number of nt vars with additional SHMs and the corresponding number of sequences for each given number of SHMs. The totals of nt vars and sequences are also included. [Example shown: sample H33].

Less Mutations Calculations

For each sample, it shows the number of sequences lacking SHMs of the main nt var, for each different number of SHMs. [Example shown: sample H33].

Mutations

For each sample, it provides information for all unique SHMs or combinations of SHMs of all the nt vars that are part of the connected graph network. It also shows the number of SHMs in comparison to the germline, the number of sequences with those SHMs and the mutational level to which they belong. The mutational level is “less” if they have fewer SHMs than the main nt var, “main” for the SHMs of the main nt var, and “additional” for the cases with more SHMs than the main nt var. [Example shown: sample AMRMES].
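The three mutational levels can be sketched as a simple comparison of SHM counts. This is a simplification based only on the wording above: the helper `mutational_level` is hypothetical, and it compares numbers of SHMs without checking which SHMs are shared with the main nt var.

```r
# Hypothetical helper: label nt vars by their SHM count relative to the main nt var.
mutational_level <- function(n_shm, n_shm_main) {
  ifelse(n_shm < n_shm_main, "less",
         ifelse(n_shm == n_shm_main, "main", "additional"))
}

# Toy example: the main nt var carries 5 SHMs.
levels_found <- mutational_level(c(3, 5, 6, 8), n_shm_main = 5)
print(levels_found)
```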

Amino-acid Mutations Main Variant

It provides information on the replacement SHMs in the main nt var of each sample, together with the number of sequences carrying each mutation. [Example shown: sample H33].

Global Amino-acid Mutations Main Variant

It contains all identified replacement SHMs in the main nt var of all the samples. It can be used to identify mutational patterns among samples. [Example shown: all samples from example dataset].

Amino-acid Mutations

It contains all identified replacement SHMs in the nt vars with additional SHMs [excluding the ones of the main nt var]. [Example shown: sample H33].

Global Amino-acid Mutations

It contains all identified replacement SHMs in the nt vars with additional SHMs [excluding the ones of the main nt var] for all the samples. It can be used to identify mutational patterns among samples. [Example shown: all samples from example dataset].

Graph Metrics

For each sample, it contains the germline identity %, the values of the graph metrics as well as information related to those metrics. [Example shown: sample H33].

Global Graph Metrics

It shows the graph metrics values for all the samples. If a sample has been discarded, the cause is provided. [Example shown: all samples from example dataset].

Graph Networks

For each sample, it shows the graph network. [Example shown: sample H33].

If the parameter “Separate graphs” is selected, the graph network is separated into two [nt vars with fewer SHMs than the main nt var on the left and nt vars with additional SHMs on the right]. For example [sample H33]:

Metrics comparison

If samples are classified into groups, the tool performs pairwise comparisons for all groups. This is performed independently for each of the graph metrics. [Example shown: all samples from example dataset].
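This description does not state which statistical test is used. As an illustration only, the sketch below runs pairwise Wilcoxon rank-sum tests with base R on invented values of one metric, and shows where an optional p-value adjustment would enter. The group labels, the values, the choice of test and the "BH" adjustment method are all assumptions.

```r
# Invented values of one graph metric for three groups of samples.
metric <- c(1, 2, 3, 4, 5,  11, 12, 13, 14, 15,  6, 7, 8, 9, 10)
group  <- rep(c("group_1", "group_2", "group_3"), each = 5)

# Pairwise comparisons without adjustment.
raw <- pairwise.wilcox.test(metric, group, p.adjust.method = "none")

# The same comparisons with adjusted p-values (method chosen for illustration).
adjusted <- pairwise.wilcox.test(metric, group, p.adjust.method = "BH")

# Flag comparisons below the p-value threshold (default 0.05).
p_threshold <- 0.05
significant <- raw$p.value < p_threshold
print(significant)
```

The `p.value` element is a lower-triangular matrix with one entry per pair of groups, so the same thresholding works for any number of groups.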

Discarded Samples

It provides the names of samples that have been discarded from the analysis [e.g. samples with no connections among nt vars].

That’s all! If there is any issue, please feel free to open an issue in the GitHub repository of IgIDivA.

Thank you for using IgIDivA! Enjoy! :)