VDJ_build()
function
Af_build()
function
Af_plot_tree()
function
Af_compare_methods()
function
Af_compare_within_repertoires()
function
Af_compare_across_repertoires()
function
This vignette provides a detailed explanation of the main functions
from the AntibodyForests package, including the VDJ_build()
function from the Platypus package.
These functions are designed for importing single-cell V(D)J sequencing
data from 10x Genomics, constructing lineage trees, comparing the
structure of these trees within and across repertoires, and
incorporating features of protein structure and evolutionary likelihood
from Protein Language Models. Throughout this vignette, the dataset
outlined by Neumeier
et al. (2022) and Kim et al. (2022)
will be used.
This quick start gives a short use case of AntibodyForests. 10x output of five mouse immunized with Ovalbumin (OVA) from Neumeier et al. (2022) are used to create a VDJ dataframe with Platypus. AntibodyForests is used to create lineage trees for each B cell clonotype, these trees are then clustered and the difference in topology between the clusters is analyzed.
# Import 10x Genomics output files into VDJ dataframe and only keep cells with one VDJ and one VJ transcript
# For each clone, trim germline sequence and replace CDR3 region in germline with most-frequently observed CDR3 sequence
VDJ_OVA <- Platypus::VDJ_build(VDJ.directory = "10x_output/VDJ/",
remove.divergent.cells = TRUE,
complete.cells.only = TRUE,
trim.germlines = TRUE)
# Build lineage trees for all clones present in the VDJ dataframe with the default algorithm
AntibodyForests_OVA_default <- Af_build(VDJ = VDJ_OVA,
construction.method = "phylo.network.default")
# Plot one of the lineage trees, constructed with the `phylo.network.default' method
Af_plot_tree(AntibodyForests_object = AntibodyForests_OVA_default,
sample = "S1",
clonotype = "clonotype2")
# Cluster the trees in the repertoire based on the euclidean distance between various metrics of each tree
out <- Af_compare_within_repertoires(input = AntibodyForests_OVA_default,
min.nodes = 5,
distance.method = "euclidean",
distance.metrics = c("mean.depth", "sackin.index", "spectral.density"),
clustering.method = "mediods",
visualization.methods = "PCA",
plot.label = T)
# Analyze the difference between the clusters based on the calculated sackin.index
plots <- Af_cluster_metrics(input = AntibodyForests_OVA_default,
clusters = out$clustering,
metrics = c("spectral.density"),
min.nodes = 5,
significance = T)
VDJ_build()
functionThe VDJ_build()
function imports Cell
Ranger output into an R dataframe, priming the data for downstream
analyses. It is optimized for Cell Ranger v7 and older versions and is
suitable for B and T cell repertoires. Upon execution, the function
generates a VDJ dataframe where VDJ and VJ transcripts are paired (if
present) in each row, with each row representing a single cell. Post VDJ
dataframe construction, a table is displayed on the console, delineating
the number of filtered cells, contingent upon the
remove.divergent.cells
and complete.cells.only
parameters.
For seamless integration with functions such as the
Af_build()
function (discussed next), it is strongly
recommended to set the remove.divergent.cells
and
complete.cells.only
parameters to TRUE. Cells with more
than one VDJ or VJ transcript, referred to as ‘divergent cells’,
typically represent doublets that may arise during the single-cell
sorting, or show dual expression of kappa and lambda light chains. Such
occurrences are rare, accounting for only 0.2-0.5% of peripheral blood B
cells, as reported by Giachino et
al. (1995). Cells harboring both a single VDJ and a VJ transcript
may be included for comprehensive downstream analyses.
The output of 10x Genomics contains one output folder per
run/sample. In each sample folder, the following files should be present
in order to run the VDJ_build()
function
successfully.
filtered_contig_annotations.csv
filtered_contig.fasta
file. These annotations
are required to obtain the trimmed V(D)J sequences.filtered_contig.fasta
filtered_contig_annotations.csv
.consensus_annotations.csv
consensus.fasta
file.consensus.fasta
concat_ref.fasta
trim.and.align
parameter).all_contig_annotations.json
file will be employed to obtain
the ‘L-REGION+V-REGION’ and ‘J-REGION’ annotations. The raw sequences
can now be trimmed using the ‘L-REGION+V-REGION’ start and ‘J-REGION’
end indices. If the all_contig_annotations.json
file is not
found in the sample/run directory, the columns with the trimmed
sequences will remain empty and a warning message is printed.VDJ.directory
VDJ.sample.list
parameter
should not be specified)."/path/to/VDJ/directory"
VDJ.sample.list
VDJ.directory
parameter should not be
specified).c("/path/to/sample1", "/path/to/sample2")
remove.divergent.cells
TRUE
, cells with more
than one VDJ transcript or more than one VJ transcript will be excluded
from the VDJ dataframe.FALSE
complete.cells.only
TRUE
, only cells with
both a VDJ transcript and a VJ transcript are included in the VDJ
dataframe.FALSE
trim.germlines
TRUE
, the raw germline
sequence of each clone will be aligned with the consensus sequence of
that clone serving as the reference sequence, utilizing the
Biostrings::pairwiseAlignment()
function. The alignment
conducted will be of the ‘global-local’ type, therebyy trimming the raw
germline sequence.FALSE
fill.germline.CDR3
Biostrings::pairwiseAlignment()
function. The
alignment conducted will be of the ‘global-local’ type, thereby
selecting the CDR3 region from the trimmed germline sequence. After the
alignment, the germline CDR3 sequence is replaced by the most frequently
observed CDR3 sequence, in order to obtain germline sequences that are
more likely to encode producible and productive antibodies.FALSE
gap.opening.cost
Inf
will result in a gapless alignment.10
gap.extension.cost
Inf
will result in a gapless alignment.4
In this example, divergent cells are filtered out, and only cells with both a VDJ and VJ transcript are included in the VDJ dataframe. The resulting VDJ dataframe contains 17,891 cells with both a single VDJ and VJ transcript. The number of cells filtered out due to the lack of VDJ and VJ transcripts, referred to as ‘incomplete cells’, is shown in the right column of the printed table. Since a cell can have a double VDJ or VJ transcript or may lack a VDJ or VJ transcript, such a cell can be included in both the counts of ‘divergent cells’ and ‘incomplete cells’. Note that, as a result, the sum of these counts may not exactly match the difference in the number of rows between the dataframes. Additionally, for each clone, the germline sequences are trimmed, and an additional germline sequence column is added in which the CDR3 region is replaced by the most frequently observed CDR3 sequence in that clone.
# Read in data, keep only complete, non-divergent cells and trim germline sequences
VDJ_OVA <- Platypus::VDJ_build(VDJ.directory = "10x_output/VDJ/",
remove.divergent.cells = TRUE,
complete.cells.only = TRUE,
trim.germlines = TRUE,
fill.germline.CDR3 = TRUE)
divergent.cells | incomplete.cells | |
---|---|---|
S1 | 360 | 708 |
S2 | 295 | 503 |
S3 | 79 | 387 |
S4 | 183 | 380 |
S5 | 526 | 1113 |
For illustration, the dummy VDJ dataframe below contains a few columns of the first 10 rows of S1:
sample_id | barcode | isotype | clonotype_id | clonotype_frequency | VDJ_vgene | VDJ_jgene |
---|---|---|---|---|---|---|
S1 | AAACCTGTCGTTTGCC | IgM | clonotype1 | 432 | IGHV3-1 | IGHJ3 |
S1 | AAAGCAAAGCAGGCTA | IgM | clonotype1 | 432 | IGHV3-1 | IGHJ3 |
S1 | AAAGTAGCAGCTCCGA | IgM | clonotype1 | 432 | IGHV3-1 | IGHJ3 |
S1 | AAAGTAGGTTAAGGGC | IgM | clonotype1 | 432 | IGHV3-1 | IGHJ3 |
S1 | AAAGTAGTCAACGCTA | IgM | clonotype1 | 432 | IGHV3-1 | IGHJ3 |
S1 | AACGTTGCAGACTCGC | IgM | clonotype1 | 432 | IGHV3-1 | IGHJ3 |
S1 | AACTCAGAGAGAGCTC | IgM | clonotype1 | 432 | IGHV3-1 | IGHJ3 |
S1 | AACTCAGTCAGGTTCA | IgM | clonotype1 | 432 | IGHV3-1 | IGHJ3 |
S1 | AACTCAGTCCGTTGCT | IgM | clonotype1 | 432 | IGHV3-1 | IGHJ3 |
S1 | AACTCTTGTTTACTCT | IgM | clonotype1 | 432 | IGHV3-1 | IGHJ3 |
Af_build()
functionThe Af_build()
function infers B/T cell
evolutionary networks for all clonotypes in a VDJ dataframe, thereby
providing insights into the evolutionary relationships between BCR/TCR
sequences from each clonotype. Upon execution, the function generates an
AntibodyForests object containing lineage trees, sequences, and other
specified features for downstream analyses. The resulting object can be
used for visualization, comparison, and further analysis of B cell
repertoires.
For all clonotypes, the unique (combination of) sequences are
selected, which will be nodes in the lineage tree. Until now, five
algorithms have been implemented to construct B cell lineage trees from
this list of sequences. Three construction algorithms are based on a
pairwise distance matrix: ‘phylo.network.default’, ‘phylo.network.mst’,
and ‘phylo.tree.nj’. If the string.dist.metric
is
specified, the distance matrices are calculated using the
stringdist::stringdistmatrix()
function; if the
dna.model
or aa.model
is specified, the
distance matrices are calculated using the ape::dist.dna()
or phangorn::dist.ml()
function, respectively. Two
construction algorithms are based on a multiple sequence alignment
(msa): ‘phylo.tree.mp’ and ‘phylo.tree.ml’. The msa is created with the
msa::msa()
function. The five algorithms are explained in
the description of the construction.method
parameter
below.
In addition, there is also the option to import pre-constructed
lineage trees from an IgPhyML output file into an AntibodyForests
object, which enables to comparison of trees created with different
construction methods on the same sequencing data in a repertoire-wide
manner. IgPhyML is a command-line tool that infers maximum likelihood
trees using a phylogenetic codon substitution model specific for
antibody lineages, as explained by Hoehn et al.
(2017). Some functions have been included to create the appropriate
input files from IgPhyML from 10x Genomics outputs, by running the
IgBLAST tool and integrating the output with the VDJ dataframe. These
function are discussed in the supplementary of this vignette.
VDJ
VDJ_build()
function or a dataframe that contain the
columns specified in the sequence.columns
,
germline.columns
, and node.features
column.VDJ_df
sequence.columns
c("VDJ_sequence_aa_trimmed", "VJ_sequence_aa_trimmed")
c("VDJ_sequence_nt_trimmed", "VJ_sequence_nt_trimmed")
germline.columns
c("VDJ_germline_aa_trimmed", "VJ_germline_aa_trimmed")
c("VDJ_germline_nt_trimmed", "VJ_germline_nt_trimmed")
concatenate.sequences
TRUE
, sequences from
multiple sequence columns are concatenated into one sequence for single
distance matrix calculations/multiple sequence alignments. If
FALSE
, a distance matrix is calculated/multiple sequence
alignment is performed for each sequence column separately.FALSE
node.features
c("VDJ_vgene", "VDJ_dgene", "VDJ_jgene", "isotype")
"isotype"
(if present)string.dist.metric
stringdist::stringdistmatrix()
function. This parameter is
applicable only when a pairwise DNA or protein distance matrix will be
calculated and when the dna.model
and aa.model
parameters are not specified. Options include:"lv"
(Levenshtein distance / edit
distance)"dl"
(Damerau-Levenshtein
distance)"osa"
(Optimal String Alignment
distance)"hamming"
(Hamming distance)"qgram"
(Q-gram distance)"cosine"
(Cosine distance)"jaccard"
(Jaccard distance)"jw"
(Jaro-Winkler distance)"lv"
dna.model
construction.method
parameter, an
evolutionary DNA model can be specified that is to be used during the
pairewise distance calculation. Currently, the following models are
included: ‘raw’, ‘N’, ‘TS’, ‘TV’, ‘JC69’, ‘K80’, ‘F81’, ‘K81’, ‘F84’,
‘BH87’, ‘T92’, ‘TN93’, ‘GG95’, ‘logdet’, ‘paralin’, ‘indel’, and
‘indelblock’. The pairwise-distance matrix is calculated using the
ape::dist.dna()
function. When the
construction.method
parameter is set to
phylo.tree.ml
, a nucleotide subsitution model can be
specified that is to be use during the maximum likelihood
tree-inference. Currently, the following nucleotide substitution models
are included: ‘JC’, ‘F81’, ‘K80’, ‘HKY’, ‘TrNe’, ‘TrN’, ‘TPM1’, ‘K81’,
‘TPM1u’, ‘TPM2’, ‘TPM2u’, ‘TPM3’, ‘TPM3u’, ‘TIM1e’, ‘TIM1’, ‘TIM2e’,
‘TIM2’, ‘TIM3e’, ‘TIM3’, ‘TVMe’, ‘TVM’, ‘SYM’, and ‘GTR’. When set to
"all"
, all available nucleotide substitution models will be
tested using the phangorn::modelTest()
function, after
which the result is given to the phangorn::pml_bb()
function for the maximum likelihood tree inference."all"
construction.method
parameter)
and "all"
(when the construction.method
parameter is set to “phylo.tree.ml”).aa.model
construction.method
parameter, an
evolutionary protein model can be specified that is to be used during
the pairwise distance calculation. Currently, the following models are
included: “WAG”, “JTT”, “LG”, “Dayhoff”, “VT”, “Dayhoff_DCMut” and
“JTT-DCMut”. The pairwise-distance matrix is calculated using the
phangorn::dist.ml()
function. When the
construction.method
parameter is set to ‘phylo.tree.ml’, an
amino acid subsitution model can be specified that is to be use during
the maximum likelihood tree-inference. Currently, the following amino
acid substitution models are included: “WAG”, “JTT”, “LG”, “Dayhoff”,
“VT”, “Dayhoff_DCMut” and “JTT-DCMut”. When set to "all"
,
all available nucleotide substitution models will be tested using the
phangorn::modelTest()
function, after which the result is
given to the phangorn::pml_bb()
function for the maximum
likelihood tree inference."all"
construction.method
parameter)
and "all"
(when the construction.method
parameter is set to “phylo.tree.ml”).codon.model
construction.method
is set to
“phylo.tree.ml” and when columns with nucleotide sequences are specified
in the sequence.columns
and germline.columns
parameters. Currently, three codon substution models are available: ‘M0’
(which assumes a non-distinct non-synonymous/synonymous ratio (ω), ‘M1a’
(that estimates two different ω classes: ω = 1 & ω < 1), and
‘M2a’ (that estimates three different ω classes: ω < 1, ω = 1,
positive selection ω > 1)."M0"
construction.method
"phylo.network.default"
"phylo.network.mst"
ape::mst()
function. It constructs networks with the
minimum sum of edge lengths/weights by iteratively adding edges to the
network in ascending order of edge weights, while ensuring that no
cycles are formed. The network is then reorganized into a
germline-rooted lineage tree."phylo.tree.nj"
ape::nj()
function. It constructs phylogenetic trees by
joining pairs of nodes with the minimum distance, resulting in a
bifurcating tree consisting of internal nodes (representing unrecovered
sequences) and terminal nodes (representing the recovered
sequences)."phylo.tree.mp"
phangorn::pratchet()
function. It constructs phylogenetic
trees by minimizing the total number of edits required to explain the
observed differences among sequences."phylo.tree.ml"
phangorn::pml_bb()
function. It constructs phylogenetic
trees by estimating the tree topology and branch lengths that maximize
the likelihood of observing the given sequence data under a specified
evolutionary model."phylo.tree.IgPhyML"
IgPhyML.output.file
parameter."phylo.network.default"
(if
the IgPhyML.output.file
parameter is not specified) and
"phylo.tree.IgPhyML"
(if the
IgPhyML.output.file
parameter is specified)IgPhyML.output.file
construction.method
is not specified or set to
"phylo.tree.IgPhyML"
."path/to/IgPhyML/output/file.tab"
resolve.ties
construction.method
is set to “phylo.network.default”. Ties
occur when an unlinked node, which is to be linked to the tree next,
shares identical distances with multiple previously linked nodes in the
lineage tree. If a vector is provided, ties will be resolved in a
hierarchical manner, following the order specified in the vector. If a
tie could not be resolved, the node is connected to all nodes, thereby
creating a cyclic structure, and a warning message is printed after
creation of the AntibodyForests object. There are seven ways to handle
ties:"min.expansion"
"max.expansion"
Selects the node(s)
with the biggest size."min.germline.dist"
"max.germline.dist"
Selects the
node(s) with the biggest string distance to the germline node."min.germline.edges"
"max.germline.edges"
"min.descendants"
"max.descendants"
"random"
Selects a random node.c("max.expansion", "close.germline.dist", "close.germline.edges", "random")
(if the construction.method
parameter is set to
"phylo.network.default"
)remove.internal.nodes
"zero.length.edges.only"
"connect.to.parent"
"minimum.length"
This algorithm
iteratively removes internal nodes by prioritizing edges with the
minimum length for deletion."minimum.cost"
"minimum.length"
option, this algorithm
iteratively removes internal nodes. However, instead of prioritizing
edges based solely on their length, it prioritizes edges that result in
the minimum increase in the sum of all edge lengths."minimum.cost"
(when the
construction.method
parameter is set to
"phylo.tree.nj"
) and "connect.to.parent"
(when
the construction.method
parameter is set to
"phylo.tree.mp"
, "phylo.tree.ml"
, or
"phylo.tree.IgPhyML"
)In the example below, the chosen construction method,
"phylo.network.default"
, employs a mst-like algorithm for
constructing evolutionary networks. Both the heavy and light chain
sequences are used to construct the tree, these sequences are not
concatenated. Additional features, such as isotype, are included
(node.features = "isotype"
) for downstream analyses. String
distances are calculated using the Levenshtein distance metric to handle
sequences with differing lengths or mutations. In the situation where a
node that is to be added shares the same distance to multiple nodes in
the tree, the tie resolution methods (resolve.ties
) are
employed in a hierarchical manner: first by maximum expansion
("max.expansion"
), followed by proximity to the germline
("close.germline.dist"
) and number of edges to the germline
("close.germline.edges"
). The "random"
option
is included to prevent cyclic structures from being created in the
lineage tree.
# Infer lineage trees for all clones using the 'phylo.network.default' construction method and using all default parameter settings
AntibodyForests_OVA_default <- Af_build(VDJ = VDJ_OVA,
sequence.columns = c("VDJ_sequence_nt_trimmed", "VJ_sequence_nt_trimmed"),
germline.columns = c("VDJ_germline_nt_trimmed", "VJ_germline_nt_trimmed"),
concatenate.sequences = FALSE,
node.features = "isotype",
construction.method = "phylo.network.default",
string.dist.metric = "lv",
resolve.ties = c("max.expansion", "close.germline.dist", "close.germline.edges", "random"),
parallel = TRUE)
In this example, the construction method employs the Maximum
Likelihood (ML) algorithm to infer evolutionary trees. The
dna.model
parameter is set to "all"
,
indicating that various DNA substitution models are considered during
tree construction to find the one that best fits the data. The
remove.internal.nodes
parameter is set to
"connect.to.parent"
, which means that during tree
construction, internal nodes will be pruned by connecting the descendant
nodes directly to the parent node.
# Infer lineage trees for all clones using the 'phylo.tree.ml' construction method and using all default parameter settings
AntibodyForests_OVA_ml <- Af_build(VDJ = VDJ_OVA,
sequence.columns = c("VDJ_sequence_nt_trimmed", "VJ_sequence_nt_trimmed"),
germline.columns = c("VDJ_germline_nt_trimmed", "VJ_germline_nt_trimmed"),
concatenate.sequences = FALSE,
construction.method = "phylo.tree.ml",
dna.model = "all",
remove.internal.nodes = "connect.to.parent",
parallel = TRUE)
In this final example, it is demonstrated how the construction method
"phylo.tree.IgPhyML"
can be used for compatibility with the
IgPhyML tool. Unlike other methods where trees are inferred, this method
imports trees directly from the specified IgPhyML output file
(IgPhyML.output.file = "data/OVA_scRNA-seq_data/VDJ/airr_rearrangement_igphyml-pass.tab"
).
The IgPhyML tool only accepts VDJ sequences as input, which is why only
the "VDJ_sequence_nt_trimmed"
and
"VDJ_germline_nt_trimmed"
columns are selected. This
ensures that the relevant DNA sequences are imported from the VDJ
dataframe into the AntibodyForests object. It is important to note that
the IgPhyML tool also employs a maximum likelihood algorithm, similar to
the "phylo.tree.ml"
method, but uses a model that is
specifically adapted to antibody sequences. This specialized model
provides more accurate phylogenetic reconstructions for antibody
sequence data.
# Infer lineage trees for all clones using the 'phylo.tree.IgPhyML' construction method and using all default parameter settings
AntibodyForests_OVA_IgPhyML <- Af_build(VDJ = VDJ_OVA,
sequence.columns = "VDJ_sequence_nt_trimmed",
germline.columns = "VDJ_germline_nt_trimmed",
construction.method = "phylo.tree.IgPhyML",
IgPhyML.output.file = "data/OVA_scRNA-seq_data/VDJ/airr_rearrangement_igphyml-pass.tab",
remove.internal.nodes = "connect.to.parent",
parallel = TRUE)
Af_plot_tree()
functionThe Af_plot_tree()
function retrieves
the igraph object from the provided AntibodyForests object for the
specified clone within the specified sample and plots the lineage tree
using the specified plotting parameters. This visualization offers
insights into sequence evolution, enabling the observation of
evolutionary processes in relation to sequence distances, clonal
expansion, and other attributes such as isotype or evolutionary
likelihood, as well as additional features appended to the VDJ
dataframe.
AntibodyForests_object
Af_build()
function.list("Default" = AntibodyForests_OVA_default, "MST" = AntibodyForests_OVA_mst, "IgPhyML" = AntibodyForests_OVA_IgPhyML)
sample
"S1"
clonotype
"clonotype1"
label.by
"name"
"size"
The number of cells per node
are displayed."none"
"name"
color.by
"celltype"
"isotype"
or
NULL
(if isotype is not present) edge.label
"original"
"none"
The lineage tree for one of the clonotypes of S1 is displayed,
constructed using the phylo.tree.ml
algorithm. This
specific example demonstrates plotting the lineage tree with the
internal nodes (show.inner.nodes = TRUE
). Nodes are labeled
with their names (label.by = "name"
), colored based on
isotype (color.by = "isotype"
). Node sizes reflect clonal
expansion (node.size = "expansion"
). This visualization
offers insights into the evolutionary relationships between the
sequences found within the clones, in relation to clonal expansion and
class-switching.
# Plot lineage tree, constructed with the `phylo.tree.ml' method, without internal nodes
Af_plot_tree(AntibodyForests_object = AntibodyForests_OVA_ml,
sample = "S1",
clonotype = "clonotype8",
show.inner.nodes = TRUE,
label.by = "name",
color.by = "isotype",
node.size = "expansion",
main.title = "Lineage tree - ML",
sub.title = "OVA - S1 - clonotype8")
The lineage tree for one of the clonotypes of S1 is displayed,
constructed using the phylo.tree.ml
algorithm. This
specific example demonstrates plotting the lineage tree without the
internal nodes (show.inner.nodes = FALSE
). Nodes are
labeled with their names (label.by = "name"
), colored based
on isotype (color.by = "isotype"
). Node sizes reflect
clonal expansion (node.size = "expansion"
). This
visualization offers insights into the evolutionary relationships
between the sequences found within the clones, in relation to clonal
expansion and class-switching.
# Plot lineage tree, constructed with the `phylo.tree.ml' method, without internal nodes
Af_plot_tree(AntibodyForests_object = AntibodyForests_OVA_ml,
sample = "S1",
clonotype = "clonotype8",
show.inner.nodes = FALSE,
label.by = "name",
color.by = "isotype",
node.size = "expansion",
main.title = "Lineage tree - ML",
sub.title = "OVA - S1 - clonotype8")
Af_compare_methods()
functionTo investigate the influence of different tree constructions methods
on the resulting tree topologies, the AntibodyForests package allows for
the comparison of trees generated using different algorithms. The
Af_compare_methods()
function can be used to compare the
tree topologies of multiple AntibodyForests objects. This function
calculates the distance between the trees of the different objects,
clusters the trees, and visualizes the results. Various distance metrics
can be used, such as the euclidean distance between the germline-to-node
distance for each node or the Generalized Branch
Length Distance (GBLD) between the trees. Visualization of the
results can be done using a heatmap, an MDS or PCA plot. The function
returns a list of the calculated distances, the clustering, and the
visualization for each clonotype. Additionally the results can be
averaged over all clonotypes.
When multiple AntibodyForests objects are created from the same VDJ
dataframe and using the same sequence columns, the nodes in the
resulting igraph objects should be the same. However, selecting
different sequence columns may result in a different number of nodes and
different node names. Additionally, the IgPhyML tool does not label the
nodes according to their size but rather by using barcodes. When
comparing tree topologies, having synchronized node names is
essential.To achieve this, the Af_sync_nodes()
function can
be used. This function takes one AntibodyForests object as a reference
and renames the nodes of all clonotypes within all samples of another
AntibodyForests object to match the reference. The function matches the
nodes using barcodes. If the barcodes of one node in the reference
object are found in multiple nodes in the subject object, the nodes in
the subject will get suffixes A, B, etc. (e.g., node4 will become node4A
and node4B). Conversely, if the barcodes of multiple nodes in the
reference object are found in only one node in the subject object, the
node in the subject object will get the concatenated names of the nodes
in the reference object (e.g., node2 and node4 will become node2+4).
input
Af_build()
function.list("Default" = AntibodyForests_OVA_default, "MST" = AntibodyForests_OVA_mst, "IgPhyML" = AntibodyForests_OVA_IgPhyML)
min.nodes
10
2
include.average
TRUE
FALSE
distance.method
euclidean
of node depths or
GBLD
.GBLD
euclidean
depth
edge.count
or
edge.length
.edge.length
edge.count
visualization.methods
heatmap
, MDS
,
PCA
or NULL
.heatmap
NULL
First, node labels need to be synchronized between the different
AntibodyForests objects. This can be done using the
Af_sync_nodes()
function. Here, the nodes
of the AntibodyForests objects created with the IgPhyML tool, MST, NJ,
and ML algorithms are synchronized with the nodes of the AntibodyForests
object created with the default algorithm.
# Synchronize the nodes of the different AntibodyForests objects
AntibodyForests_OVA_IgPhyML <- Af_sync_nodes(reference = AntibodyForests_OVA_default,
subject = AntibodyForests_OVA_IgPhyML)
AntibodyForests_OVA_mst <- Af_sync_nodes(reference = AntibodyForests_OVA_default,
subject = AntibodyForests_OVA_mst)
AntibodyForests_OVA_nj <- Af_sync_nodes(reference = AntibodyForests_OVA_default,
subject = AntibodyForests_OVA_nj)
AntibodyForests_OVA_mp <- Af_sync_nodes(reference = AntibodyForests_OVA_default,
subject = AntibodyForests_OVA_mp)
AntibodyForests_OVA_ml <- Af_sync_nodes(reference = AntibodyForests_OVA_default,
subject = AntibodyForests_OVA_ml)
Now we can calculate the distance between the trees of the different
AntibodyForests objects using the
Af_compare_methods()
function using the
Generalized Branch Length Distance.
# Calculate the euclidean distance between different trees based on the GBLD
out <- Af_compare_methods(input = list("Default" = AntibodyForests_OVA_default,
"MST" = AntibodyForests_OVA_mst,
"NJ" = AntibodyForests_OVA_nj,
"MP" = AntibodyForests_OVA_mp,
"ML" = AntibodyForests_OVA_ml,
"IgPhyML" = AntibodyForests_OVA_IgPhyML),
min.nodes = 10,
distance.method = "GBLD",
visualization.methods = "heatmap",
include.average = T)
# Plot the average heatmap
out$average$Heatmap
Af_compare_within_repertoires()
functionThe complex process of somatic hypermutation in a repertoire can be
analyzed by comparing lineage tree topologies between clonotypes to
identify different evolutionary patterns. The AntibodyForests package
provides a function,
Af_compare_within_repertoires()
, to
calculates the distance between all trees in the repertoire. This can be
the euclidean distance between various calculated metrics, or the
Jensen-Shannon distance between spectral density
profiles of trees. The function can cluster the trees and visualize
the results using a heatmap, MDS, or PCA plot. The function returns a
list containing the distance matrix, the assigned clusters, and the
visualization. The functions
Af_cluster_metrics()
and
Af_cluster_node_features()
can be used to
analyze the differences between the clusters of trees.
The function Af_metrics()
is used
internally in different AntibodyForests function, but can also be used
as standalone function to calculate metrics for each clonotype. Basic
metrics such as the average germline-to-node depth, total number of
nodes and cells, and sackin index can be calculated, as well as more
complex metrics based on the Laplacian Spectrum of the trees. These
metrics can be returned in a dataframe, or added to the AntibodyForests
object.
The functions Af_distance_boxplot()
and Af_distance_scatterplot
can be used to
observe patterns between the distance to the germline and various node
features in the AntibodyForests object.
input
Af_build()
function.AntibodyForests_OVA_default
metrics
"nr.nodes"
The total number of
nodes."nr.cells"
The total number of
cells."mean.depth"
The average number of
edges connecting each node to the germline along the shortest
path."mean.edge.length"
The average sum of
edge lengths connecting each node to the germline along the shortest
path.
"sackin.index"
The sackin index is
used to quantify the imbalance of a tree. This is calculated as the sum
of the depths of all leaves in the tree, normalized by the number of
leaves."spectral.density"
This option
constructs the spectral density profile of a tree to identify the
underlying shape of the phylogeny. This option outputs four metrics:
spectral.peakedness
for tree imbalance,
spectral.asymmetry
to quantify skewness of branching
events, spectral.principal.eigenvalue
as indicator for
diversity, and modalities
as the number of different
density profiles within this tree."group.node.depth"
The average number
of edges connecting each node to the germline along the shortest path
per group (node features). This option is used internally in
AntibodyForests function, we do not recommend used it externally."group.edge.length"
The average sum of
edge lengths connecting each node to the germline along the shortest
path per group (node features). This option is used internally in
AntibodyForests function, we do not recommend used it externally.c("mean.depth","spectral.density","mean.edge.length")
c("mean.depth", "nr.nodes")
output.format
"dataframe"
"AntibodyForests"
input
Af_build()
function.AntibodyForests_OVA_default
distance.method
euclidean
of node depths or
jensen-shannon
of spectral density profiles.jensen-shannon
euclidean
visualization.methods
heatmap
, MDS
,
PCA
or NULL
.heatmap
NULL
input
Af_build()
function.AntibodyForests_OVA_default
clusters
Af_compare_within_repertoires()
.out$clustering
significance
FALSE
TRUE
input
Af_build()
function.AntibodyForests_OVA_default
clusters
Af_compare_within_repertoires()
.out$clustering
features
c("vgene_family", "timepoint", "isotype")
fill
unique
), or to assign the most common
node feature (max
).max
unique
significance
FALSE
TRUE
AntibodyForests_object
Af_build()
function.AntibodyForests_OVA_default
distance
"node.depth"
(The sum of
edges on the shortest parth between germline and each node) or
"edge.length"
(The sum of edge lengths of the shortest path
between germline and each node)."node.depth"
"edge.length"
groups
node.features
to compare.c("IgM", "IgG")
NA
unconnected
"TRUE"
"FALSE"
As an example, we calculate the metric dataframe of an AntibodyForests object for trees with at least 10 nodes. The rownames include the sample ID, followed by the clonotype ID.
# Calculate the metric dataframe for the AntibodyForests object
metric_df <- Af_metrics(input = AntibodyForests_OVA_default,
min.nodes = 10,
metrics = c("mean.depth", "sackin.index", "spectral.density"))
We also investigate if nodes from the IgA isotype are closer to the germline in these trees than nodes from the IgG isotype. We visualize the results in a boxplot and observe no significant difference.
# Cluster the trees in the repertoire based on the euclidean distance between various metrics of each tree
Af_distance_boxplot(AntibodyForests_object = AntibodyForests_OVA_default,
node.feature = "isotype",
distance = "node.depth",
min.nodes = 8,
groups = c("IgG1", "IgA"),
text.size = 16,
unconnected = T,
significance = T)
Now we cluster the trees in this AntibodyForests object based on the euclidean distance between these metrics. We visualize the results in a PCA plot and observe three clusters.
# Cluster the trees in the repertoire based on the euclidean distance between various metrics of each tree
out <- Af_compare_within_repertoires(input = AntibodyForests_OVA_default,
min.nodes = 8,
distance.method = "euclidean",
distance.metrics = c("mean.depth", "sackin.index", "spectral.density"),
clustering.method = "mediods",
visualization.methods = "PCA")
# Plot the PCA colored on the assigned clusters
out$plots$PCA_clusters
Next, we analyze the difference between the clusters with the
functions Af_cluster_metrics()
and
Af_cluster_node_features()
. Here, we
observe that cluster 1 contains trees with a relatively high sackin
index and a low number of spectral density modalities. Cluster 2
contains trees with a low sackin index and a high number of spectral
density modalities. Cluster 3 contains trees with a low sackin index and
a low number of spectral density modalities. And cluster 3 contains
trees with a low sackin index and a high number of spectral density
modalities. Moreover, there is no significant difference in isotype
distribution between the clusters.
# Analyze the difference between the clusters based on the calculated sackin.index
plots <- Af_cluster_metrics(input = AntibodyForests_OVA_default,
clusters = out$clustering,
metrics = c("sackin.index", "spectral.density"),
min.nodes = 8,
significance = T)
plots$sackin.index
plots$modalities
# Analyze the difference between the clusters based on the calculated sackin.index
plots <- Af_cluster_node_features(input = AntibodyForests_OVA_default,
clusters = out$clustering,
features = "isotype",
fill = "max",
significance = T)
plots$isotype
Now we cluster the trees in this AntibodyForests object based on the Jensen-Shannon divergence between the Spectral Density profiles. We visualize the results in a heatmap and observe two clusters.
# Cluster the trees in the repertoire based on the Jensen-Shannon divergence between the Spectral Density profiles
out <- Af_compare_within_repertoires(input = AntibodyForests_OVA_default,
min.nodes = 8,
distance.method = "jensen-shannon",
clustering.method = "mediods",
visualization.methods = "heatmap")
# Plot the heatmap
out$plots$heatmap_clusters
Next, we analyze the difference between the clusters with the
functions Af_cluster_metrics()
and
Af_cluster_node_features()
. Here, we
observe that cluster 1 contains trees with a high spectral asymmetry and
a low number of spectral density modalities, and cluster 2 vice
versa.
# Analyze the difference between the clusters based on the calculated sackin.index
plots <- Af_cluster_metrics(input = AntibodyForests_OVA_default,
clusters = out$clustering,
metrics = "spectral.density",
min.nodes = 8,
significance = T)
plots$spectral.asymmetry
plots$modalities
Af_compare_across_repertoires()
functionThe quantify different evolutionary patterns between (groups of)
repertoires within or across individuals, the AntibodyForests package
provides a function,
Af_compare_across_repertoires()
, to
calculates tree topology metrics of trees and compares them between
repertoires. The function included different graph theory metrics such
as the edge betweenness or the degree centrality, but it also includes
all the metrics available in the
Af_metrics()
function. The function can
plot these metrics in boxplots or frequency polygons per repertoire and
calculates the statistical significance.
AntibodyForests_list
Af_build()
function.list("repertoire1" = AntibodyForests_object1, "repertoire2" = AntibodyForests_object2)
plot
boxplot
or
freqpoly
.freqpoly
boxplot
significance
TRUE
FALSE
We compare the difference in edge betweenness and degree centrality between B cell repertoires of different samples from blood and lymph nodes. We observe that the lymph node repertoire have a higher betweenness and lower mean depth in comparison to the blood repertoires, meaning more variants emerging from the same node instead of a deeper tree where variants emerge sequentially.
#Calculate the difference between the two groups of repertoires
boxplots <- Af_compare_across_repertoires(list("Blood" = AntibodyForests_example_blood,
"Lymph Nodes" = AntibodyForests_example_LN),
metrics = c("betweenness", "mean.depth"), significance = T)
#Plot the boxplots
boxplots$betweenness
boxplots$mean.depth
PLMs have demonstrated success in understanding the unique features
of antibody evolution, such as germline
gene usage and somatic hypermutation (SHM). The probabilities
resulting from PLMs can be used as an estimate for evolutionary
likelihood, both at the per-residue as the per-sequence
(pseudolikelihood) level. These likelihoods can be incorporated into
AntibodyForests to evaluate patterns in pseudolikelihood and tree
topology, and to evaluate mutations along the edges of the trees.
The pseudolikelihoods and probability matrices can be supplied
manually, or easily generated using the PLM-pipeline.
Here, a VDJ-dataframe can be used as input CSV file, and the pipeline
can generate the pseudolikelihoods and probability matrices of those
sequences using different PLMs. The resulting pseudolikelihoods can be
incoporates as node feature in the AntibodyForests-object using the
Af_add_node_feature()
function. The
pseudolikelihoods can be visualized using the
Af_plot_tree()
function, and the
correlation with distance to the germline can be visualized using the
Af_distance_scatterplot
function.
Sequences corresponding to nodes in the AntibodyForests-object can
be extracted using the Af_get_sequences()
function. This dataframe can be used as input CSV file for the PLM-pipeline to be
able to match probability matrices to the right trees and nodes.
PLM-generated probability matrices can be supplied to the function
Af_PLM_dataframe()
to generate a dataframe
with the ranks and probabilities of the mutations along the edges of the
lineage trees. These ranks and probabilties of both the mutated amino
acids and the original amino acid can be visualized using the
Af_plot_PLM()
function.
AntibodyForests_object
Af_build()
function.AntibodyForests_OVA_default
sequence.name
Af_build()
function.VDJ_sequence_aa_trimmed
path_to_probabilities
sample_id
_ clonotype_id
_ node number
(example:
S1_clonotype1_node1
), matching the AntibodyForests
object."/directory/ProbabilityMatrix"
PLM_dataframe
Af_PLM_dataframe()
function.PLM_dataframe
values
"substitution_rank"
,
"substitution_probability"
, "original_rank"
,
or "original_probability"
."original_probability"
"substitution_rank"
group.by
"sample_id"
) or all edges together
("none"
)."sample_id"
"none"
colors
group.by
is "sample_id"
, this
should be a vector of the same length as the number of samples.c("blue","red")
ggplot2
colors.Pseudolikelihoods were calculated with the PLM-pipeline for
the sequences in the VDJ dataframe. These pseudolikelihoods first have
to be added to the AntibodyForests object. Then we plot the
pseudolikelihood against the distance to the germline based on the
edge.length
.
To investigate the PLM likelihoods of somatic hypermutation, we first need to extract sequences from the AntibodyForests object. These sequences can be used as input CSV file for the PLM-pipeline.
#Create a dataframe of the sequences with a sequence ID containing the sample, clonotype, and node number
df <- Af_get_sequences(AntibodyForests_object = AntibodyForests_OVA_default,
sequence.name = "VDJ_sequence_aa_trimmed",
min.edges = 1)
#Save the data frame as CSV to serve as input for the PLM-pipeline
write.csv(df, file = "/directory/PLM_input.csv", row.names = FALSE)
The generated probability matrices can be used to create a dataframe of the ranks and probabilities of the mutations along the edges of the lineage trees.
#Create a PLM dataframe
PLM_dataframe <- Af_PLM_dataframe(AntibodyForests_object = AntibodyForests_OVA_default,
sequence.name = "VDJ_sequence_aa_trimmed",
path_to_probabilities = "/directory/ProbabilityMatrix")
Below is an example of the first 20 rows of a PLM dataframe. Each row
in this data frame represents an edge between node1
and
node2
in the lineage tree of sample
and
clonotype
. The number of mutations along this edge is given
by n_subs
. When there are multiple mutations, the average
of the ranks and probabilities are reported.
sample | clonotype | n_subs | node1 | node2 | mean_substitution_rank | mean_substitution_probability | mean_original_rank | mean_original_probability |
---|---|---|---|---|---|---|---|---|
S1 | clonotype9 | 1 | node2 | node1 | 16.000000 | 0.0002306 | 1.000000 | 0.6992657 |
S1 | clonotype9 | 1 | node2 | node14 | 11.000000 | 0.0010976 | 1.000000 | 0.8091726 |
S1 | clonotype9 | 1 | node2 | node15 | 7.000000 | 0.0019094 | 1.000000 | 0.6992657 |
S1 | clonotype9 | 1 | node16 | node17 | 7.000000 | 0.0019271 | 1.000000 | 0.7098383 |
S1 | clonotype9 | 1 | node15 | node3 | 9.000000 | 0.0007687 | 3.000000 | 0.0130641 |
S1 | clonotype9 | 1 | node15 | node5 | 19.000000 | 0.0001328 | 1.000000 | 0.8807737 |
S1 | clonotype9 | 1 | node15 | node7 | 10.000000 | 0.0050999 | 1.000000 | 0.8214574 |
S1 | clonotype9 | 1 | node1 | node9 | 3.000000 | 0.0012022 | 1.000000 | 0.9840081 |
S1 | clonotype9 | 1 | node15 | node10 | 2.000000 | 0.0327258 | 1.000000 | 0.9617180 |
S1 | clonotype9 | 1 | node1 | node11 | 11.000000 | 0.0105127 | 4.000000 | 0.0643433 |
S1 | clonotype9 | 1 | node7 | node13 | 17.000000 | 0.0006641 | 2.000000 | 0.1736919 |
S1 | clonotype9 | 2 | node1 | node12 | 4.000000 | 0.0043949 | 1.000000 | 0.9147456 |
S1 | clonotype9 | 2 | node15 | node18 | 6.500000 | 0.0002146 | 1.000000 | 0.9928794 |
S1 | clonotype9 | 1 | node12 | node4 | 10.000000 | 0.0064510 | 1.000000 | 0.8795548 |
S1 | clonotype9 | 3 | node3 | node6 | 8.666667 | 0.0016430 | 1.333333 | 0.6734619 |
S1 | clonotype15 | 1 | node3 | node9 | 16.000000 | 0.0002532 | 1.000000 | 0.9315779 |
S1 | clonotype15 | 1 | node3 | node10 | 9.000000 | 0.0000105 | 1.000000 | 0.9919440 |
S1 | clonotype15 | 2 | node3 | node8 | 4.000000 | 0.0001810 | 1.000000 | 0.9986933 |
S1 | clonotype15 | 2 | node3 | node15 | 4.000000 | 0.0002345 | 1.000000 | 0.9959176 |
S1 | clonotype15 | 1 | node15 | node4 | 13.000000 | 0.0158098 | 1.000000 | 0.2115078 |
The ranks
and probabilities
of the
original
and substitution
redisues can be
visualized in distribution plots.
AntibodyForests lineage trees reflect sequence-based variation during
affinity maturation, but the 3D structure of the antibodies and antigens
determine the actual binding. The 3D structure of antibodies can be
predicted using computational methods such as homology modelling, or the
structure can be inferred experimentally. AntibodyForests provides
functions to incorporate 3D coordinates into the lineage trees, which
allows the analysis of the evolution of the 3D structure together with
somatic hypermutation.
AntibodyForests needs as input superimposed PDB files of protein
structures corresponding to the sequences of the nodes in the trees. For
example, the structures can be inferred with AlphaFold3
and superimposed on the C-alpha backbone using the Bio.PDB
package in Python. Next to the PDB file, AntibodyForests can incorporate
the 3Di sequence of the structures, which can be obtained using mini3di. Additionally,
pKa values and free energy of antibody-antigen complexes can be
integrated in the AntibodyForests object using the output of propka.
The function VDJ_3d_properties()
can
be used to calculate various biophysical properties from the PDB files,
such as the charge, hydrophobicity, pLDDT, RMSD or 3Di-distance to the
germline, or the pKa shift and free energy. These properties can be
added to the AntibodyForests object using the
Af_add_node_feature()
function. The 3D
properties can be visualized using the
Af_plot_tree()
function, and the
correlation with distance to the germline can be visualized using the
Af_distance_scatterplot
function. The
function Af_edge_RMSD()
can be used to
calculate the RMSD between the 3D structures of the mutating sequences
along the edges of the lineage trees.
VDJ
VDJ_build()
function that was used to build the
AntibodyForests object.VDJ
pdb.dir
/path/to/pdb/
file.df
file_name
) to be used and sequence IDs (column
sequence
) corresponding to the the barcodes
in
the VDJ dataframe and AntibodyForests-object.file_df
properties
"charge"
The net electrical charge at
pH 7.0"hydrophobicity"
The sum of
hypdrophobicities of each amino acid, divided by the sequence
length."RMSD_germline"
The root mean square
deviation to the germline structure, this needs a PDB file for the
germline sequence."3di_germline"
The levenshtein
distance between the 3Di sequence of each node and the germline
sequence, this needs foldseek output."pKa_shift"
The acid dissociation
constant shift upon binding of the antibody to the antigen, this needs
Propka output."free_energy"
The free energy of
binding of the antibody to the antigen at a certain pH, this needs
Propka output."pLDDT"
The pLDDT score of the
structure prediction model.c("charge", "hydrophobicity")
free_energy_pH
5
7
sequence.region
"full.sequence"
Use the entire
sequence in the PDB file."sub.sequence"
Part of the full
sequence, for example the CDR3 region. This sub sequence must be a
column in the VDJ dataframe."binding.residues"
The binding residue
in the PDB file when providing the whole antibody-antigen complex."full.sequence"
sub.sequence.column
sequence.region
is "sub.sequence"
."VDJ_cdr3_aa"
NULL
chain
"HC+LC"
Both the heavy and light
chain."HC"
The heavy chain."LC"
The light chain."AG"
The antigen."whole.complex"
The whole
antibody-antigen complex (all available chains in the PDB file)."HC+LC"
propka.dir
"/path/to/propka/"
NULL
foldseek.dir
chain
containing the ‘A’, ‘B’, and/or ‘C’
chain."/path/to/foldseek/"
NULL
germline.pdb
20
12
Protein structures of the antibodies of one of the trees from the OVA dataset were determined with AlphaFold3, together with the Ovalbumin sequence. These PDB files were superimposed and used as input for AntibodyForests.
Calculate biophysical properties for the binding residues between the antigen and the heavy and light chain sequences of the nodes in the AntibodyForests object.
#Biophysical properties of the heavy and light chain sequence of the antibodies are calculated
vdj_structure_AbAg <- VDJ_3d_properties(VDJ = vdj,
pdb.dir = "~/path/PDBS_superimposed/",
file.df = files,
properties = c("charge",
"3di_germline",
"hydrophobicity",
"RMSD_germline",
"pKa_shift",
"free_energy",
"pLDDT"),
chain = "whole.complex",
sequence.region = "binding.residues",
propka.dir = "~/path/Propka_output/",
germline.pdb = "~/path/PDBS_superimposed/germline_5_model_0.pdb",
foldseek.dir = "~/path/3di_sequences/")
#Add the biophysical properties to the AntibodyForests object
af_structure_AbAg <- Af_add_node_feature(AntibodyForests_object = af,
feature.df = vdj_structure_AbAg,
feature.names = c("average_charge",
"3di_germline",
"average_hydrophobicity",
"RMSD_germline",
"average_pKa_shift",
"free_energy",
"average_pLDDT"))
Plot the correlation between the levenshtein sequence distance to the germline and the free energy, colored on the average pLDDT score of the structure prediction.
We investigate the change in protein structure during somatic hypermutation for the heavy and light chain sequences of the antibody.
IgPhyML is a command-line tool that utilizes a phylogenetic codon
substitution model tailored for antibody lineages to construct lineage
trees. The tool requires an AIRR-formatted data file where each sequence
is clustered into a clone, with specified columns including
sequence_id
, sequence
,
sequence_alignment
, germline_alignment
,
v_call
, d_call
, and clone_id
.
Notably, the sequence alignments must adhere to a specific IMGT numeric
scheme, ensuring that conserved amino acids from frameworks consistently
receive the same number across Ig or TR chain types, domains, and
species origins. To obtain these alignments, the IgBLAST tool can be
employed on the 10x Genomics output. Detailed instructions on converting
10x V(D)J data into the AIRR standardized format can be found in the
documentation of Change-O on this
page. The instructions below outline the installation process for
the IgPhyML and IgBLAST tools, as well as the procedure for processing
10x Genomics data using the AssignGenes.py
and
MakeDb.py
scripts.
# Create and activate environment
conda create -n igphyml
conda activate igphyml
# Install Autoconf and Automake
conda install -c conda-forge autoconf automake
# Install Change-O and Alakazam
conda install -c bioconda changeo r-alakazam
# For Linux, Install OpenMP-enabled C Compiler
conda install -c conda-forge gcc
# For MacOS, install Clang and Compilers
conda install -c conda-forge clang_osx-64 clangxx_osx-64
conda install -c conda-forge compilers make cmake
# For Windows, Install MinGW-w64 Toolchain
conda install -c conda-forge m2w64-toolchain
# Install BLAS and LAPACK libraries
conda install -c conda-forge python=3.9 blas blas-devel lapack libblas libcblas liblapack liblapacke libtmglib
# Clone IgPhyML repository within the conda environment
cd $(conda info --base)/envs/igphyml/share
git clone https://bitbucket.org/kleinstein/igphyml
# Navigate to IgPhyML direcoctory and compile IgPhyML with OpenMP support
cd igphyml
./make_phyml_omp
# Add 'igphyml/src' directory to PATH variable within the conda environment
export PATH=$PATH:$(pwd)/src
# Test IgPhyML
igphyml --version
# Download and extract IgBLAST 1.22.0
VERSION="1.22.0"
# For Linux:
wget ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/${VERSION}/ncbi-igblast-${VERSION}-x64-linux.tar.gz
tar -zxf ncbi-igblast-${VERSION}-x64-linux.tar.gz
# For MacOS:
wget ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/${VERSION}/ncbi-igblast-${VERSION}-x64-macosx.tar.gz
tar -zxf ncbi-igblast-${VERSION}-x64-macosx.tar.gz
# For Windows:
wget ftp://ftp.ncbi.nih.gov/blast/executables/igblast/release/${VERSION}/ncbi-igblast-${VERSION}-x64-win64.tar.gz
tar -zxf ncbi-igblast-${VERSION}-x64-win64.tar.gz
# Copy IgBLAST binaries to the conda environment
cp ncbi-igblast-${VERSION}/bin/* $(conda info --base)/envs/igphyml/share
# Download tools to set up IgBLAST database from the IMGT reference sequences and add to PATH
mkdir tools
cd tools
git clone https://bitbucket.org/kleinstein/immcantation
export PATH=$PATH:$(pwd)/immcantation/scripts
cd ..
# Download reference databases and setup IGDATA directory
fetch_igblastdb.sh -o $(conda info --base)/envs/igphyml/share/igblast
cp -r ncbi-igblast-${VERSION}/internal_data $(conda info --base)/envs/igphyml/share/igblast
cp -r ncbi-igblast-${VERSION}/optional_file $(conda info --base)/envs/igphyml/share/igblast
# Build IgBLAST database from IMGT reference sequences
fetch_imgtdb.sh -o $(conda info --base)/envs/igphyml/share/germlines/imgt
imgt2igblast.sh -i $(conda info --base)/envs/igphyml/share/germlines/imgt -o $(conda info --base)/envs/igphyml/share/igblast
# If the following error occurs during imgt2igblast.sh:
#
# Traceback (most recent call last):
# File "/anaconda3/envs/igphyml/share/igphyml/tools/clean_imgtdb.py", line 22, in <module>
# seq = SeqRecord(rec.seq.replace('.', '').upper(), id=name, name=name, description=name)
# AttributeError: 'Seq' object has no attribute 'replace'
#
# Add to the imports of the 'tools/immcantation/scripts/clean_imgtdb.py' script:
# from Bio.Seq import Seq
# And replace line 22 with:
# modified_seq_str = str(rec.seq).replace('.', '').upper()
# modified_seq = Seq(modified_seq_str)
# seq = SeqRecord(modified_seq, id=name, name=name, description=name)
# Execute igblastn to assign V(D)J gene annotations
AssignGenes.py igblast -s ../data/OVA_scRNA-seq_data/VDJ/S1/filtered_contig.fasta -b $(conda info --base)/envs/igphyml/share/igblast --organism mouse --loci ig --format blast
AssignGenes.py igblast -s ../data/OVA_scRNA-seq_data/VDJ/S2/filtered_contig.fasta -b $(conda info --base)/envs/igphyml/share/igblast --organism mouse --loci ig --format blast
AssignGenes.py igblast -s ../data/OVA_scRNA-seq_data/VDJ/S3/filtered_contig.fasta -b $(conda info --base)/envs/igphyml/share/igblast --organism mouse --loci ig --format blast
AssignGenes.py igblast -s ../data/OVA_scRNA-seq_data/VDJ/S4/filtered_contig.fasta -b $(conda info --base)/envs/igphyml/share/igblast --organism mouse --loci ig --format blast
AssignGenes.py igblast -s ../data/OVA_scRNA-seq_data/VDJ/S5/filtered_contig.fasta -b $(conda info --base)/envs/igphyml/share/igblast --organism mouse --loci ig --format blast
# Create database files to store sequence alignment information with the IMGT reference sequences
MakeDb.py igblast -i ../data/OVA_scRNA-seq_data/VDJ/S1/filtered_contig_igblast.fmt7 -s ../data/OVA_scRNA-seq_data/VDJ/S1/filtered_contig.fasta -r $(conda info --base)/envs/igphyml/share/germlines/imgt/mouse/vdj/imgt_mouse_*.fasta --10x ../data/OVA_scRNA-seq_data/VDJ/S1/filtered_contig_annotations.csv --extended
MakeDb.py igblast -i ../data/OVA_scRNA-seq_data/VDJ/S2/filtered_contig_igblast.fmt7 -s ../data/OVA_scRNA-seq_data/VDJ/S2/filtered_contig.fasta -r $(conda info --base)/envs/igphyml/share/germlines/imgt/mouse/vdj/imgt_mouse_*.fasta --10x ../data/OVA_scRNA-seq_data/VDJ/S2/filtered_contig_annotations.csv --extended
MakeDb.py igblast -i ../data/OVA_scRNA-seq_data/VDJ/S3/filtered_contig_igblast.fmt7 -s ../data/OVA_scRNA-seq_data/VDJ/S3/filtered_contig.fasta -r $(conda info --base)/envs/igphyml/share/germlines/imgt/mouse/vdj/imgt_mouse_*.fasta --10x ../data/OVA_scRNA-seq_data/VDJ/S3/filtered_contig_annotations.csv --extended
MakeDb.py igblast -i ../data/OVA_scRNA-seq_data/VDJ/S4/filtered_contig_igblast.fmt7 -s ../data/OVA_scRNA-seq_data/VDJ/S4/filtered_contig.fasta -r $(conda info --base)/envs/igphyml/share/germlines/imgt/mouse/vdj/imgt_mouse_*.fasta --10x ../data/OVA_scRNA-seq_data/VDJ/S4/filtered_contig_annotations.csv --extended
MakeDb.py igblast -i ../data/OVA_scRNA-seq_data/VDJ/S5/filtered_contig_igblast.fmt7 -s ../data/OVA_scRNA-seq_data/VDJ/S5/filtered_contig.fasta -r $(conda info --base)/envs/igphyml/share/germlines/imgt/mouse/vdj/imgt_mouse_*.fasta --10x ../data/OVA_scRNA-seq_data/VDJ/S5/filtered_contig_annotations.csv --extended
After parsing through each 10x Genomics V(D)J output directories, the
annotations and alignments are stored in TSV files named
filtered_contig_igblast_db-pass.tsv
. When these files
remain in the original sample directory, the annotations and alignments
from these files can be appended to the VDJ dataframe using the
VDJ_import_IgBLAST_annotations()
function. This function
allows appending annotations and alignments from all samples in one go
to create a combined dataset.
When using the VDJ_import_IgBLAST_annotations()
function, the method
parameter controls how the annotations
are incorporated into the existing dataframe. If set to
"append"
, new columns with the suffix ’_IgPhyML’ will be
added to the existing dataframe, preserving the original annotation
columns. On the other hand, if the method parameter is set to
"replace"
, the original annotation columns in the VDJ
dataframe will be replaced with the IgBLAST annotations, while retaining
the original columns with the suffix ’_10x’.
After appending the annotations and alignments to the VDJ dataframe
using the "append"
method, all the columns required by the
IgPhyML tool can be extracted. These columns can then be saved into an
AIRR-formatted TSV file using the VDJ_to_AIRR()
function,
facilitating further analysis or sharing in a standardized format.
# Append the IgBLAST annotations and alignment to the VDJ dataframe
VDJ_OVA <- VDJ_import_IgBLAST_annotations(VDJ = VDJ_OVA,
VDJ.directory = "../data/OVA_scRNA-seq_data/VDJ/",
method = "append")
# Write the VDJ dataframe into an AIRR-formatted TSV file
VDJ_to_AIRR(VDJ = VDJ_OVA,
output.file = "../data/OVA_scRNA-seq_data/VDJ/airr_rearrangement.tsv")
After generating the AIRR rearrangement, the input file can be
provided to the IgPhyML tool to construct the lineage trees for all
clones. IgPhyML is executed through the Change-O program BuildTrees by
specifying the --igphyml
option. The
--collapse
flag is utilized to merge identical sequences,
expediting calculations without affecting the likelihood calculations.
Furthermore, the --clean all
flag is employed to remove all
intermediate files generated after IgPhyML execution. The resulting
trees are saved to a TAB file named
airr_rearrangement_igphyml-pass.tab
, which can be imported
into an AntibodyForests object using the Af_build()
function.
BuildTrees.py -d ../data/OVA_scRNA-seq_data/VDJ/airr_rearrangement.tsv --collapse --igphyml --clean all
Although the package is designed for single-cell immune repertoire
data, the pipeline can also be used with bulk repertoire sequencing
data. The bulk RNA-seq data can be integrated in the single-cell
VDJ-dataframe with the function VDJ_integrate_bulk()
. This
function reannotates the germline genes of both the bulk and the
single-cell data using IgBLAST. Next, a combined VDJ-dataframe is
created of both the bulk and single-cell sequences using the IgBLAST
germlines. During this step, the FR1 regions can be trimmed from the
germline sequences and raw sequences to account for differences in
primer design between the sequencing methods. The bulk transcripts are
then merged into the existing single-cell clonotypes based on identical
CDR3 length, V and J gene usage, and CDR3 sequence similarity. The
function returns a combined VDJ-dataframe with the bulk sequences
integrated.
# Integrate bulk RNA-seq data into the VDJ dataframe, if multiple clonotypes match the bulk sequence, the bulk transcript is added to all clonotypes (tie.resolvement = "all)
VDJ_OVA <- VDJ_integrate_bulk(sc.VDJ = VDJ_OVA,
bulk.tsv = "path/to/bulk/OVA_bulkRNA.tsv",
bulk.tsv.sequence.column="VDJ_sequence_nt_raw",
bulk.tsv.sample.column="sample_id",
bulk.tsv.barcode.column="barcode",
bulk.tsv.isotype.column="isotype",
organism="mouse",
igblast.dir="/path/to/anaconda3/envs/igphyml/share/igblast",
tie.resolvement = "all")
#Create the lineage tree of the integrated VDJ dataframe
af <- Af_build(VDJ = VDJ_OVA,
sequence.columns = "VDJ_sequence_aa_trimmed",
germline.columns = "VDJ_germline_aa_trimmed",
construction.method = "phylo.network.default",
node.features = c("dataset"))
#Plot lineage tree
Af_plot_tree(af,
sample = "S1",
clonotype = "clonotype3",
color.by = "dataset",
label.by = "none",
main.title = "OVA S1",
sub.title = "clonotype3")
The AntibodyForests package originated from the Platypus
ecosystem of packages, where the fundamental function
Af_build()
was initially developed. Subsequently,
significant refinements were made, and they were migrated to this
standalone package.
Contributors:
* Alexander Yermanos
* Daphne van Ginneken
* Valentijn Tromp
* Tudor-Stefan Cotet