Today I will tell you about an article published in Nucleic Acids Research in December 2023, "Using language models to learn the biological significance of BCR sequences". The authors used several embedding methods to extract representations of BCR sequences, evaluated the performance of multiple embedding models, and found that most of the embedding methods effectively capture the properties and specificity of BCR sequences. For receptor-specificity prediction, the immune2vec model, which learns BCR-specific embedding representations, slightly outperformed general-purpose protein language models. These results offer useful insight for downstream antibody analysis and discovery tasks.
The B-cell receptor (BCR) is a protein structure located on the surface of B cells that plays a key role in the immune system. The immune system's main task is to identify and respond to pathogens inside and outside the body, such as bacteria, viruses, and other pathogenic microorganisms, and the BCR plays an important role in this process.
Overall, the function of the BCR is to initiate a specific defense response against pathogens. Through the diversity and specificity of BCRs, the immune system can recognize and fight many different types of pathogens, protecting the body from infection and disease. Existing NLP methods learn embedded representations of amino acids to generate sequence representations for downstream tasks. Such methods break each B-cell receptor (BCR) down into smaller units, combinations of three amino acids (3-mers), embed each unit as a fixed-length vector, and then average over the entire sequence to produce a single vector for a given BCR.
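The 3-mer averaging scheme described above can be sketched in a few lines. Note the embedding table here is filled with random vectors as a stand-in for a trained immune2vec/word2vec-style model, and the embedding dimension is a hypothetical choice:

```python
import numpy as np

EMBED_DIM = 8                     # hypothetical embedding size
rng = np.random.default_rng(0)

def three_mers(seq):
    """Split an amino-acid sequence into overlapping 3-mers."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]

def embed_sequence(seq, table):
    """Look up (or, here, randomly invent) a vector per 3-mer, then
    average them into one fixed-length vector for the whole BCR."""
    vecs = [table.setdefault(m, rng.normal(size=EMBED_DIM))
            for m in three_mers(seq)]
    return np.mean(vecs, axis=0)

table = {}                        # stand-in for a trained 3-mer embedding table
vec = embed_sequence("QVQLVQSGAEVK", table)
```

The averaging step is what makes sequences of different lengths comparable: every BCR maps to the same fixed-length vector regardless of how many 3-mers it contains.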
These methods can recognize patterns in BCR sequences, including the specific sequence features of the complementarity-determining regions (CDRs), which are crucial for the binding of the BCR to antigen and other functional properties. However, labeling biological data is expensive, and labels may be insufficient in some cases. This limits deep learning methods that require large amounts of labeled data when learning from BCR sequences.
2.1 Data collection and pre-processing
The authors collected one million full-length BCR sequences, each with a single heavy and light chain, from ten datasets. The median lengths of the heavy and light chains were 122 and 108 amino acids, respectively. The authors further used Immcantation to annotate sequences with somatic hypermutation frequency and CDR3 length. In addition, binding information for the SARS-CoV-2 spike protein was obtained for the receptor-specificity prediction task. To balance the dataset, 1000 sequences from each donor in previous COVID-19 datasets were randomly selected as negative samples for specificity prediction.
2.2 Prediction tasks
For classification tasks, the authors used a support vector machine classifier (SVC) with an RBF kernel and split the data into training, validation, and test sets. To find the optimal model parameters, a grid search was performed over the SVC regularization parameter, and the best setting was selected based on the weighted average F1 score on the validation set.
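The classification setup can be sketched with scikit-learn. The features and labels below are synthetic stand-ins for the BCR embedding vectors, and the candidate C values are illustrative:

```python
# RBF-kernel SVC with a grid search over the regularization
# parameter C, scored by weighted F1 (as described above).
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))        # stand-in embedding vectors
y = (X[:, 0] > 0).astype(int)         # toy binary labels

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1, 10]},   # illustrative grid
                    scoring="f1_weighted")
grid.fit(X_train, y_train)

best_C = grid.best_params_["C"]
test_f1 = grid.score(X_test, y_test)
```

`GridSearchCV` handles the validation split internally via cross-validation, so the held-out test set is touched only once, for the final score.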
For regression tasks, a linear model with LASSO regularization was selected; its regularization parameter was likewise tuned by grid search, and model performance was evaluated by RMSE and correlation on the validation set.
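The regression setup follows the same pattern. Again the data is a synthetic stand-in, and the alpha grid is an illustrative assumption:

```python
# LASSO linear regression with a grid search over the regularization
# strength, evaluated by RMSE and correlation on held-out data.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 16))                        # stand-in embeddings
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)   # toy continuous target

grid = GridSearchCV(Lasso(),
                    param_grid={"alpha": [0.01, 0.1, 1.0]},
                    scoring="neg_root_mean_squared_error")
grid.fit(X[:150], y[:150])

pred = grid.predict(X[150:])
rmse = float(np.sqrt(np.mean((pred - y[150:]) ** 2)))
corr = float(np.corrcoef(pred, y[150:])[0, 1])
```

Reporting both RMSE and correlation is useful because RMSE measures absolute error while correlation measures how well the predictions track the trend.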
3.1 Evaluating receptor-specificity prediction tasks
The authors extracted BCR embedding representations from different models (ESM2, ProtT5, AntiBERTy, etc.) and analyzed their performance on classification and regression tasks. Figure 1A illustrates how the different embedding models encode a BCR amino-acid sequence into a fixed vector, which supervised machine learning models then use to predict heavy- or light-chain properties or receptor specificity. The embedding representations were also applied to the receptor-specificity prediction task (Figure 1B), with cross-validation used to select the optimal model parameters.
Fig. 1 Predicting BCR sequence properties and receptor specificity from amino-acid embeddings.
3.2 The importance of embedding representations for SARS-CoV-2 specificity prediction
The authors evaluated the effect of different BCR embedding representations on predicting receptor specificity for the SARS-CoV-2 spike protein. First, BCR sequences with binding information for the SARS-CoV-2 wild-type spike protein were retrieved from the coronavirus antibody database (CoV-AbDab), and 1000 sequences were randomly selected from each donor as non-binders. In total, 15,538 sequences were used to evaluate each embedding method's ability to predict coronavirus spike-protein specificity.
The authors' immune2vec model learns a specific embedding representation for each sequence; UMAP visualizations of these embeddings for different sequence inputs are shown in Figure 2a.
Figure 2b shows boxplots of F1 scores from five-fold cross-validation on the receptor-specificity prediction task. Previous studies of BCR specificity prediction have typically focused on the CDR3 region of the heavy chain. Thanks to the advent of single-cell technology, structure from regions beyond CDR3 can now be incorporated, yielding more reliable specificity predictions. Notably, when full-length sequences were used, the BCR-specific language models outperformed the general-purpose protein language models ESM2 and ProtT5. To understand the effect of immune2vec's latent dimension size on the receptor-specificity task, the authors ran corresponding experiments (Figure 2c) and found that as the dimension increased, performance first rose and then declined slightly. Likewise, for shorter sequences carrying less information, the performance drop at higher dimensions was more pronounced.
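The five-fold cross-validation behind Figure 2b can be sketched as follows: each fold yields one weighted F1 score, and the spread of those five scores is what the boxplot summarizes. The data here is synthetic, standing in for the embedding features:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(250, 16))                          # stand-in embeddings
y = (X[:, 0] + 0.1 * rng.normal(size=250) > 0).astype(int)

# Five folds -> five weighted-F1 scores; a boxplot of `scores`
# (one box per embedding model) reproduces the style of Fig. 2b.
scores = cross_val_score(SVC(kernel="rbf"), X, y, cv=5,
                         scoring="f1_weighted")
```

Comparing the full score distributions, rather than a single number, makes it easier to judge whether one embedding model reliably beats another.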
Figure 2 Model performance on receptor-specificity prediction tasks using BCR embeddings.
The authors used the immune2vec model to predict BCR sequence properties and specificity for the SARS-CoV-2 spike protein, and further tested how well different models learn BCR sequence embedding representations. In terms of model architecture, although all methods encode some sequence properties and specificity, embeddings based on protein language models, which learn amino-acid representations from sequence context, performed better. In addition, in the sequence-property task, the immune2vec model with a higher latent dimension learned more from the sequence and performed better; however, this advantage did not carry over to specificity prediction.
In general, language models outperform traditional amino-acid encodings. For SARS-CoV-2 spike-protein specificity prediction, models such as immune2vec and AntiBERTy are somewhat superior to general-purpose protein language models, and combining full-length heavy- and light-chain sequences can improve specificity prediction performance. This provides a useful perspective on applying BCR embeddings to downstream tasks.
References
Ostrovsky-Berman M., Frankel B., Polak P., Yaari G. Immune2vec: embedding B/T cell receptor sequences in ℝⁿ using natural language processing. Front. Immunol. 2021; 12:680687.