Will protein structure search, powered by prediction tools like AlphaFold, replace protein sequence search with BLAST? A research team from the Technical University of Dresden discusses the prospects for remote homology detection using structure search and argues that protein BLAST, as the leading sequence search tool, should strive to incorporate structural information.
BLAST is widely used in molecular biology to search nucleotide and protein sequences. Thirty years after the launch of BLAST, structure prediction has seen a major breakthrough with the arrival of tools like RoseTTAFold and AlphaFold.
As a result, each protein sequence in the primary sequence database now comes with a 3D fold model. While this does not affect (non-coding) nucleotide sequences, it raises the question of whether searching 3D protein structures will replace searching protein sequences. Is protein BLAST a thing of the past?
While BLAST search is a powerful tool for inferring function, it has limits. Sequences can diverge significantly and still fold into similar 3D structures that perform the same or similar functions.
Different sequences, same structure
Examples of such protein pairs can be found among the adhesion molecules of algae and bacteria, specifically the diatom adhesion protein CaTrailin4 and the bacterial ice-binding protein FfIBP. The pair has no sequence similarity detectable by BLAST (e-value 0.30, where e-values > 0.001 are not considered significant).
In fact, even more sophisticated sequence-based tools, such as HHblits, cannot establish a relationship. However, the predicted structure of CaTrailin4 and the known structure of FfIBP are very similar, as both adopt a topologically distinctive helical fold, which consists of two units held by a helix-binding protein.
Figure 1: FfIBP (A) and CaTrailin4 (B), as well as RAD52 (D) and Redβ (E), have poor e-values of approximately 0.3.
This structural similarity can be measured by the Template Modeling score (TM-score), which combines RMSD (root-mean-square deviation) and alignment length into a single interpretable score. A TM-score greater than 0.5 means that the two structures likely adopt the same fold and may be evolutionarily related. CaTrailin4 and FfIBP have a TM-score of 0.6 (above the 0.5 cut-off). Structural comparison can thus reveal this striking similarity, which remains elusive for BLAST and other sequence-based tools such as HHblits.
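To make the score more concrete, here is a minimal Python sketch of the standard TM-score formula for a single, already fixed superposition. It assumes the distances between aligned residue pairs are given; real tools such as TM-align additionally search for the superposition that maximizes the score, which this sketch does not attempt. The function names are illustrative, and short chains, which need special handling, are simply rejected.

```python
def tm_d0(target_length: int) -> float:
    """Length-dependent distance scale d0 (in angstroms) used by the TM-score."""
    if target_length <= 21:
        # Simplification of this sketch: very short chains need special handling.
        raise ValueError("this sketch only supports chains longer than 21 residues")
    return 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8


def tm_score(distances, target_length: int) -> float:
    """TM-score for one fixed superposition.

    distances: distances (angstroms) between the aligned residue pairs.
    target_length: number of residues in the reference protein.
    """
    d0 = tm_d0(target_length)
    # Each aligned pair contributes between 0 and 1; unaligned residues contribute 0.
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length
```

A perfect superposition of all residues yields 1.0, and aligning only half of a 100-residue reference perfectly yields 0.5. Because the sum is divided by the reference length rather than the alignment length, a short but precise alignment cannot reach a high score on its own.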
Another example involves DNA recombination, a fundamental process in replication in which single-strand annealing proteins (SSAPs) play a central role. For more than two decades, there has been a controversial discussion about whether RecT/Redβ, ERF, and RAD52 form three distinct superfamilies or just one. The former view is supported by sequence analysis, which shows no significant similarity between RecT/Redβ, ERF, and RAD52. In fact, RAD52 and Redβ have no similarity detectable by BLAST (e-value 0.38).
Considering structure changes the situation. The Al-Fatlawi team compared representative structures of RecT/Redβ, ERF, and RAD52, and the results showed that, despite the lack of sequence similarity, these structures share a core structural element. This element is central to oligomerization, as it gives rise to ring and helical structures, respectively. As a result, it is highly conserved across RecT/Redβ, ERF, and RAD52 and can be detected by structural similarity (TM-score of 0.5), despite the lack of any sequence similarity (see Figure 1D-F).
Structure prediction to the rescue
These examples suggest that AlphaFold may be able to step in where BLAST cannot find significant similarities. This raises the question of how to do so systematically. To this end, tools such as Foldseek, DALI, and 3D-AF-Surfer have emerged, which use autoencoders, distance matrix alignment, and dedicated fingerprints, respectively, to scan and compare structures.
Although these tools already exist, they need to become broader and simpler to compete with BLAST searches on sequence databases. Synergy is needed to integrate them into the classic BLAST sequence search. Recently, a study compared next-best BLAST hits with next-best structure hits, and first steps in this direction have been taken by running nearest-neighbor searches on machine-learned embeddings of sequences.
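As a rough illustration of nearest-neighbor search over embeddings, the following Python sketch ranks database entries by cosine similarity to a query vector. The protein names and three-dimensional vectors are hypothetical toy data; embeddings from real models have far more dimensions, and large databases call for approximate indexing rather than this exhaustive scan.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_neighbors(query, database, k=2):
    """Return the names of the k database embeddings most similar to the query."""
    ranked = sorted(
        database.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical toy embeddings, for illustration only.
toy_database = {
    "protein_a": [0.9, 0.1, 0.0],
    "protein_b": [0.8, 0.2, 0.1],
    "protein_c": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]
hits = nearest_neighbors(query, toy_database, k=2)
```

Here `hits` contains `protein_a` followed by `protein_b`, since their vectors point in nearly the same direction as the query, while `protein_c` is almost orthogonal to it.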
To explore the potential of such an enhanced tool, the researchers wanted to understand how membership in the same superfamily relates to sequence and structural similarity. To this end, they obtained 11,211 domains with superfamily annotations from the SCOPe database. These form 62,278,380 domain pairs, of which 225,931 (0.36%) belong to the same superfamily and can therefore be considered homologs.
How many of these homologous pairs can be found directly by sequence and by structure, respectively? At an e-value cut-off of 0.001, BLAST recovered 16,300 of the 225,931 pairs (7%). Widening the threshold to 1 increases that number to 25,634 (11%). But even at e-values < 10, recovery does not exceed 15%. These numbers improve greatly with more sensitive sequence-based methods such as hidden Markov models. In fact, HHblits retrieved 175,682 pairs (78%) under optimal conditions, which is even better than the 164,468 pairs (73%) found by structural comparison (TM-score > 0.5).
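The percentages quoted here are recall values: the share of the 225,931 same-superfamily pairs that each method recovers. The short Python sketch below recomputes them from the counts given in the text; the dictionary labels are only illustrative.

```python
HOMOLOG_PAIRS = 225_931  # same-superfamily pairs reported in the text

# Recovered same-superfamily pairs per method, as reported in the text.
recovered = {
    "BLAST, e-value < 0.001": 16_300,
    "BLAST, e-value < 1": 25_634,
    "HHblits": 175_682,
    "Structure, TM-score > 0.5": 164_468,
}


def recall_percent(found, total=HOMOLOG_PAIRS):
    """Share of true homolog pairs that a method recovered, in percent."""
    return 100.0 * found / total


for method, found in recovered.items():
    print(f"{method}: {recall_percent(found):.0f}%")
```

Run as written, this prints 7%, 11%, 78%, and 73%, matching the figures above.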
But what about the 62,052,449 pairs that do not belong to the same superfamily? Of these pairs, BLAST reports 0, 9,053, and 72,329 at e-values below 0.001, 1, and 10, respectively. HHblits flags 25% of these pairs, whereas false detection by structural alignment stays below 2%. HHblits reached an AUC of 77% and structural comparison 95%, compared to 44% for BLAST. A higher AUC indicates that a classifier is better at assigning higher scores to proteins in the correct superfamily than to proteins in other superfamilies.
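One way to read the AUC is as the probability that a randomly chosen same-superfamily pair receives a higher score than a randomly chosen pair from different superfamilies. The Python sketch below computes it that way by brute force. The scores are hypothetical toy values, not data from the study, and the quadratic loop is only practical for small inputs.

```python
def auc(positive_scores, negative_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win, which matches the area under the ROC curve.
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical similarity scores, for illustration only.
same_superfamily = [0.9, 0.7, 0.6]
different_superfamily = [0.8, 0.4, 0.3, 0.2]
toy_auc = auc(same_superfamily, different_superfamily)
```

In this toy case, 10 of the 12 possible positive-negative comparisons are ranked correctly, so `toy_auc` is 10/12, or about 83%. A value of 50% corresponds to random ranking.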
Although an AUC of 95% for structural comparison is encouraging, the availability of high-quality structures can be a limitation. An estimated 30% of eukaryotic proteins contain disordered regions of 50 or more consecutive amino acids, which are expected to be of poor quality in predicted 3D structures. These regions are amenable to sequence searches with BLAST, but not to direct structure searches.
To assess how such percentages scale to the entire AlphaFold database, the researchers calculated the average confidence score for all AlphaFold structures. They found that 80% of the AlphaFold structures had a pLDDT confidence score of 70 or higher, meaning they can be considered well modeled, with an overall good backbone prediction. This means that a large amount of structural data of sufficient quality is available.
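As a sketch of how such an average confidence score could be computed for one model: AlphaFold models in PDB format store the per-residue pLDDT in the B-factor column, so averaging that column over the C-alpha atoms gives one value per residue. The Python code below assumes fixed-width PDB text as input (mmCIF files would need a different parser); `toy_atom_line` is a hypothetical helper that only builds test records with placeholder coordinates.

```python
def mean_plddt(pdb_text: str) -> float:
    """Average pLDDT over the C-alpha atoms of a PDB-format AlphaFold model."""
    values = []
    for line in pdb_text.splitlines():
        # Atom name sits in columns 13-16, the B-factor (pLDDT) in columns 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            values.append(float(line[60:66]))
    if not values:
        raise ValueError("no C-alpha atoms found")
    return sum(values) / len(values)


def toy_atom_line(serial, atom, residue_number, plddt):
    """Build a fixed-width PDB ATOM record with placeholder coordinates."""
    return (
        f"ATOM  {serial:5d} {atom:^4s} ALA A{residue_number:4d}    "
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{plddt:6.2f}"
    )


# Hypothetical two-residue model: one confident residue, one low-confidence residue.
toy_model = "\n".join([
    toy_atom_line(1, "N", 1, 90.0),
    toy_atom_line(2, "CA", 1, 90.0),
    toy_atom_line(3, "CA", 2, 50.0),
])
average = mean_plddt(toy_model)
```

For the toy model, `average` is 70.0, which would just meet the threshold of 70 mentioned above. Counting only C-alpha atoms keeps residues with many atoms from dominating the mean.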
BLAST, Things to Come
BLAST perfectly meets many of the needs of biomedical researchers, such as the detection of variants and closely related sequences. However, the specific problem of remote homology detection is difficult for pure sequence searches.
Here, structure can go further than sequence. The researchers examined the relationship between sequence and structural similarity through an illustrative analysis of millions of domain pairs. All in all, the analysis showed that BLAST with strict e-values is very precise in finding homologs, but not comprehensive. Hidden Markov models are more sensitive, but have limited specificity. Structure balances these two extremes. If a BLAST search incorporated structural data, it could expand the set of hits to include proteins with similar predicted structures that may be candidate homologs, without compromising the quality of the results.
How structural data can be integrated into sequence search remains unclear, but one approach that seems feasible is to use structural data not directly, but indirectly through so-called embeddings: intermediate sequence representations generated by the neural networks that underlie structure prediction.
However, homology detection based on embeddings and structural data will only help transform molecular biology if it is available in an easy-to-use form and widely adopted by the community. Prominent institutions such as NCBI, EBI, and RIKEN should now strive to adopt the fast structure search implemented in Foldseek, or use embeddings to extend the classic BLAST-based protein sequence search, so that protein BLAST remains relevant in the future.