Editor | Violet
Data-driven deep learning algorithms can accurately predict high-level quantum chemical properties of molecules. However, their inputs must be relaxed at the same quantum chemical level of geometry optimization as the training dataset, which limits their flexibility. Switching to cheaper conformation generation methods introduces a domain shift that reduces prediction accuracy.
Recently, researchers from Seoul National University in South Korea proposed a domain-adaptation method based on deep contrastive learning, called local atomic environment contrastive learning (LACL). LACL learns to mitigate the difference in distribution between two geometric conformations by comparing conformations produced by different generation methods.
The researchers found that LACL forms a domain-independent latent space that encapsulates the semantics of each atom's local environment. LACL achieves quantum chemical accuracy while avoiding the geometric relaxation bottleneck, enabling future applications such as inverse molecular engineering and large-scale screening. The method can also be generalized from small organic molecules to long-chain biological and pharmacological molecules.
The study, titled "Deep Contrastive Learning of Molecular Conformation for Efficient Property Prediction," was published in Nature Computational Science on December 4, 2023.
Machine learning-based optimization methods, such as reinforcement learning, active learning, and deep generative models, have attracted research interest in inverse materials design and drug discovery. In these applications, graph neural networks (GNNs) have become a popular and successful way to quickly predict the quantum chemical properties of unknown molecules at low computational cost.
To train machine learning models effectively, high-quality datasets have been published, such as QM9, which consists of about 134,000 small organic molecules.
In large-scale inference scenarios such as high-throughput screening, preparing the input molecular geometry with density functional theory (DFT) is time-consuming and costly to converge, and it becomes the bottleneck for using the trained model. Conformations calculated with the computationally efficient Merck Molecular Force Field (MMFF) optimization method, or produced by an ML-based conformation generation model, can be considered as an alternative. In that case, however, the ML model suffers from domain shift, because its inputs deviate from the distribution of the DFT-computed training data it learned from.
Figure: Comparison of the molecular property prediction approach of previous methods with the LACL method.
In this study, the researchers introduced a local atomic environment representation learning model (LACL) based on deep contrastive learning, designed specifically to solve the domain shift problem in molecular data. LACL captures the similarity between molecular geometries obtained with a computationally efficient relaxation method and the corresponding DFT geometries. In this way, LACL exploits the full potential of quantum chemistry data and bypasses the computational bottleneck of relaxing geometries from scratch.
The study uses the QM9 and QMUGS molecular property prediction benchmarks to validate the domain adaptability of the model. LACL accurately predicts molecular properties from low-fidelity geometries, reducing computational cost and inference time while maintaining quantum chemical accuracy.
Here, the researchers define a geometric domain as the statistical distribution of the geometric conformations of molecules, including the interatomic distances and triplet angles, produced by a given method. In this study, conformations computed with ab initio methods, which carry the prior knowledge contained in existing benchmark data, are treated as the source domain. Conformations obtained from computationally efficient force fields or from machine learning-based conformation generation models are treated as target domains. The main goal is to bridge the gap between the source and target domains, so that the model can generalize what it learns in the source domain and make accurate predictions in the target domain despite the domain shift.
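The notion of a geometric domain can be made concrete with a small sketch. The code below computes the two descriptors named above, interatomic distances and triplet angles, for two conformations of the same molecule and reports how far they differ. The coordinates are hypothetical water-like values chosen for illustration, not real DFT or MMFF output, and the function names are assumptions rather than anything from the study.

```python
import math
from itertools import combinations


def pairwise_distances(coords):
    """Return every interatomic distance for a list of (x, y, z) positions."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]


def triplet_angle(a, center, b):
    """Return the angle a-center-b in degrees."""
    v1 = [p - q for p, q in zip(a, center)]
    v2 = [p - q for p, q in zip(b, center)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.hypot(*v1) * math.hypot(*v2)
    cos = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cos))


def mean_abs_shift(source, target):
    """Mean absolute difference between two matched descriptor lists."""
    return sum(abs(s - t) for s, t in zip(source, target)) / len(source)


# Hypothetical geometries (O, H, H) in angstroms, standing in for a
# source-domain (DFT-like) and a target-domain (force-field-like) conformer.
source_geom = [(0.0, 0.0, 0.0), (0.96, 0.0, 0.0), (-0.24, 0.93, 0.0)]
target_geom = [(0.0, 0.0, 0.0), (0.99, 0.0, 0.0), (-0.33, 0.93, 0.0)]

distance_shift = mean_abs_shift(
    pairwise_distances(source_geom), pairwise_distances(target_geom)
)
angle_shift = abs(
    triplet_angle(source_geom[1], source_geom[0], source_geom[2])
    - triplet_angle(target_geom[1], target_geom[0], target_geom[2])
)
print(f"mean distance shift: {distance_shift:.3f} A, angle shift: {angle_shift:.2f} deg")
```

Collected over a whole dataset, such per-molecule differences describe the gap between the source and target distributions that a domain-adaptation method has to bridge.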
Figure: Overview of the LACL model.
To capture the subtle differences between the two geometric domains, three-body interactions are modeled explicitly by modifying the Atomistic Line Graph Neural Network (ALIGNN) model, which uses a line graph framework. The contrastive learning method compares augmentations of the local atomic environments represented by the nodes, rather than augmentations of the molecule as a whole. LACL is built on the BGRL framework, which does not require negative samples. This is an advantage because the edge features of molecular line graphs take up a lot of memory. LACL is trained end to end across the whole pipeline, minimizing the BGRL loss and the target property loss together to prevent representation collapse. This training strategy provides an effective way to learn molecular graph representations that capture the characteristics of a molecule from different views.
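The joint objective described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the bootstrap term uses a cosine-similarity form in the spirit of BGRL (written as 2 − 2·cos so that it is zero when the two views agree), the property term is a mean absolute error, and the function names and the `weight` parameter are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def bootstrap_loss(online_preds, target_embs):
    """Node-level alignment loss between two views; uses no negative samples.

    Each entry is the embedding of one atom (node). The loss is 0 when the
    online prediction points in the same direction as the target embedding.
    """
    per_node = [
        2.0 - 2.0 * cosine_similarity(p, t)
        for p, t in zip(online_preds, target_embs)
    ]
    return sum(per_node) / len(per_node)


def property_loss(predicted, reference):
    """Mean absolute error on the target molecular property."""
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)


def total_loss(online_preds, target_embs, predicted, reference, weight=1.0):
    """Sum of the alignment term and the supervised term (weight is assumed)."""
    return bootstrap_loss(online_preds, target_embs) + weight * property_loss(
        predicted, reference
    )


# Hypothetical two-atom example: one view from each geometric domain.
online = [(1.0, 0.0), (0.0, 1.0)]
target = [(1.0, 0.0), (1.0, 0.0)]
loss = total_loss(online, target, predicted=[0.5], reference=[0.7])
```

The supervised term matters here: an alignment-only objective can be driven down by mapping every atom to the same vector, whereas the property loss penalizes embeddings that no longer carry predictive information.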
LACL demonstrated that it can use information from the DFT geometric domain to improve predictions on conformations from the MMFF geometric domain. The improvement is meaningful because it shows that quantum chemical accuracy (an error below 1 kcal mol⁻¹) can be reached with MMFF-level relaxation alone, without additional optimization. These results offer a way to choose the conformation generation method with the best trade-off between accuracy and computational efficiency.
The researchers also evaluated how well LACL generalizes to open and compact conformers. Even allowing for the small number of molecules tested, the results agree well with the trend observed on the earlier set of 1,706 test molecules, and overall LACL shows excellent predictive performance. Of particular note is its robust performance on open conformers, which were obtained by manipulating the raw data. This quantitative experiment suggests that the search for domain-independent representations may be extended to more complex systems, such as proteins and peptides.
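As a small worked example of the 1 kcal mol⁻¹ criterion, the sketch below converts a mean absolute error from eV to kcal mol⁻¹ and checks it against the threshold. It assumes the property values are energies reported in eV; the numbers are hypothetical and are not results from the study.

```python
KCAL_PER_MOL_PER_EV = 23.0605  # 1 eV is approximately 23.0605 kcal/mol


def mae_kcal_per_mol(predicted_ev, reference_ev):
    """Mean absolute error of energies given in eV, returned in kcal/mol."""
    errors = [abs(p - r) for p, r in zip(predicted_ev, reference_ev)]
    return KCAL_PER_MOL_PER_EV * sum(errors) / len(errors)


def within_chemical_accuracy(mae, threshold=1.0):
    """True when the error is below the 1 kcal/mol threshold."""
    return mae < threshold


# Hypothetical predictions and reference energies in eV.
predicted = [-1.00, -2.03]
reference = [-1.01, -2.00]

mae = mae_kcal_per_mol(predicted, reference)
print(f"MAE = {mae:.3f} kcal/mol, chemically accurate: {within_chemical_accuracy(mae)}")
```

In these units the threshold corresponds to roughly 43 meV, which is why errors of a few hundredths of an eV still count as chemically accurate.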
Figure: Study of LACL performance in open and compact conformations.
To investigate what the learned local atomic environments (that is, the node-level embeddings) mean, the researchers used t-SNE to visualize the relationships between these environments in two-dimensional space. The results show that the local atomic environment depends less on the atomic number of the atom, and that atoms with similar structural characteristics form clusters, rather than grouping according to the properties of the molecules themselves.
Figure: LACL learning curves for the QMUGS20 dataset.
For calculating ground-state quantum chemical properties, LACL can be a viable way to minimize the additional optimization of complex molecular geometries. The rapid development of generative AI has led to the emergence of generative models for molecular conformations. However, matching the data distribution of ab initio conformations, such as those from DFT, remains a major challenge, which highlights the importance of domain adaptation strategies. This study opens the way to rapid and accurate prediction of quantum chemical properties.