Publication Details
ID: 41
Accurate Prediction of (1)H NMR Chemical Shifts of Small Molecules Using Machine Learning.
Authors
Sajed T; Sayeeda Z; Lee BL; Berjanskii M; Wang F; Gautam V; Wishart DS
Journal/Conference
Metabolites Vol. 14 (5)
Abstract
NMR is widely considered the gold standard for organic compound structure determination. As such, NMR is routinely used in organic compound identification, drug metabolite characterization, natural product discovery, and the deconvolution of metabolite mixtures in biofluids (metabolomics and exposomics). In many cases, compound identification by NMR is achieved by matching measured NMR spectra to experimentally collected NMR spectral reference libraries. Unfortunately, the number of available experimental NMR reference spectra, especially for metabolomics, medical diagnostics, or drug-related studies, is quite small. This experimental gap could be filled by predicting NMR chemical shifts for known compounds using computational methods such as machine learning (ML). Here, we describe how a deep learning algorithm that is trained on a high-quality, "solvent-aware" experimental dataset can be used to predict (1)H chemical shifts more accurately than any other known method. The new program, called PROSPRE (PROton Shift PREdictor), can accurately (mean absolute error of <0.10 ppm) predict (1)H chemical shifts in water (at neutral pH), chloroform, dimethyl sulfoxide, and methanol from a user-submitted chemical structure. PROSPRE (pronounced "prosper") has also been used to predict (1)H chemical shifts for >600,000 molecules in many popular metabolomic, drug, and natural product databases.
Publication Info
- Year: 2024
- Publication Date: May 24, 2024
- Citations: 8
- Source: Google Scholar
Identifiers
- DOI: 10.3390/metabo14050290
- PubMed ID: 38786767
- ISSN: 2218-1989 (Print) 2218-1989 (Electronic)
- Google Scholar ID: T8_be82Iz5gC
PubMed Data
Additional Information
- Publication Type: Journal Article
- Language: eng
- Last PubMed Update: April 22, 2025
Full Text
NMR is ideal for determining the structure of small organic molecules, both natural and synthetic. This is because NMR spectra are characterized by sharp, well-defined peaks that can be directly associated with specific atoms within a given molecule. These peaks correspond to the chemical shifts, which can often be assigned to specific atoms or atomic groups in the molecule of interest. NMR chemical shifts, including
As a result, NMR has become routinely used in the determination of novel structures prepared via organic synthesis, in characterizing newly discovered compounds or contaminants [
The intention of these experimentally collected NMR spectral libraries is to help others more easily characterize novel compounds or characterize/quantify known compounds using NMR analysis. Specifically, by matching or partially matching measured NMR spectra to experimentally collected NMR spectral reference libraries, it is hoped that the chemical shift assignment of new compounds can be facilitated, or the identification of previously known compounds can be rapidly performed. Unfortunately, the number of available experimental NMR reference spectra for applications in NMR-based metabolomics, NMR-based medical diagnostics, or NMR-based drug-related studies is quite small. For instance, in the field of metabolomics, fewer than 1000 compounds with high-quality NMR spectra have been deposited into the HMDB [
To address this gap between measured experimental NMR data and known structural data, a number of individuals have proposed “in silico” or “reference-free” approaches to small molecule characterization [
NMR chemical shift prediction is nearly 70 years old [
Structure similarity methods use databases of structure fragments and their chemical shifts to predict
More recently, QM calculations that employ Density Functional Theory (DFT) techniques have become particularly popular [
ML-based approaches to predicting NMR chemical shifts are often 100 to 1000 times faster than QM approaches and offer similar accuracy. The first ML methods used relatively simple Artificial Neural Networks (ANNs) [
Our own experience in building experimental NMR spectral databases for HMDB, NP-MRD, and DrugBank showed that many of the training datasets used in previously published ML-based methods had significant problems with erroneous chemical shift assignments, incorrect chemical shift referencing, and a lack of appropriate accommodation for solvent effects. We hypothesized that by correcting for these database problems, the accuracy of
Accurately predicting
The training dataset consisted of 577 molecules with complete 3D structures (with attached protons) and fully assigned
Two holdout sets, not previously seen by our ML model, were used to test the performance of the different trained ML models for
The second holdout dataset consisted of 22 organic compounds that were chosen at random from the NP-MRD database. These 22 compounds had a total of 442 experimentally determined
A persistent problem with chemical shift assignments is that there is no standard or consistent way to label which atom numbers are assigned to which
To overcome these problems, we first used a program called Atom Label Assignment Tool using InChI String (ALATIS) [
After completing the structure “cleaning” and remediation process, we then manually checked all the
To train our
Our GNN was implemented utilizing Keras (version 2.3.1) [
All of the training data for our
To evaluate PROSPRE, we first assessed the improvement achieved via fine tuning of our GNN on the training set of 4027
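The abstract reports prediction accuracy as a mean absolute error (MAE) in ppm. As a minimal sketch of how such a metric is computed, the snippet below compares predicted and observed shifts atom by atom. The function name and the shift values are illustrative assumptions, not data or code from the paper.

```python
from statistics import fmean


def mean_absolute_error(predicted, observed):
    """Return the mean absolute error (in ppm) between two equal-length shift lists."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not predicted:
        raise ValueError("at least one chemical shift is required")
    # Average the absolute per-atom deviation between prediction and experiment.
    return fmean([abs(p - o) for p, o in zip(predicted, observed)])


# Hypothetical shifts (ppm) for three protons; values are made up for illustration.
observed_ppm = [7.26, 3.65, 1.21]
predicted_ppm = [7.30, 3.60, 1.25]
mae = mean_absolute_error(predicted_ppm, observed_ppm)
print(f"MAE = {mae:.3f} ppm")
```

An MAE below 0.10 ppm, the threshold quoted in the abstract, means that predictions deviate from the measured shifts by less than 0.10 ppm on average across all assigned protons.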
The high quality of PROSPRE’s
We programmed PROSPRE as a comprehensive suite to support the prediction of
To operate the PROSPRE webserver, users must provide: (1) a SMILES string or SDF file, which can be directly pasted into the MarvinJS applet (or users can draw the structure into the MarvinJS applet), (2) the type of solvent, and (3) the reference. For the type of solvent, users can choose from methanol, water, chloroform, or dimethyl sulfoxide from the dropdown menu. For the type of reference, users can choose from TMS, DSS, or TSP. After pressing the “Predict” button, the submitted structure and predicted
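The three webserver inputs described above (structure, solvent, and reference) can be modelled as a small validated record. This is only a sketch under stated assumptions: the class name, field names, and validation logic are hypothetical and do not represent PROSPRE's actual interface. Only the permitted solvent and reference values are taken from the text.

```python
from dataclasses import dataclass

# Choices listed in the text for the solvent and reference dropdown menus.
SOLVENTS = {"methanol", "water", "chloroform", "dimethyl sulfoxide"}
REFERENCES = {"TMS", "DSS", "TSP"}


@dataclass(frozen=True)
class PredictionRequest:
    """Hypothetical container for one prediction request (not PROSPRE's API)."""

    structure: str  # SMILES string or SDF file contents
    solvent: str
    reference: str

    def __post_init__(self):
        # Reject empty structures and options that the text does not list.
        if not self.structure.strip():
            raise ValueError("a SMILES string or SDF block is required")
        if self.solvent not in SOLVENTS:
            raise ValueError(f"unsupported solvent: {self.solvent!r}")
        if self.reference not in REFERENCES:
            raise ValueError(f"unsupported reference: {self.reference!r}")


# Example: ethanol (SMILES "CCO") in water, referenced to DSS.
request = PredictionRequest(structure="CCO", solvent="water", reference="DSS")
print(request)
```

Validating the options up front mirrors the dropdown menus on the webserver, where only the four solvents and three references can be selected.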
Our results demonstrated that using a carefully curated “solvent-aware” training set of experimental
To test this hypothesis, we used ClassyFire (version 1.0) [
Therefore, future efforts will be focused on accumulating
In addition to making PROSPRE freely available as an easy-to-use webserver, we have applied PROSPRE to the prediction of