Publication Details

ID: 44

The Natural Products Magnetic Resonance Database (NP-MRD) for 2025.

Authors

Wishart DS; Sajed T; Pin M; Poynton EF; Goel B; Lee BL; Guo AC; Saha S; Sayeeda Z; Han S; Berjanskii M; Peters H; Oler E; Gautam V; Jordan T; Kim J; Ledingham B; Tretter ZM; Koller JT; Shreffler HA; Stillwell LR; Jystad AM; Govind N; Bade JL; Sumner LW; Linington RG; Cort JR

Journal/Conference

Nucleic Acids Research, Vol. 53 (D1), pp. D700-D708

Abstract

The Natural Products Magnetic Resonance Database (NP-MRD; https://np-mrd.org) is a comprehensive, freely accessible, web-based resource for the deposition, distribution, extraction, and retrieval of nuclear magnetic resonance (NMR) data on natural products (NPs). The NP-MRD was initially established to support compound de-replication and data dissemination for the NP community. However, that community has now grown to include many users from the metabolomics, microbiomics, foodomics, and nutrition science fields. Indeed, since its launch in 2022, the NP-MRD has expanded enormously in size, scope, and popularity. The current version of NP-MRD now contains nearly 7x more compounds (281 859 versus 40 908) and 7x more NMR spectra (5.5 million versus 817 278) than the first release. More specifically, an additional 4.6 million predicted spectra and another 11 000 spectra simulated from experimental chemical shifts were deposited into the database. Likewise, the number of NMR raw spectral data depositions has grown from 165 spectra per year to >10 000 per year. As a result of this expansion, the number of monthly webpage views has grown from 55 to 20 000 and the number of monthly visitors has increased from 7 to 2500. To address this growth and to better support the expanding needs of its diverse community of users, many additional improvements to the NP-MRD have been made. These include significant enhancements to the data submission process, notable updates to the database's spectral search utilities and useful additions to support better NMR spectral analysis/prediction. Significant efforts have also been undertaken to remediate and update many of NP-MRD's database entries. This manuscript describes these database improvements and expansion efforts, along with how they have been implemented and what future upgrades to the NP-MRD are planned.

Publication Info

  • Year: 2025
  • Publication Date: Nov. 22, 2024
  • Citations: 4
  • Source: Google Scholar

Full Text

Most medicinal chemists and plant chemists define a natural product (NP) as a secondary metabolite that is isolated from natural sources and produced by the pathways of secondary metabolism. However, many other fields of chemistry, biochemistry and life science have a much broader or more inclusive definition of NPs. At the Natural Products Magnetic Resonance Database (NP-MRD), we define an NP as any organic molecule (typically <5000 Da) that is fully or partially produced by living organisms. This includes any small or mid-size molecule generated and/or metabolized by living organisms, from bacteria, to fungi, to plants, to invertebrates to vertebrates (including humans). Using this more inclusive definition elevates NPs to a much higher level of economic, environmental and social importance. Indeed, NPs are the source of the bulk of organic matter on earth, serving as the main ingredients of the soil we stand on, the food we put in our mouths, and the trees that tower above us (

The isolation and characterization of NP structures has been of central interest to chemists and biochemists for >200 years (

The NP-MRD was initially established to provide a central repository for knowledge about NPs, to aid in archiving NMR data about NPs, to support NP dereplication, and to facilitate structure elucidation of novel NPs for the NP community (

Growth of the NP-MRD. Panel (A) shows that >190 countries have accessed the NP-MRD. Panels (B)–(D) show the growth in registered users, web visits, and unique visitors per month.

To address this growth and to better support the changing needs of its diverse community of users, many important improvements to the NP-MRD have been made. These include major improvements to the NP-MRD’s data submission process that now allows users to deposit NMR data in a few seconds using an intelligent drag and drop data deposition system (

The first description of the NP-MRD, along with details regarding all the website’s functions and menu options, appeared in 2022 (

Interestingly, <1500 molecules (<4%) in the NP-MRD’s first release had full NMR spectra with complete assignment data. Usually, these assignments were made at only one NMR spectrometer frequency. Therefore, to make the NP-MRD dataset more useful for other magnetic fields, curators took the reported chemical shift assignments for all compounds, including those from the JEOL CH-NMR-NP, HMDB and the BMRB, and generated simulated NMR spectra at 10 different magnetic field strengths (from 100 to 1000 MHz, in 100 MHz steps for

For the 2025 release, the NP-MRD curation team significantly expanded the number of compounds in the NP-MRD by including all non-lipid compounds in the latest version of HMDB (

After the compound annotations were completed, the chemical shifts for newly added NP-MRD structures were predicted using the latest version of the PROSPRE (

In addition to these ML-predicted NMR spectra, the NP-MRD curation team has also been calculating and compiling quantum-mechanically derived chemical shifts using density functional theory (DFT) methods implemented in the ISiCLE package (

Since 2022, the NP-MRD curation team (via backfilling) has provided experimental NMR assignments for another 594 compounds with 524 sets of

Comparison of the content statistics between the first release of NP-MRD and the latest release of NP-MRD

The distinction between experimentally measured NMR spectra (usually collected in a single solvent at a single magnetic field strength), simulated spectra (which use experimental chemical shift data and experimental J-coupling data to generate NMR spectra at multiple field strengths) and predicted spectra (which use ML methods to predict chemical shifts, J-couplings and NMR spectra at multiple field strengths) is important to note. Some compounds in the NP-MRD have all three types of spectra (experimental, simulated and predicted), others only have two types of spectra (experimental and simulated), and still others only have predicted NMR spectra. The last category covers the vast majority of compounds in the NP-MRD. Within the NP-MRD, each type of spectrum is appropriately labeled and each type is selectable (individually or in bulk). The most accurate and useful spectra are obviously experimental NMR spectra. The least accurate are predicted NMR spectra. However, given the enormous improvement in NMR spectral prediction accuracy achieved over the last few years, even predicted NMR spectra are now very useful for compound identification, dereplication and characterization. Indeed, predicted chemical shifts and predicted NMR spectra are certainly far more useful than having no data at all. In some cases, they may even be more useful than published chemical shift assignments, which may contain typographic errors or misassignments (up to 5% of published assignments in our experience).
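The reason simulated spectra are generated at several field strengths can be illustrated with a short sketch. Chemical shifts are field-independent when expressed in ppm, whereas J-couplings are fixed in Hz, so the apparent splitting in ppm shrinks as the spectrometer frequency rises. The example below is a first-order approximation only, not the NP-MRD's simulation method; the function name and the example values (a doublet at 3.50 ppm with a 7.0 Hz coupling) are hypothetical and chosen purely for illustration.

```python
def doublet_positions_ppm(shift_ppm, j_hz, field_mhz):
    """Return the two line positions (in ppm) of a first-order doublet.

    A coupling of j_hz splits the signal by +/- j_hz / 2 around the shift.
    Dividing Hz by the spectrometer frequency in MHz gives ppm.
    """
    half_split_ppm = (j_hz / 2.0) / field_mhz
    return (shift_ppm - half_split_ppm, shift_ppm + half_split_ppm)


# Ten spectrometer frequencies: 100 to 1000 MHz in 100 MHz steps.
fields_mhz = range(100, 1001, 100)

for field in fields_mhz:
    # Illustrative values only: 3.50 ppm shift, 7.0 Hz coupling.
    low, high = doublet_positions_ppm(shift_ppm=3.50, j_hz=7.0, field_mhz=field)
    print(f"{field:4d} MHz: {low:.4f} ppm, {high:.4f} ppm")
```

At 100 MHz the two lines sit 0.07 ppm apart, while at 1000 MHz they are only 0.007 ppm apart, which is why a spectrum recorded at one field cannot simply be overlaid on one recorded at another.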

The NP-MRD is an archival database. This means that it is designed to accept external (and internal), user-deposited data. When the NP-MRD was first launched, two parallel paths for data submission or data deposition were undertaken: one was called literature backfilling or retrospective data entry (to be done by NP-MRD curators) and the other was called prospective data entry (to be done by NP-MRD users in the NP community). The goal of the literature backfilling process was to fill the NP-MRD with thousands of previously published NMR assignment datasets of well-known or well-studied NPs. The data backfilling process involved a manual literature review conducted by trained NP-MRD curators to identify novel NPs with appropriate NMR data and then manually enter the NMR data into the NP-MRD using a specially developed NP-MRD data deposition system. On the other hand, the goal of the prospective data entry process was to capture NMR data of newly determined or newly published NPs as they appeared in the literature. The prospective data entry was to be performed by registered NP-MRD users via the same specially developed NP-MRD data deposition interface.

Unfortunately, several problems became apparent when the two systems were launched in 2022. Email-based efforts to identify or encourage members of the NP community to submit their newly published data to the NP-MRD received low response rates. Likewise, a manual review of past NP literature to perform back-filling tasks proved to be much slower and much more difficult than expected. Additionally, user feedback regarding the specially designed NP-MRD deposition system indicated that the system was too slow (taking >30 min for a typical deposition), required considerable manual input (especially with regard to chemical shift assignment entry) and was prone to frequent user errors. That deposition system was also limited in its ability to allow users to embargo release dates or support other common deposition/release requests. These problems led to a significant redesign of both the retrospective backfilling and prospective data entry process.

This redesign led to, first, an improved semi-automated literature tracking system and user reminder system to facilitate backfilling. Second, a faster, easier, ‘smarter’ drag-and-drop deposition system was developed to make data deposition painless, easy and fast (the so-called ‘carrot’ approach in the carrot-and-stick motivation theory). Third, arrangements with several journals (including the

More specifically, the revamped NP-MRD deposition system now includes: (i) automated literature tracking with natural language processing (NLP) tools (based on ML) to identify new papers describing new NPs; (ii) NLP to extract key data (compound names, source organisms, etc.) from article titles and abstracts; (iii) an automated email system that sends personally curated data requests and deposition links for all new NP articles; (iv) a flexible data deposition framework that accepts data from published articles, pre-submissions and private repositories; (v) extensive quality control and standardization tools to ensure that deposited data is correct, complete, and well standardized for uploading to the NP-MRD; (vi) data conversion tools for Bruker 1D NMR data and open data formats (nmrML, NMReData and JCAMP-DX); (vii) security features to protect against the distribution of malware; (viii) embargo management to allow depositors to control the release date for deposited data, and provide private links that can be shared with reviewers; (ix) a unified single exchange format (nmrML) for all data types (raw data, assignment data, peak lists, calculated chemical shift data, etc.) with standardization and validation tools to simplify data ingestion to NP-MRD; (x) end-to-end tracking of all deposited data, and detailed submission reports to administrators and users to provide NP-MRD accession numbers and highlight any issues with data ingestion; (xi) extensive tools for metadata extraction and validation from raw data, reducing data entry requirements for end users and improving data accuracy; (xii) linked secure account management between the deposition system and the NP-MRD database; (xiii) interface harmonization with the NP-MRD website; and (xiv) automated assignment of digital object identifiers (DOIs) for compounds with user-submitted NMR data. A flow diagram depicting the main components of the new data deposition/submission system is shown in Figure
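As an illustration of the embargo management described in item (viii), the following minimal sketch shows one way a release-date check could be expressed. It is not the NP-MRD's implementation: the `Deposition` class, its field names and the accession string are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Deposition:
    """Hypothetical record for one deposited dataset."""
    accession: str                        # placeholder identifier, not a real format
    embargo_until: Optional[date] = None  # None means no embargo was requested

    def is_public(self, today: date) -> bool:
        """A deposition is public once any embargo date has been reached."""
        return self.embargo_until is None or today >= self.embargo_until


dep = Deposition("EXAMPLE-0001", embargo_until=date(2025, 6, 1))
print(dep.is_public(date(2025, 5, 31)))  # False: still under embargo
print(dep.is_public(date(2025, 6, 1)))   # True: embargo date reached
```

Keeping the release date on the record itself means a private reviewer link and the public listing can consult the same field, with only the access check differing.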

A flow diagram depicting the two components of the NP-MRD’s new data submission system, including the literature tracking and reminder system (top panel) and the data deposition and integration system (bottom panel).

The new deposition platform not only supports all of these tasks and workflows, it has also dramatically shortened deposition times from >30 min per compound to <5 min per compound. The consequences of these changes to the deposition system have been quite dramatic. Indeed, the number of NMR spectral depositions has grown from 160 spectra per year in 2022 to >10 000 per year for 2024 (so far).

The new NP-MRD data deposition system supports the deposition of 1D (

Screenshots of the new NP-MRD data deposition interface with panels highlighting some of the key submission tools, including the (

The nearly 10-fold growth in the size of the NP-MRD has obviously led to much more useful NP data being available, browsable or searchable within the NP-MRD website. However, it has also led to significant challenges regarding database searching and querying, especially with regard to performance. Not only has the number of structures grown substantially (by hundreds of thousands), the number of chemical shifts in the NP-MRD has grown even more (by millions). The total number of searchable entities in the NP-MRD is now in the tens of millions. Given that many searches are performed over a range of values or categories (chemical shifts, masses, formulas and biological origins), the number of search combinations quickly becomes overwhelming and searches, therefore, become incredibly slow.

To address these issues, several database redesign efforts were undertaken. First, the database was reorganized to support rapid look-ups through the creation of composite indices that indexed on frequently searched columns, such as spectral type, NMR-detected nucleus and spectrometer frequency. Second, each chemical shift in the database was converted into integer bins that were closest to the nearest 0.01 ppm for
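The integer-binning idea can be sketched as follows. This is an in-memory analogue under stated assumptions, not the NP-MRD's database schema: the 0.01 ppm bin width is taken from the text, while the function names, the 0.02 ppm tolerance and the compound identifiers are hypothetical. Converting each shift to an integer bin turns a floating-point range query into a handful of exact-match look-ups.

```python
from collections import defaultdict

BIN_WIDTH_PPM = 0.01  # bin width quoted in the text; other nuclei may use another width


def to_bin(shift_ppm, width=BIN_WIDTH_PPM):
    """Map a chemical shift (ppm) to the integer index of its nearest bin."""
    return round(shift_ppm / width)


def build_index(compounds):
    """Build {bin index: set of compound ids} from {compound id: [shifts in ppm]}."""
    index = defaultdict(set)
    for compound_id, shifts in compounds.items():
        for shift in shifts:
            index[to_bin(shift)].add(compound_id)
    return index


def search(index, query_ppm, tolerance_ppm=0.02):
    """Return ids of compounds with a shift binned within the tolerance window."""
    centre = to_bin(query_ppm)
    span = round(tolerance_ppm / BIN_WIDTH_PPM)
    hits = set()
    for b in range(centre - span, centre + span + 1):
        hits |= index.get(b, set())  # exact integer look-up per bin
    return hits


# Hypothetical compounds and shifts, for illustration only.
compounds = {"cpd_A": [1.25, 3.48, 7.26], "cpd_B": [1.27, 4.10], "cpd_C": [2.05]}
index = build_index(compounds)
print(sorted(search(index, 1.26)))  # ['cpd_A', 'cpd_B']
```

In a relational database the same effect would typically come from storing the bin as an integer column and including it in an index alongside frequently filtered columns such as nucleus and spectrometer frequency.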

Furthermore, because users may wish to perform different types of chemical shift or spectral searches, different search functions were also made available to reduce both the complexity and scope of each search. Some individuals may have only partial chemical shift sets, a few or even a single chemical shift and just want a list of potential matching/similar compounds. Other individuals may have a complete list of chemical shifts – for a single nucleus type (say

The implementation of these database, architecture, hardware and programmatic changes allowed search times to drop from 2–3 min per search to under 10 s per search against experimental NMR chemical shifts. These changes have also made the database more robust to continued expansion and to the expected growth over the coming 5 years.

Since the first release of the NP-MRD, a number of new and improved spectral utilities have been added to the database. These utilities are intended to provide functions for NP-MRD users to facilitate spectral assignment, structure determination, compound dereplication and spectral generation or modeling. These include a dedicated

NP-MRD was developed using Ruby 2.5.1 and the Ruby on Rails web framework (

The core information stored within NP-MRD is converted into user-visible web pages through NP-MRD’s HTML interface responder. The dedicated server hosting the NP-MRD website runs the Ubuntu 20.04.6 LTS OS on 8 CPUs with a maximal speed of 3.3 GHz, 64 GB of RAM and 400 GB of disk space, ensuring good scalability and rapid searching. The NP-MRD website is connected to the NP-MRD database server, which stores all the information and has similar specifications, except that it has a 5 TB hard drive. Regular backups (once a week) and stringent protocols safeguard data integrity on both the website and database servers.

All data internally uploaded to the NP-MRD have been vetted and validated by multiple curators. Likewise, all members of the NP-MRD curation team were required to have at least an undergraduate degree in chemistry, biochemistry or NP chemistry. To monitor the (internal) data entry process, all newly added data are entered into a centralized, password-controlled database, allowing all changes and edits to the database to be monitored, time-stamped and automatically transferred. All curation team members were also given extensive training by the lead curator(s) in NP-MRD annotation via hands-on mentoring, text instructions, peer support and tutorials. All data externally uploaded (by users) are vetted through a series of automated data checking routines. Users are notified of potential errors or missing information by email and/or real-time, online warnings. Additional annotations and chemical shift assignments are checked manually. This process is aided by an internally developed ‘Curator App’, which helps to accelerate or semi-automate the process.

The NP-MRD has embraced DOIs as part of its data submission ecosystem. A DOI is a unique alphanumeric string assigned to digital objects, such as research papers, datasets or NP-MRD data submissions. DOIs provide a persistent link to their location on the internet. Unlike URLs, which can change, DOIs offer permanent and reliable access, ensuring consistent retrieval of content. Some of the other advantages of DOIs lie in the fact that they simplify citation and linking in scientific writing by providing a standardized reference format and are associated with rich metadata (e.g. author, title and publication date), enhancing discoverability in databases and search engines. Additionally, DOIs enable tracking of citations, downloads and other usage metrics. DOIs are now being assigned to NP-cards with all user-submitted experimental data. Because of their widespread use in academic publishing, NP-MRD users can use these DOIs to support their manuscript submissions. In total, 3070 DOIs have been issued at the time of this writing.
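The persistence property described above comes from resolving a DOI through the public `https://doi.org/` proxy rather than linking to a hosting URL directly. The sketch below shows how a DOI string can be checked and turned into a resolver link. The regular expression covers only the common modern DOI shape, and the example DOI is a made-up placeholder, not one issued by the NP-MRD.

```python
import re
from urllib.parse import quote

# Common modern DOI shape: "10.", a 4-9 digit registrant code, "/", then a suffix.
# This is a simplification; some older DOIs do not fit this pattern.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi: str) -> str:
    """Return the doi.org resolver URL for a DOI, or raise ValueError."""
    doi = doi.strip()
    if not DOI_PATTERN.match(doi):
        raise ValueError(f"not a recognisable DOI: {doi!r}")
    # Percent-encode characters that are unsafe in a URL path, keeping "/".
    return "https://doi.org/" + quote(doi, safe="/")


# Placeholder DOI for illustration only.
print(doi_to_url("10.1234/example.5678"))  # https://doi.org/10.1234/example.5678
```

If the hosting site later moves a record, only the registered target of the DOI needs updating; citations that use the resolver link continue to work.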

The NP-MRD is FAIR compliant (

As highlighted here, the NP-MRD has undergone significant growth (by nearly a factor of 10) in the past 3 years. Even more dramatic growth has been seen in the number of user-generated data depositions, with a nearly 100-fold increase. This has even been exceeded by the number of page views, which has seen a nearly 300-fold increase. This growth has led to a number of unexpected challenges. However, as outlined in this manuscript, they have been successfully addressed, leading to the creation of a much more user-friendly, faster, more accurate, and more resilient NP data resource. The decision to include both primary and secondary metabolites within the NP-MRD has made the NP-MRD more broadly appealing. Likewise, the decision to draw from many well-known and well-regarded databases to obtain more diverse chemical data has led to a much richer, more complete data resource. The decision to address data deposition bottlenecks with a more intelligent, more intuitive system has led to a number of improvements. Data deposition has now been greatly simplified, which has played a large role in the significant increase in external user depositions. Data backfilling has been greatly improved and accelerated, which has increased the volume of experimental assignment data available in the NP-MRD. Data search speeds and search capabilities have been significantly enhanced, making the database far more useful for structure discovery and matching. A number of key NMR utilities and spectral predictors have also been enhanced, all of which are improving the look, content, user-friendliness and utility of the database. However, much still remains to be done.

In addition to these database coordination activities, some of the other planned improvements to the NP-MRD include developing much more sophisticated automation for database operation and data deposition. This automation is needed for the NP-MRD to efficiently scale up its operation as rates of deposition and retrieval increase. This will include the development of automated processes and APIs for improved querying and data quality review to allow team members to easily update and correct database entries. New support will also be added for handling and depositing large NP datasets. Additionally, we will implement automated peak assignment for deposited NMR data using modifications to the newly developed chemical shift and J-coupling predictors. Greater use of ML, chemical ontologies [ChemFOnt (

Another key development will be a focus on adding interoperability to the NP-MRD with other spectral and NP databases. Indeed, the NP-MRD is not the only archival NMR database available. Other NMR data deposition and analysis platforms, with somewhat different mandates and goals, are available to the NMR community, such as NMRShiftDB (