Stephen Bryant graduated from the University of Virginia with a BA in 1976, having majored in Chemistry and English. He completed his PhD in protein crystallography in 1981 at Johns Hopkins University School of Medicine, in the Department of Biophysics, under Mario Amzel and Roberto Poljak. He then took up a position as Associate in the Department of Biostatistics and Director of the Academic Data Center at Johns Hopkins University School of Hygiene and Public Health, and held a postdoctoral position at Birkbeck College, University of London, from 1985 to 1986.
On completing his London postdoc, Bryant moved to the Protein Data Bank at Brookhaven National Laboratory, where he stayed until 1988. After three years as a Research Scientist at the Wadsworth Center for Laboratories and Research, in the State University of New York at Albany, he took up his present position as Senior Investigator in the Computational Biology Branch of the National Center for Biotechnology Information (NCBI) at NIH, where, in 2004, he saw the inauguration of the PubChem small-molecule repository. He talked to David Bradley about how and why PubChem was started, what it hopes to achieve, and how it is addressing some of the problems that have arisen since its inception.
What was the initial inspiration for PubChem?
PubChem was started when it was because it is part of a research program called the Molecular Libraries Roadmap Initiative, which gives grants to university researchers with the aim of discovering chemical probes through high-throughput screening of small molecules that modulate the activity of gene products. PubChem was to be the public repository for this effort, and it was modeled on the Human Genome Project and GenBank so that the results would be publicly available for the greater good of biological research. There are two main parts to it: chemical structure information and biological assay results, the latter of which some visitors to the site do not realize are there! In terms of development, the assay side is the demanding part, as each assay is different and fitting them all into a one-size-fits-all storage and retrieval system is a challenge.
The value of that information would be improved if we could combine it with other sources of information, so we take chemical structures from whoever is willing to provide them and do our best to link them to the literature and to the gene sequences and protein structures and so on.
Why do you personally feel PubChem is important?
I have worked at NCBI for just about fifteen years and have become increasingly involved with information resources, because that is the most valuable thing we can do with computers and molecular databases: make the information as accessible to researchers as we can. There was a whole world of information about the biological activities and properties of small molecules that was not included in our retrieval systems as well as it could have been. I thought it would be worthwhile to do because it would have a major impact on research.
Why shouldn’t researchers pay for this information?
Well, I’m not the one who makes all the science policy decisions, but it is interesting to look back 25 years or so to when it became technically possible to use computers to make molecular databases. The biologists made GenBank and the Protein Data Bank as public repositories, but in the chemical world, at the same time, the same technology was used to create commercial information services. So, there were two models of information access, although no one can really say why that should be. One factor may have been that, at that time, biologists didn’t think of sequences or proteins as commercial products, whereas in chemistry there was a long history of paying for information about molecules because there were obvious commercial opportunities.
When it comes to molecular libraries, the decision of Francis Collins, Director of the National Human Genome Research Institute at NIH, and colleagues was to follow the biological model of the Human Genome Project and GenBank and to make the information freely available. That said, there are classes of information, such as business information and patent abstracts, that we have no prospect of ever being able to add to the system.
An advisory group of industry representatives got together in December 2005 (fourteen from different companies, such as MDL/Elsevier). Most of them considered that PubChem could be useful to them, and a few had already started depositing structure information and backlinking to their own commercial websites, tying those two worlds together. MDL/Elsevier, for instance, is backlinking the structures it deposits to its XPharm product, which provides pharmacological reviews by target molecule and also links into the synthetic literature and the physical properties of such molecules.
That kind of initiative suggests that gathering the threads of information together could become more transparent for users. Is that the case?
That’s kind of the vision we’ve had for PubChem, in that we see it as a cross-referencing point, rather like the PubMed system, which ties searchable abstracts to the full text of the journal literature and has become a kind of crossroads. I hope that will happen with PubChem too. We’ve got about forty organizations at present that provide structure data and, in most cases, links to their systems. Conversely, the journals themselves, Nature Chemical Biology for instance, are starting to add to PubChem the molecular structures referenced in each article. This is exactly what we want: a user could search by similarity and find a paper on a similar compound.
How quickly are entries now being added to PubChem?
The number of entries has grown in jumps and spurts. For instance, in the fall of 2005, two academic groups added their large collections, ZINC and ChemDB, which are openly derived from vendor catalogs, to the system (with vendor permission). That pushed the number of substance records to over seven million; currently we have around 8.5 million substances, representing about 5.5 million unique structures. I expect that to plateau at ten million or so as vendors add their own structures directly too.
Will some structures remain beyond the reach of PubChem?
Yes, there are structures in the chemical literature that we will probably never have because an abstracting operation is not part of PubChem, and many synthetic journals, for instance, are beyond the scope of PubMed.
How many users do you have?
We just crossed the point of 10 000 unique Internet (IP) addresses accessing PubChem on a daily basis. We multiply that by a factor of three to guesstimate the number of actual individual users, roughly 30 000, because Harvard University, say, looks like one IP address to the system. That factor of three is based on a little bit of past investigation but is, strictly speaking, a guess.
What kind of feedback are you getting from users?
The NCBI helpdesk fields a fair number of questions and comments about PubChem, so we do get good feedback that reveals what aspects confuse users and whether it’s documented well enough. We tend not to hear from those who don’t have a problem, but it really helps with refining the services.
Might PubChem be able to tie in the wider world of chemical information?
Academic institutions, of course, have their own technical reports, theses, and other chemistry text sources that could be cross-referenced using the semantic web that Cambridge’s Peter Murray-Rust and his colleagues are working on. I don’t know to what extent that will come to be. What we do see is that a lot of the web “referrals” to PubChem are already coming from other sources, and to that extent the semantic web is happening already. Moreover, PubChem has used InChI strings to provide a unique identifier for each entry since the start. Users can find exact matches that way.
Are there any tricks to speeding up an InChI search on PubChem?
If you put the complete string in quotes and the word “inchi” in square brackets at the end, that will bypass all the normal search processing and find the specific InChI directly. The InChI developers at NIST have also just sent us a beta version of software that can decode an InChI and produce a structure, which means users will soon be able to feed InChI strings into our similarity search engine.
There were issues regarding the freely available PubChem competing with commercial products, such as the Chemical Abstracts Service (CAS) database. Have those been resolved?
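The query format described above can be sketched in a few lines of Python. This is only an illustration of the string construction (quotes around the full InChI, then the word inchi in square brackets); the function name `build_inchi_query` is hypothetical, the methane InChI is just a sample input, and the snippet does not contact PubChem or assume anything about its programming interfaces.

```python
def build_inchi_query(inchi: str) -> str:
    """Wrap a complete InChI string in quotes and append the [inchi] tag.

    Hypothetical helper: it only builds the search text described in the
    interview; it performs no network request.
    """
    cleaned = inchi.strip()
    if not cleaned:
        raise ValueError("an InChI string is required")
    if '"' in cleaned:
        # A stray quote would break the quoted search term.
        raise ValueError("InChI string must not contain double quotes")
    return f'"{cleaned}"[inchi]'


# Sample input: the InChI for methane.
print(build_inchi_query("InChI=1S/CH4/h1H4"))
```

The resulting text is what a user would type into the PubChem search box to look up that exact InChI.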
I’m optimistic about that. The American Chemical Society, CAS’s parent organization, participated in our advisory group meeting in December and that group will meet again later this year. This gives us a way to ensure the chemical information organizations are informed about what we’re doing and they can express any concerns. I hope that now that the communication channels are open, we can address those concerns.
Some users have pointed out errors in PubChem on the CHMINF-L discussion group. Are those being rectified?
The errors in question concerned the name-to-structure association and were actually present in the information deposited by an external organization. In the particular example cited, structures were deposited from a crystal structure repository where precise structural information for the non-polymeric ligands and co-factors is not necessarily stored. Specifically, there are no hydrogens and no bond orders for those molecules, so the error arose because the compound was named as an aldehyde but the structure our automated process identified was the corresponding alcohol; the only difference described in the crystallography data was a bond length.
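To illustrate why a bond length alone can mislead an automated process, here is a minimal Python sketch. It is not PubChem's actual perception code: the function name, the midpoint rule, and the `coordinate_error` parameter are all assumptions made for illustration. The two reference values are approximate textbook bond lengths for a carbon-oxygen double bond (as in an aldehyde) and single bond (as in an alcohol).

```python
from typing import Optional

# Approximate textbook carbon-oxygen bond lengths, in angstroms.
TYPICAL_CO_DOUBLE = 1.21  # C=O, e.g. an aldehyde
TYPICAL_CO_SINGLE = 1.43  # C-O, e.g. an alcohol


def guess_co_bond_order(length: float, coordinate_error: float = 0.0) -> Optional[int]:
    """Guess a C-O bond order from its length, or return None if ambiguous.

    Hypothetical rule: compare the length with the midpoint between the two
    typical values; if the length lies within `coordinate_error` of that
    midpoint, the data cannot distinguish the two cases.
    """
    midpoint = (TYPICAL_CO_DOUBLE + TYPICAL_CO_SINGLE) / 2
    if abs(length - midpoint) <= coordinate_error:
        return None
    return 2 if length < midpoint else 1


print(guess_co_bond_order(1.22))                        # clearly short: double bond
print(guess_co_bond_order(1.42))                        # clearly long: single bond
print(guess_co_bond_order(1.30, coordinate_error=0.1))  # too close to call
```

With no hydrogens or bond orders in the deposited record, a length that falls between the two typical values, or one recorded with modest coordinate uncertainty, gives a rule of this kind little to go on, which is how an aldehyde can end up registered as the corresponding alcohol.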
How might such errors be avoided?
Well, we don’t have a staff of curators, and we’d have to be CAS to begin checking the eight million records coming in, so those kinds of errors can get through; it’s just part of the nature of an open deposition system. There are errors in other databanks too, and that is just part of the science: the science of determining these things just cannot be 100% accurate. We cannot be perfect in these name-structure connections, for that and other reasons. For example, if you look up acetol you will find the trivial name hydroxyacetone, but you will also find aspirin, because Acetol is one of the brand names for aspirin in some countries. So, you ask which is the right name, and you find even the chemists cannot agree.
Perhaps a Wikipedia approach could be used to help curate PubChem?
We have a mechanism in place called LinkOut, so that anyone can tie a PubChem record to a particular website, and these links can be batch submitted. This TPA (third-party annotation) approach might offer a way for other opinions or additional information to be provided. A practical approach to the name-structure issue might involve adding search filters that allow a user to see names specific to particular sources, for instance, knowing that a particular depositor uses a specific name to refer to a particular compound. This might give us a practical way to take information from different depositors where trivial or non-standard chemical names are involved.