Wednesday, 8 December 2010

Name that stereochemistry - When Mol files go wrong

Here's an image from PubChem. I count two chiral centers, but for how many of these is the chirality specified?

As Egon pointed out in a comment on the original post, it's impossible to interpret this image without assuming the use of a particular wedge/hash bond convention. Either the stereochemistry is defined at both ends of the wedge, or it's defined only at one end.

The problem with MOL files

But this isn't just a problem with images of molecules; this same problem affects 2D MOL files. The underlying MOL file for the above molecule has that same bond marked as a wedge. Any time this happens to a bond connecting two chiral centers, there is a resulting ambiguity that requires a particular convention to be assumed. If your primary means of storing chemical data is a 2D MOL file, you should start feeling nervous right about now.

To be fair, the example discussed is really an isolated case in PubChem. Among the first 23071 molecules in PubChem, there are 14362 bonds connecting two chiral centers, but only 21 of them have been marked as wedge or hash. (I note that for all of these cases, it was possible to choose a different stereobond to avoid this problem.)

ChEMBL is another database whose primary means of storing chemical information is 2D MOL files. It contains 635933 molecules, with 482773 inter-chiralcenter bonds, of which 7335 are marked as stereobonds. In other words, if each affected molecule has only one such bond, more than 1% of the molecules have an ambiguous stereocenter, simply because of the way the stereochemistry was encoded into the MOL file. This is probably a bit high, and I expect that ChEMBL will either fix this or point out an error in my calculations.
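For anyone who wants to check the arithmetic, here is a small sketch that turns the counts quoted above into percentages. The variable names are mine, and the per-molecule figure is only an upper bound, since a single molecule may contain more than one such bond.

```python
def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole

# Counts quoted in the text above.
pubchem_flagged = percent(21, 14362)         # wedge/hash among inter-chiralcenter bonds
chembl_flagged = percent(7335, 482773)       # same ratio for ChEMBL
chembl_per_molecule = percent(7335, 635933)  # upper bound on affected molecules

if __name__ == "__main__":
    print(f"PubChem sample: {pubchem_flagged:.2f}% of inter-chiralcenter bonds")
    print(f"ChEMBL:         {chembl_flagged:.2f}% of inter-chiralcenter bonds")
    print(f"ChEMBL:         at most {chembl_per_molecule:.2f}% of molecules")
```

So the rate per bond in ChEMBL is roughly ten times the rate in the PubChem sample.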

Interpret this

Okay. So we know the problem. But we still need to answer the original question...what stereochemistry was intended for the molecule in question?

First, here's the PubChem record.

The most common (and recommended) convention for handling a wedge/hash bond connecting two stereocenters is to consider the stereochemistry as defined only at the thin end of the wedge. If you look at the SMILES string calculated by OEChem, and the InChI string (calculated by the official InChI binary?), you will see that this is how the MOL file was interpreted.
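To make the difference between the two conventions concrete, here is a minimal sketch. The function and convention names are hypothetical, not taken from any toolkit. It relies on the fact that in a MOL file the first atom listed for a wedge/hash bond is the thin end.

```python
def stereo_atoms_for_wedge(begin_atom, end_atom, convention="tip_only"):
    """Return the atoms whose stereochemistry a wedge/hash bond defines.

    begin_atom is the thin end of the wedge (the first atom of the bond
    in a MOL file); end_atom is the wide end. The convention names are
    illustrative labels only.
    """
    if convention == "tip_only":
        # Recommended reading: only the thin end is defined.
        return [begin_atom]
    if convention == "both_ends":
        # Alternative reading: both atoms of the bond are defined.
        return [begin_atom, end_atom]
    raise ValueError(f"unknown convention: {convention!r}")

if __name__ == "__main__":
    # Same bond, two readings; they differ only if atom 7 is a stereocenter.
    print(stereo_atoms_for_wedge(4, 7))
    print(stereo_atoms_for_wedge(4, 7, convention="both_ends"))
```

The two readings only disagree when the wide end is itself a stereocenter, which is exactly the situation in the molecule above.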

But, here's the thing...

I think that the convention that the generator of the MOL file was following was that the stereochemistry was defined at both ends. Why? Well, firstly, similar molecules in the database have their stereochemistry clearly defined, most notably ascorbic acid of which this is a derivative. And secondly, the chiral marks in the connection table in the MOL file indicate that the stereochemistry is defined at both centers. Correction (see comment by WDI below): The chiral marks do not indicate this - they actually say that the stereochem is only defined at one center.

What to do

The reason I'm even looking into this is because I'm trying to figure out how Open Babel should handle these cases. When reading them, should it just assume that the MOL file was following the common convention but issue a warning to flag up the fact that the MOL file sucks? Should it provide an option to read other conventions? Should it avoid writing files that contain inter-chiralcenter stereobonds even if they were in the input, or will that upset users who expect Open Babel to pass wedge/hash bonds through unchanged?

Of course, this could all be avoided if people would just fix their 2D MOL files. With that in mind, here are a couple of lines of code to identify such problems (requires dev version of OB 2.3):


Wolf-D. Ihlenfeldt said...

This blog post makes several incorrect assertions - which could have been easily checked prior to posting by a) reading the MDL CTFILE spec, and b) contacting PubChem...

The MOL file has a '3' chirality label on the wedge base atom. This means that there is explicitly no stereochemistry at that atom.

PubChem processing adheres to the 'tip only' convention.

The choice of the wedge locations in the image is balanced between various conventions and factors - here the exo-ring-bond received the highest rating, since people tend to have difficulties interpreting wedges in rings. But there is a factor to principally discourage the placement of wedges to other stereo centers, thus the comparatively low incidence rate for this kind of display.

baoilleach said...

@Wolf: Sorry, my mistake - I misinterpreted the '3' - I'll correct the text.

The main point remains though. The convention adopted should either be clearly described or such inter-chiralcenter bonds should be avoided. This applies to in-house databases as much as publicly available ones.

In the example shown, an exo-ring C-H bond could have been chosen as the wedge location without any problem. In fact, this is what was done for other related structures.

Evan Bolton said...

Not to belabor Wolf's point... but this is also a matter of taste... the "wiggly-wedge" bond is suppressed... and is not visible to the algorithm that places non-wiggly wedges...

If we didn't suppress this during placement, the user wonders why we put the solid wedge in the ring... not realizing there is an undefined stereo center on the exo atom... and thinking the algorithm is messed up for placing wedges. :-)

A no-win scenario... like many problems in cheminformatics, no matter how you bite into the pickle, it is still sour.

baoilleach said...

Maybe the question I should have asked is whether anything other than the "tip only" convention is ever seen in the wild.

If it isn't, then the whole thing is really a non-issue.