Friday 30 December 2016

The clockwisdom of SMARTS

Image credit: Jonathan Cohen (CC-BY-NC)


In earlier posts I investigated how stereochemistry is represented in SMILES. Here I'm going to try to figure out something I thought would be relatively simple: how to write a SMARTS pattern that matches a chiral molecule and all of its superstructures. For example, consider searching for glucopyranose-containing molecules in a database that also contains other hexopyranose epimers. As background, see the Mischievous SMARTS post by John Mayfield.

As a simple example, let's take the molecule represented by the SMILES F[C@@H](Br)Cl. I want to write a SMARTS pattern that matches this, as well as all superstructures. In this context, given that halogens are typically monovalent, such superstructures are those where the hydrogen is replaced by an arbitrary R group.

Now, you may be aware that every SMILES string is also a valid SMARTS pattern. Unfortunately, it is also true that this is rarely the SMARTS pattern that you want. In this particular case, the original SMILES string, when interpreted as a SMARTS query, requires that the C has exactly one hydrogen attached. In other words, it won't match any superstructures where the hydrogen has been replaced (the only ones it could match would involve the elusive 5-valent carbon).

So let's leave out the H to give F[C@@](Br)Cl. This cannot be read as SMILES (at least not without warnings) since a chiral carbon requires four neighbours, but it is a valid SMARTS pattern. The question is: what does it match?

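The answer is more subtle than I, at least, expected. It will only match molecules that correspond to the pseudo-SMILES F[C@@](X)(Br)Cl, where X is any atom, including an implicit H. That is, the missing fourth neighbour is treated as sitting where an implicit H would be, immediately after the preceding atom (here, the F).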

Equally relevant is what it won't match. If X is F, the carbon is no longer a stereocentre and the pattern fails to match, even though I would consider that a perfectly reasonable superstructure. By the same logic, it also won't match any other case where the stereo at the carbon is undefined.
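To make the matching behaviour described above concrete, here is a minimal sketch in Python. It is a toy model of the semantics as I understand them, not the algorithm used by any particular toolkit. The names `strict_match`, `is_even` and `WILDCARD` are made up for illustration, and a stereocentre is reduced to a winding ("@" or "@@", or None if undefined) plus its four neighbours in written order, with an implicit H (or the query's missing neighbour) placed right after the preceding atom.

```python
from itertools import permutations

WILDCARD = "*"  # stands for the unwritten fourth neighbour in the query


def is_even(perm):
    """Return True if perm (a permutation of 0..n-1) is an even permutation."""
    perm = list(perm)
    swaps = 0
    for i in range(len(perm)):
        while perm[i] != i:
            j = perm[i]
            perm[i], perm[j] = perm[j], perm[i]  # puts value j at index j
            swaps += 1
    return swaps % 2 == 0


def strict_match(query, target):
    """Toy tetrahedral match of a chiral query centre against a target centre.

    query  = (winding, neighbours), winding is "@" or "@@"
    target = (winding, neighbours), winding may be None (stereo undefined)
    """
    q_wind, q_nbrs = query
    t_wind, t_nbrs = target
    if t_wind is None:
        return False  # a strict chiral query needs a defined stereocentre
    for perm in permutations(range(4)):
        # perm[i] is the index of the target neighbour assigned to query neighbour i
        labels_ok = all(q == WILDCARD or q == t_nbrs[p]
                        for q, p in zip(q_nbrs, perm))
        # an even reordering keeps the winding, an odd one flips it
        if labels_ok and is_even(perm) == (q_wind == t_wind):
            return True
    return False


query = ("@@", ["F", WILDCARD, "Br", "Cl"])  # F[C@@](Br)Cl

print(strict_match(query, ("@@", ["F", "H", "Br", "Cl"])))  # F[C@@H](Br)Cl    -> True
print(strict_match(query, ("@", ["F", "H", "Br", "Cl"])))   # F[C@H](Br)Cl     -> False
print(strict_match(query, ("@", ["C", "F", "Br", "Cl"])))   # C[C@](F)(Br)Cl   -> True
print(strict_match(query, ("@@", ["C", "F", "Br", "Cl"])))  # C[C@@](F)(Br)Cl  -> False
print(strict_match(query, (None, ["F", "F", "Br", "Cl"])))  # FC(F)(Br)Cl      -> False
```

The third case shows why the position of X matters: C[C@](F)(Br)Cl is the same molecule as F[C@@](C)(Br)Cl, since swapping two neighbours flips the winding. The last case is the one that bothers me, where the extra F removes the stereocentre and the strict query simply fails.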

So in the end, I have come to the view that the best SMARTS pattern to use is F[C@@?](Br)Cl, which also matches the case where stereo is undefined. Better to cast the net wide, and if someone really doesn't want to match those cases, it is easy to do a search-and-replace to remove the ?.
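The search-and-replace could look something like the following sketch in Python. The function name `make_strict` is hypothetical, and the regular expression assumes that a ? in the pattern only ever appears directly after a chirality mark, which holds for the patterns in this post but which I have not checked against every toolkit's SMARTS extensions.

```python
import re


def make_strict(smarts):
    """Turn '@?' / '@@?' into '@' / '@@' so that centres with
    undefined stereo no longer match (hypothetical helper)."""
    return re.sub(r"(@{1,2})\?", r"\1", smarts)


print(make_strict("F[C@@?](Br)Cl"))  # F[C@@](Br)Cl
print(make_strict("F[C@?](Br)Cl"))   # F[C@](Br)Cl
```

Patterns without a ? pass through unchanged, so it is safe to apply this to a mixed list of queries.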