Saturday 5 May 2018

When all you want is a ring

I've just added a method called CopySubstructure() to Open Babel. It's something I've missed for a while, the ability to copy part of a molecule into a new one. The basic usage is where the user gives a list of atoms to copy. When you copy the substructure, all bonds between atoms in the substructure are retained (and only those). By default, hydrogen counts are adjusted to account for missing bonds.

As an example of use, if you copy all atoms that are in rings, you will find all of the ring systems:
import pybel
ob = pybel.ob

atomsToCopy = ob.OBBitVec()
bondsToExclude = ob.OBBitVec()

def CopyRingSystem(orig, copy):
    atomsToCopy.Clear()
    for atom in ob.OBMolAtomIter(orig):
        if atom.IsInRing():
            atomsToCopy.SetBitOn(atom.GetIdx())
    return orig.CopySubstructure(copy, atomsToCopy)
So Clc1ccccc1OC2CCC2 will become c1ccccc1.C2CCC2. If you do this for all of ChEMBL, split on the dots, and collate using the canonical SMILES, you can find the frequencies of all ring systems. Well, almost...

Even for this 'simple' use case, there are some subtleties that may trap the unwary when copying substructures of a molecule. For example, aromaticity is copied as present in the original molecule, and this may or may not be what one wants - it's up to the user to decide, and to mark aromaticity as unperceived if they wish. In this context, it's worth bearing in mind that the Daylight aromaticity model includes contributions from some exocyclic double bonds (and these won't be present with the code above).

Another quirk is that while it may seem that the code above creates SMILES strings for isolated ring systems, that's not quite true. If two ring systems are connected by a bond, then that bond will be retained even where it itself is not in a ring (because all bonds between atoms in the substructure are retained - see the very first paragraph above). For example, c1ccccc1C2CCC2 will stay c1ccccc1C2CCC2.

Which brings us to the second argument to the method, a list of bonds to exclude. To get the behaviour we might expect, all we need to do is add all bonds that are not in a ring to this list of bonds to exclude:
...continued from above
def CopyRingSystem_2(orig, copy):
    atomsToCopy.Clear()
    for atom in ob.OBMolAtomIter(orig):
        if atom.IsInRing():
            atomsToCopy.SetBitOn(atom.GetIdx())
    bondsToExclude.Clear()
    for bond in ob.OBMolBondIter(orig):
        if not bond.IsInRing():
            bondsToExclude.SetBitOn(bond.GetIdx())
    return orig.CopySubstructure(copy, atomsToCopy, bondsToExclude)

Notes:
Here is a main() that could be used with the code above:
if __name__ == "__main__":
    copied = ob.OBMol()
    with open("outputC.smi", "w") as out:
        for mol in pybel.readfile("smi", r"C:\Tools\LargeData\sortedbylength.smi"):
            copied.Clear()
            ok = CopyRingSystem_2(mol.OBMol, copied)
            assert ok
            if copied.NumAtoms() > 0:
                smi = pybel.Molecule(copied).write("smi").rstrip()
                out.write("\n".join(smi.split(".")))
                out.write("\n")

No comments: