Noel O'Blog: April 2018

Friday 27 April 2018

Finding Texas carbons and other unusual valencies

So, having disposed of the phrase "invalid molecule" in the previous post, let's find some :-). Step one is to define what I mean by "invalid molecule": I'm going to use this to mean any molecule that contains an atom that is not on a list of common charge states and valencies. Rather conveniently, I have a table to hand provided by Daniel Lowe which I can use for this purpose. If an element is not in the table, it's fine; but if it is, then it has to be in one of the listed charge states and valencies.
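To make the rule concrete, here is a minimal toolkit-free sketch of the lookup, using only the carbon, nitrogen and oxygen rows copied from the table in the code further down. The names EXCERPT and is_common are made up for illustration, and "valence" here means the bond order sum plus implicit hydrogens; the real check works on Open Babel atoms and also has a TEMPO-like exception that this sketch leaves out.

```python
# Rows for carbon, nitrogen and oxygen only, copied from the full table:
# atomic number -> formal charge -> list of allowed valencies
EXCERPT = {
    6: {-2: [2], -1: [3], 0: [4], 1: [3], 2: [2]},
    7: {-2: [1], -1: [2], 0: [3, 5], 1: [4], 2: [3]},
    8: {-2: [0], -1: [1], 0: [2], 1: [3, 5]},
}

def is_common(atomic_num, charge, valence, table=EXCERPT):
    """Elements missing from the table always pass; listed elements
    must be in a listed charge state with a listed valence."""
    if atomic_num not in table:
        return True
    return valence in table[atomic_num].get(charge, [])

print(is_common(6, 0, 4))   # ordinary neutral carbon
print(is_common(6, 0, 5))   # Texas carbon: five bonds on a neutral carbon
print(is_common(8, 0, 1))   # neutral oxygen with a single bond (a radical)
print(is_common(26, 0, 0))  # iron is not in the table, so it is not checked
```

The second and third calls are the kind of atom that gets a molecule onto the list of results.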

The code is shown below, but first, the results for ChEMBL 23 (you can copy the whole lot and paste into John Mayfield's CDKdepict to view). There are a couple of Texas carbons, oxygen radicals, and TEMPO-like nitrogens that should be neutral. That's not to say that everything found is dodgy - nitrogen monoxide is rumoured to be stable, for example - but they are definitely unusual and warrant further inspection.

F[Al-3](F)(F)(F)(F)F 181624
C[C@H]([C@@H](CO)NC(=O)[C@@H]1CSSC[C@@H](C(=O)N[C@H](C(=O)N[C@@H](C(=O)N[C@H](C(=O)N[C@H](C(=O)N1)C(C)O)CCCCN)CC2=CN=C3=CC=CC=C32)Cc4ccccc4)NC(=O)[C@@H](Cc5ccccc5)N)O 3349005
CC[C@@]1(C[C@@H]2C[C@@]([C]3C(=c4ccccc4=N3)CCN(C2)C1)(c5cc6c(cc5OC)N([C@@H]7[C@]68CCN9[C@H]8[C@@](C=CC9)([C@H]([C@@]7(C(=O)N)O)O)CC)C)C(=O)OC)O.OS(=O)(=O)O 3349007
CC(C)(C)OC(=O)NCCC(=O)N[C@@H](CC1=CN=C2=CC=CC=C21)C(=O)N[C@@H](CCSC)C(=O)N[C@@H](CC(=O)O)C(=O)N[C@@H](Cc3ccccc3)C(=O)N 3348969
CCCCC1(N(C(CO1)(C)C)[O-])CCCCCCCCCCCCCOP(=O)([O])OC2CC[N+](CC2)(C)C 606677
CCCCCCCCCCCC1(N(C(CO1)(C)C)[O-])CCCCCCOP(=O)([O])OC2CC[N+](CC2)(C)C 606678
CCCCCCC1(N(C(CO1)(C)C)[O-])CCCCCCCCCCCOP(=O)([O])OC2CC[N+](CC2)(C)C 606679
CCCCCCCCCCCCCC1(N(C(CO1)(C)C)[O-])CCCCOP(=O)([O])OC2CC[N+](CC2)(C)C 606742
CCCCCCCCC1(N(C(CO1)(C)C)[O-])CCCCCCCCCOP(=O)([O])OC2CC[N+](CC2)(C)C 606743
CC(C)(C)c1cc(cc(c1[O])C(C)(C)C)CCNc2c3c(ncn2)n(cn3)[C@H]4[C@@H]([C@@H]([C@H](O4)CO)O)O 1098520
CC(C)(C)c1cc(cc(c1[O])C(C)(C)C)CCNc2c3c(ncn2)n(cn3)[C@H]4[C@@H]([C@@H]([C@H](O4)C(=O)NC)O)O 1098715
CC(C)(C)c1cc(cc(c1[O])C(C)(C)C)C(=O)NCCc2ccc(cc2)Nc3c4c(ncn3)n(cn4)[C@H]5[C@@H]([C@@H]([C@H](O5)C(=O)NC)O)O 1094994
[N]=O 1200689
CCCCCCC1(N(C(CO1)(C)C)[O-])CCCCCCCCCCCOP(=O)([O])OCC[N+](C)(C)C 606744
c1cc(ccc1C(=N)N)OCCCCCOc2ccc(cc2)C(=N)N.C(CO[S](=O)=O)O.C(CO[S](=O)=O)O 1405011
OP1(=O)OP(=O)(OP(=O)(O1)O)O.[Al] 1433450
C1[C@@H]2[C@@H]([C@H](O1)[C@H]([CH]O2)O)C[C@H]3[C@@H]([C@H]([C@H]([C@H](O3)CO)OS(=O)(=O)O)O[C@@H]4[C@@H]([C@@H]5[C@H]([C@H](O4)CO5)O[C@H]6[C@@H]([C@H]([C@H]([C@H](O6)CO)O)[O])O)OS(=O)(=O)O)O 1525673
c1ccc(cc1)N(c2ccccc2)[N]c3c(cc(cc3[N+](=O)[O-])[N+](=O)[O-])[N+](=O)[O-] 1668332
[B].c1ccc(cc1)[Si](c2ccccc2)(c3ccccc3)OCCN4CC4 1985326
[B].CS(=O)(=O)O[C@H]1CN2C[C@H]([C@H]([C@H]2[C@H]1CO)O)O 1974383
[B]C(=O)NC(CC(C)C)C(=O)NC1c2ccsc2C(=O)C1O.CN(C)C 1974647
C(C(C(=O)O)S)(C(=O)O)S.[As] 1991929
[B]C(=O)NC(CC(C)C)C(=O)NCC(=O)NC1c2ccsc2C(=O)C1O.CN(C)C 1986958
[B].CN(C)C/C(=C(\c1ccc(c(c1)OC)OC)/Cl)/c2ccc(c(c2)OC)OC 2005771
CC[n+]1c2ccccc2sc1/C=C(/C)\C#C/C=C\3/N(c4ccccc4S3)CC.[F-][P+5]([F-])([F-])([F-])([F-])[F-] 1992520
CN(C)c1ccc(c(c1)[O-])N=O.CN(C)c1ccc(c(c1)[O-])N=O.[Si+4].Cl.[Cl-].[Cl-] 2146183
CC#N.C1CC[C@@H]([C@@H](C1)O)[O-].C1CC[C@@H]([C@@H](C1)O)[O-].[Cl-].[Cl-].[Te+4] 2146182
[H+].C1C[O-][Te+4][O-]1.N.[Cl-].[Cl-].[Cl-] 2146259
CCCCC1C[O-][Te+4][O-]1.N.[Cl-].[Cl-].[Cl-] 2146289
CCCCCCC1C[O-][Te+4][O-]1.N.[Cl-].[Cl-].[Cl-] 2146290
[C] 2106049
[S] 2105487
[F-].[F-].[F-].[F-].[F-].[F-].[Si+4] 2310952
CC[C@H](C)[C@@H]([C@@H](CC(=O)N1CCC[C@H]1[C@@H]([C@@H](C)C(=O)N[C@H](C)[C@H](c2ccccc2)O)OC)OC)N(C)C(=O)[C@H](C(C)C)NC(=O)[C@H](C(C)C)N(C)C(=O)OCc3ccc(cc3)NC(=O)[C@H](CCCNC(=O)N)NC(=O)[C@H](C(C)C)NC(=O)CCCCCN4C(=O)C[CH]C4=O 2364667
c1ccc(cc1)C(=O)CC(=O)c2ccccc2.c1ccc(cc1)C(=O)CC(=O)c2ccccc2.c1ccc(cc1)C(=O)CC(=O)c2ccccc2.[Si+4].Cl.Cl 2374292
c1ccc(c(=O)cc1)[O-].c1ccc(c(=O)cc1)[O-].c1ccc(c(=O)cc1)[O-].[Si+4].[Cl-] 2374293
CC(C)(C)C(=O)CC(=O)C(C)(C)C.CC(C)(C)C(=O)CC(=O)C(C)(C)C.[Si+4].Cl.Cl 2374299
c1cc2cccnc2c(c1)[O-].c1cc2cccnc2c(c1)[O-].c1cc2cccnc2c(c1)[O-].c1cc2cccnc2c(c1)[O-].[Si+4].Cl 2374305
CN1C=NNC1(=S)c2c(c3c(c(nnc3s2)c4ccccc4)c5ccccc5)O 2299271
[Al+3].[P-3] 2272784
Cc1ccc(c2c1oc-3c(c(=O)c(c(c3n2)C(=O)N[C@H]4[C@H](OC(=O)[C@@H](N(C(=O)CN(C(=O)[C@@H]5CCCN5C(=O)[C@H](NC4=O)C(C)C)C)C)C(C)C)C)NCCNC6CC([N+](C(C6)(C)C)[O-])(C)C)C)C(=O)N[C@H]7[C@H](OC(=O)[C@@H](N(C(=O)CN(C(=O)[C@@H]8CCCN8C(=O)[C@H](NC7=O)C(C)C)C)C)C(C)C)C 3228736
Cc1ccc(c2c1oc-3c(c(=O)c(c(c3n2)C(=O)N[C@H]4[C@H](OC(=O)[C@@H](N(C(=O)CN(C(=O)[C@@H]5CCCN5C(=O)[C@H](NC4=O)C(C)C)C)C)C(C)C)C)NC6CC([N+](C(C6)(C)C)[O-])(C)C)C)C(=O)N[C@H]7[C@H](OC(=O)[C@@H](N(C(=O)CN(C(=O)[C@@H]8CCCN8C(=O)[C@H](NC7=O)C(C)C)C)C)C(C)C)C 3228735
Cc1ccc(c2c1oc-3c(c(=O)c(c(c3n2)C(=O)N[C@H]4[C@H](OC(=O)[C@@H](N(C(=O)CN(C(=O)[C@@H]5CCCN5C(=O)[C@H](NC4=O)C(C)C)C)C)C(C)C)C)NCCCNC6CC([N+](C(C6)(C)C)[O-])(C)C)C)C(=O)N[C@H]7[C@H](OC(=O)[C@@H](N(C(=O)CN(C(=O)[C@@H]8CCCN8C(=O)[C@H](NC7=O)C(C)C)C)C)C(C)C)C 3228737
[NH4+].[NH4+].F[Si-2](F)(F)(F)(F)F 3182693
Cc1cn(c(=O)[nH]c1=O)C2C(C(C(O2)C(COC(=O)C)NC(=O)C(CC3=C4=CC=CC=C4N=C3)NC(=O)OC(C)(C)C)OC(=O)C)OC(=O)C 3187332
C1CCN2CC3CC(C2C1)CN4C3CCCC4=O.O[Cl+3]([O-])([O-])[O-] 3186825
F[Si-2](F)(F)(F)(F)F.[Na+].[Na+] 3184182
C[Si](C)O[Si](C)(C)O[Si](C)(C)O[Si](C)C 3184420
CN(C)[N](=NOc1cc(c(cc1[N+](=O)[O-])[N+](=O)[O-])ON=[N+](N2CCN(CC2)C(=O)c3cc(ccc3F)Cc4c5ccccc5c(=O)[nH]n4)[O-])O 3188868
CNc1ccc(cc1)C(=O)Oc2cc(c(cc2C#N)[N+](=O)[O-])ON=[N](N(C)C)O 3187972
C[Si](O[Si](C)(C)C)O[Si](C)(C)C 3189129
c1cc(cc(c1)NC(=O)C(=O)O)C2=NN=N[N]2 3188982
CC(C)N1c2c(c(ncn2)N)C(=C3C=c4cc(ccc4=N3)O)[N]1 3209955
c1ccn2c(c1)c(c(=O)n(c2=O)CCCCN3CCC(CC3)c4c[nH]c5c4F=C(C=C5)F)c6ccc(cc6)F 3397072
CC[C@@]1(C[C@@H]2C[C@@]([C]3C(=c4ccccc4=N3)CCN(C2)C1)(c5cc6c(cc5OC)N([C@@H]7[C@]68CCN9[C@H]8[C@@](C=CC9)([C@H]([C@@]7(C(=O)N)O)O)CC)C)C(=O)OC)O 3545878
c1ccc2c(c1)C(=O)O[Mg]3(O2)Oc4ccccc4C(=O)O3.O.O.O.O 3561635
[H]1C[C@@H](C[C@@H]2[C@]1([C@H]3C[C@H]([C@@]4([C@H](CC[C@@]4([C@@H]3CC2)O)C\5=CC(=O)O/C5=C\c6ccc(cc6)N(C)C)C)O)C)O[C@H]7C[C@@H]([C@@H]([C@H](O7)C)O[C@H]8C[C@@H]([C@@H]([C@H](O8)C)O[C@H]9C[C@@H]([C@@H]([C@H](O9)C)O)O)O)O 3594279
[CH]=CCc1c[nH]c2c1cccc2 3623239
[CH2]c1c[nH]c2c1cccc2 3623240
[CH]=CCc1c[nH]c2c1cc(cc2)F 3623241
c1ccc2c(c1)C(=O)O[Mg]3(O2)Oc4ccccc4C(=O)O3 3580437
C1C[O-][Te+4][O-]1 3558859
CCCCC1C[O-][Te+4][O-]1 3558860
CCCCCCC1C[O-][Te+4][O-]1 3558861
c1ccc(cc1)C(=O)CC(=O)c2ccccc2.c1ccc(cc1)C(=O)CC(=O)c2ccccc2.c1ccc(cc1)C(=O)CC(=O)c2ccccc2.[Si+4] 3559378
CC(C)(C)C(=O)CC(=O)C(C)(C)C.CC(C)(C)C(=O)CC(=O)C(C)(C)C.[Si+4] 3559379
c1cc2cccnc2c(c1)[O-].c1cc2cccnc2c(c1)[O-].c1cc2cccnc2c(c1)[O-].c1cc2cccnc2c(c1)[O-].[Si+4] 3559380
CC#N.C1CC[C@@H]([C@@H](C1)O)O.C1CC[C@@H]([C@@H](C1)O)O.[Te+4] 3559382
CN(C)c1ccc(c(c1)O)N=O.CN(C)c1ccc(c(c1)O)N=O.[Si+4] 3559383
c1ccc(c(=O)cc1)O.c1ccc(c(=O)cc1)O.c1ccc(c(=O)cc1)O.[Si+4] 3559385

import sys
import pybel
ob = pybel.ob
ob.obErrorLog.StopLogging()

import multiprocessing as mp

common_valencies = {1: {0: [1], 1: [0]},
 2: {0: [0]},
 3: {0: [1], 1: [0]},
 4: {0: [2], 1: [1], 2: [0]},
 5: {-2: [3], -1: [4], 0: [3], 1: [2], 2: [1]},
 6: {-2: [2], -1: [3], 0: [4], 1: [3], 2: [2]},
 7: {-2: [1], -1: [2], 0: [3, 5], 1: [4], 2: [3]},
 8: {-2: [0], -1: [1], 0: [2], 1: [3, 5]},
 9: {-1: [0], 0: [1], 1: [2], 2: [3, 5]},
 10: {0: [0]},
 11: {-1: [0], 0: [1], 1: [0]},
 12: {0: [2], 2: [0]},
 13: {-2: [3, 5], -1: [4], 0: [3], 1: [2], 2: [1], 3: [0]},
 14: {-2: [2], -1: [3, 5], 0: [4], 1: [3], 2: [2]},
 15: {-2: [1, 3, 5, 7], -1: [2, 4, 6], 0: [3, 5], 1: [4], 2: [3]},
 16: {-2: [0], -1: [1, 3, 5, 7], 0: [2, 4, 6], 1: [3, 5], 2: [4]},
 17: {-1: [0], 0: [1, 3, 5, 7], 1: [2, 4, 6], 2: [3, 5]},
 18: {0: [0]},
 19: {-1: [0], 0: [1], 1: [0]},
 20: {0: [2], 1: [1], 2: [0]},
 31: {-2: [3, 5], -1: [4], 0: [3], 1: [0], 2: [1], 3: [0]},
 32: {-2: [2, 4, 6], -1: [3, 5], 0: [4], 1: [3], 4: [0]},
 33: {-3: [0], -2: [1, 3, 5, 7], -1: [2, 4, 6], 0: [3, 5], 1: [4], 2: [3]},
 34: {-2: [0], -1: [1, 3, 5, 7], 0: [2, 4, 6], 1: [3, 5], 2: [4]},
 35: {-1: [0], 0: [1, 3, 5, 7], 1: [2, 4, 6], 2: [3, 5]},
 36: {0: [0, 2]},
 37: {-1: [0], 0: [1], 1: [0]},
 38: {0: [2], 1: [1], 2: [0]},
 49: {-2: [3, 5], -1: [2, 4], 0: [3], 1: [0], 2: [1], 3: [0]},
 50: {-2: [2, 4, 6], -1: [3, 5], 0: [2, 4], 1: [3], 2: [0], 4: [0]},
 51: {-2: [1, 3, 5, 7], -1: [2, 4, 6], 0: [3, 5], 1: [2, 4], 2: [3], 3: [0]},
 52: {-2: [0], -1: [1, 3, 5, 7], 0: [2, 4, 6], 1: [3, 5], 2: [2, 4]},
 53: {-1: [0], 0: [1, 3, 5, 7], 1: [2, 4, 6], 2: [3, 5]},
 54: {0: [0, 2, 4, 6, 8]},
 55: {-1: [0], 0: [1], 1: [0]},
 56: {0: [2], 1: [1], 2: [0]},
 81: {0: [1, 3]},
 82: {-2: [2, 4, 6], -1: [3, 5], 0: [2, 4], 1: [3], 2: [0]},
 83: {-2: [1, 3, 5, 7], -1: [2, 4, 6], 0: [3, 5], 1: [2, 4], 2: [3], 3: [0]},
 84: {0: [2, 4, 6]},
 85: {-1: [0], 0: [1, 3, 5, 7], 1: [2, 4, 6], 2: [3, 5]},
 86: {0: [0, 2, 4, 6, 8]},
 87: {0: [1], 1: [0]},
 88: {0: [2], 1: [1], 2: [0]}}

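def IsAttachedToNitrogen(atom):
    # The default of None avoids a StopIteration for an atom with no explicit neighbours, e.g. [OH]
    nbr = next(ob.OBAtomAtomIter(atom), None)
    return nbr is not None and nbr.GetAtomicNum() == 7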

def HasCommonValence(mol):
    for atom in ob.OBMolAtomIter(mol):
        elem = atom.GetAtomicNum()
        if elem not in common_valencies:
            continue # just skip unusual elements
            # return False # unusual elem
        chg = atom.GetFormalCharge()
        data = common_valencies[elem]
        if chg not in data:
            return False # unusual charge state
        totalbonds = atom.BOSum() + atom.GetImplicitHCount()
        if totalbonds not in data[chg]:
            if not(elem==8 and chg==0 and totalbonds==1 and IsAttachedToNitrogen(atom)): # TEMPO-like
                return False # unusual valence (and not TEMPO-like)
    return True

def calculate(smi):
    mol = pybel.readstring("smi", smi).OBMol
    if not HasCommonValence(mol):
        return smi
    return None

if __name__ == "__main__":
    POOLSIZE = 6 # the number of CPUs
    CHUNKSIZE = 1000
    pool = mp.Pool(POOLSIZE)
    with open("output.txt", "w") as out:
        with open(r"D:\LargeData\ChEMBL\chembl_23.ism", "r") as inp:
            #for result in pool.imap(calculate, inp, CHUNKSIZE):
            for result in pool.imap_unordered(calculate, inp, CHUNKSIZE):
            # for result in map(calculate, inp): # no multiprocessing
                if result:
                    out.write(result)

Wednesday 25 April 2018

Cheminformatics for deep learners: Valid SMILES and valid molecules

Recent papers on generating SMILES strings via deep-learning approaches have awoken the cheminformatics pedant in me when I see references to "valid SMILES" and/or "valid molecules". Do these terms make any sense? Do they mean what the authors think they mean?

As a concrete example here's a line from Rafael Gómez-Bombarelli et al. (a ground-breaking paper in the field, which appeared on arXiv back in 2016): "This increased the accuracy of generated SMILES strings, which resulted in higher fractions of valid SMILES strings."

"Valid SMILES", eh? Here's a nice example of a valid SMILES: C#C(#C)(#C)(#C)(#C), a carbon connected via triple bonds to five other carbons. It's a syntactically valid SMILES string that is happily read by many toolkits; for example, we can use Open Babel to calculate the molecular weight of the corresponding molecule:

> obabel -:"C#C(#C)(#C)(#C)(#C)" -otxt --append mw
77.1039

An invalid SMILES might be missing a parenthesis or ring closure, or begin with a bond symbol, or contain the element Zz or any number of things. The problem here is that the terms "in/valid SMILES" are used by the authors above with some other meaning, presumably related to the likelihood of the existence of the corresponding molecule. As I hope I have demonstrated, the validity of a SMILES string has nothing to do with whether the corresponding molecule might exist or not.
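To underline that these are purely syntactic checks, here is a rough sketch of three of them in plain Python: a leading bond symbol, unbalanced parentheses, and unclosed ring closures. The names BOND_SYMBOLS and syntax_problems are invented for this example, and it is nowhere near a complete SMILES parser (it does not check element symbols, for instance); the point is only that nothing in it knows any chemistry.

```python
BOND_SYMBOLS = set("-=#$:/\\")

def syntax_problems(smi):
    """Return a list of problems found by three simple syntax checks."""
    problems = []
    if smi and smi[0] in BOND_SYMBOLS:
        problems.append("starts with a bond symbol")
    depth = 0           # current nesting level of parentheses
    open_rings = set()  # ring-closure labels opened but not yet closed
    i = 0
    while i < len(smi):
        ch = smi[i]
        if ch == "[":
            # Skip the bracket atom: digits inside are isotopes, H counts or charges
            end = smi.find("]", i)
            if end == -1:
                problems.append("unclosed bracket atom")
                break
            i = end
        elif ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                problems.append("unmatched closing parenthesis")
                depth = 0
        elif ch == "%":
            # Two-digit ring closure such as %12
            label = smi[i + 1:i + 3]
            if len(label) == 2 and label.isdigit():
                open_rings ^= {label}
                i += 2
            else:
                problems.append("malformed % ring closure")
        elif ch.isdigit():
            open_rings ^= {ch}  # first occurrence opens, second closes
        i += 1
    if depth > 0:
        problems.append("unclosed parenthesis")
    if open_rings:
        problems.append("unclosed ring closure")
    return problems

print(syntax_problems("C#C(#C)(#C)(#C)(#C)"))  # the Texas-style carbon passes
print(syntax_problems("C1CC(C"))
```

The carbon with five triple bonds sails through, because none of these checks look at valence.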

What I'm really talking about here is the difference between syntax and semantics: the meaning of the SMILES string versus its symbolic construction. Elsewhere Rafael Gómez-Bombarelli et al. refer to "the fragility of [SMILES] syntax (opening and closing cycles and branches, allowed valences, etc.)". They should have stopped at "branches" - the "allowed valences" (whatever this may mean - it's undefined) have nothing to do with the syntax of a SMILES string.

So maybe "valid molecule" is a better term for that? Or at least that seems to be what people think. But "valid molecule" is an even more nebulous term - what is a valid molecule? One that's not a radical? One that might exist at standard temperatures and pressures without decomposing in a millisecond? Who knows. I think that what people actually mean is that the atoms in the molecule are all in common valences and charge states, or perhaps they just mean that it is rejected by RDKit (which might be for a number of reasons unrelated to the validity of the molecule, e.g. kekulization failure). If that's what they mean, they should just say that.

So please, a bit more clarity and a bit less woolly language. Think of the ~~children~~ cheminformaticians.

Tuesday 24 April 2018

Running CUDA samples with Visual Studio 2017

I've been installing the CUDA drivers on a Windows 10 box with Visual Studio 2017, and trying to get the CUDA samples to compile. Although solution files are provided for VS2017 (among other VSs), you will get something similar to the following error when you attempt to compile:

error MSB8036: The Windows SDK version 10.0.15063.0 was not found. Install the required version of Windows SDK or change the SDK version in the project property pages or by right-clicking the solution and selecting "Retarget solution".

Right-clicking on the solution and retargeting gets you a bit further:

fatal error C1189: #error:  -- unsupported Microsoft Visual Studio version! Only the versions 2012, 2013, 2015 and 2017 are supported!

...which is funny, because I am using VS2017. If you dig into it, it's the specific version that's the problem, and there doesn't seem to be an easy fix.

However, a nice feature (finally!) of VS2017 is that you can optionally install other compiler toolchains. If you rerun your VS2017 Installer, and find the Modify option (under More), you will see a whole bunch of extra features you can install under "Individual components". The one of interest here is "VC++ 2015.3 v140 toolset for desktop". Once installed, you can instead open the Visual Studio 2015 solutions, and the good news is that these successfully compile.

Saturday 14 April 2018

Generating multiple SMILES

While sometimes presented as a negative, the ability to generate multiple SMILES strings for the same molecule can also be a positive, particularly when you want to avoid bias (e.g. machine learning from SMILES - see here and here) or check that an algorithm is atom-order invariant.

Here are two different ways to generate multiple SMILES strings for the same molecule using Open Babel (without introducing dot disconnections). As an example, let's consider my favourite molecule: c1ccccc1C(=O)Cl.

The first approach is to use canonical SMILES...except that the canonical labels are generated randomly. You can do this directly at the command line (see "obabel -Hsmi" for more info):

>obabel -:c1ccccc1C(=O)Cl -osmi -xC
O=C(c1ccccc1)Cl

Each time you do it, a different random SMILES string will be generated [1], up to a total of 16 variants (in this case):

C(=O)(Cl)c1ccccc1
C(=O)(c1ccccc1)Cl
ClC(=O)c1ccccc1
O=C(Cl)c1ccccc1
O=C(c1ccccc1)Cl
c1(C(=O)Cl)ccccc1
c1(ccccc1)C(=O)Cl
c1c(C(=O)Cl)cccc1
c1c(cccc1)C(=O)Cl
c1cc(C(=O)Cl)ccc1
c1cc(ccc1)C(=O)Cl
c1ccc(C(=O)Cl)cc1
c1ccc(cc1)C(=O)Cl
c1cccc(C(=O)Cl)c1
c1cccc(c1)C(=O)Cl
c1ccccc1C(=O)Cl
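As an aside, a handful of these variants is enough to sketch the atom-order invariance check mentioned at the start of this post. Everything named here is made up for the example: element_counts is a toy stand-in for "an algorithm" that only understands the unbracketed atoms used in these strings, and is_order_invariant simply compares its results across the variants.

```python
import re
from collections import Counter

# A few of the 16 variants listed above
VARIANTS = [
    "C(=O)(Cl)c1ccccc1",
    "ClC(=O)c1ccccc1",
    "O=C(c1ccccc1)Cl",
    "c1cc(C(=O)Cl)ccc1",
    "c1ccccc1C(=O)Cl",
]

# Toy tokenizer: two-letter halogens first, then one-letter organic-subset atoms
ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")

def element_counts(smi):
    """Count atoms by element, treating aromatic 'c' the same as 'C'."""
    return Counter(sym.capitalize() for sym in ATOM.findall(smi))

def is_order_invariant(func, smiles_list):
    """True if func gives the same answer for every SMILES in the list."""
    results = [func(smi) for smi in smiles_list]
    return all(r == results[0] for r in results)

print(is_order_invariant(element_counts, VARIANTS))      # counts ignore atom order
print(is_order_invariant(lambda smi: smi[0], VARIANTS))  # first character does not
```

In practice func would be the real calculation (a descriptor, a fingerprint, a canonical SMILES) applied after reading each variant with a toolkit.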

We can generate even more variants by specifying the output order directly - this overrides some decisions that are usually left to the SMILES writer and allows us, for example, to force single bonds to be followed before double bonds:

>obabel -:c1ccccc1C(=O)Cl -osmi -xo 1-2-3-4-5-6-7-9-8
c1ccccc1C(Cl)=O

Using this approach, 32 variants can be generated:

C(=O)(Cl)c1ccccc1
C(=O)(c1ccccc1)Cl
C(Cl)(=O)c1ccccc1
C(Cl)(c1ccccc1)=O
C(c1ccccc1)(=O)Cl
C(c1ccccc1)(Cl)=O
ClC(=O)c1ccccc1
ClC(c1ccccc1)=O
O=C(Cl)c1ccccc1
O=C(c1ccccc1)Cl
c1(C(=O)Cl)ccccc1
c1(C(Cl)=O)ccccc1
c1(ccccc1)C(=O)Cl
c1(ccccc1)C(Cl)=O
c1c(C(=O)Cl)cccc1
c1c(C(Cl)=O)cccc1
c1c(cccc1)C(=O)Cl
c1c(cccc1)C(Cl)=O
c1cc(C(=O)Cl)ccc1
c1cc(C(Cl)=O)ccc1
c1cc(ccc1)C(=O)Cl
c1cc(ccc1)C(Cl)=O
c1ccc(C(=O)Cl)cc1
c1ccc(C(Cl)=O)cc1
c1ccc(cc1)C(=O)Cl
c1ccc(cc1)C(Cl)=O
c1cccc(C(=O)Cl)c1
c1cccc(C(Cl)=O)c1
c1cccc(c1)C(=O)Cl
c1cccc(c1)C(Cl)=O
c1ccccc1C(=O)Cl
c1ccccc1C(Cl)=O

In summary, these approaches allow you to generate all possible SMILES strings consistent with a depth-first ordering of atoms [2], starting from different points and choosing different routes at each branch point. For machine learning, I'd imagine that the first approach would be preferred, as the second generates SMILES strings containing substrings that would never normally be observed (in Open Babel SMILES).
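The idea of "different starting points, different routes at each branch point" can be sketched without a toolkit for a small acyclic molecule. The following toy enumerator handles acetyl chloride, CC(=O)Cl, as a hand-built graph; ATOMS, BONDS, write and all_smiles are names invented for this illustration, there is no ring-closure handling, and it is not how Open Babel's SMILES writer works. Like the second approach above, it places no restriction on the order in which branches are followed.

```python
from itertools import permutations, product

# Acetyl chloride as a hand-built graph: atom index -> element symbol
ATOMS = {0: "C", 1: "C", 2: "O", 3: "Cl"}
# (atom, atom) -> bond symbol, with "" for a single bond
BONDS = {(0, 1): "", (1, 2): "=", (1, 3): ""}

def neighbours(atom):
    """Yield (neighbour index, bond symbol) pairs for an atom."""
    for (a, b), sym in BONDS.items():
        if a == atom:
            yield b, sym
        elif b == atom:
            yield a, sym

def write(atom, parent=None):
    """Return all depth-first SMILES fragments rooted at atom."""
    children = [(n, s) for n, s in neighbours(atom) if n != parent]
    if not children:
        return {ATOMS[atom]}
    results = set()
    # Try every order in which the branches could be followed
    for order in permutations(children):
        options = [[sym + frag for frag in write(n, atom)] for n, sym in order]
        for combo in product(*options):
            # All but the last branch go in parentheses
            branches = "".join("(%s)" % frag for frag in combo[:-1])
            results.add(ATOMS[atom] + branches + combo[-1])
    return results

def all_smiles():
    """Collect the fragments for every possible starting atom."""
    out = set()
    for start in ATOMS:
        out |= write(start)
    return sorted(out)

for smi in all_smiles():
    print(smi)
print(len(all_smiles()))  # 12 for this molecule
```

Starting from the carbonyl carbon gives six strings (one per ordering of its three neighbours), and each of the other three atoms gives two, for 12 in total.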

Python code

import random
random.seed(1)
import pybel

def randomlabels(mol, N):
    ans = set()
    for i in range(N):
        ans.add(mol.write("smi", opt={"C":True}).rstrip())
    return sorted(list(ans))

def randomorder(mol, N):
    ans = set()
    numatoms = mol.OBMol.NumAtoms()
    for i in range(N):
        idxs = list(range(1, numatoms+1))
        random.shuffle(idxs)
        optval = "-".join(str(x) for x in idxs)
        ans.add(mol.write("smi", opt={"o": optval}).rstrip())
    return sorted(list(ans))

if __name__ == "__main__":
    mol = pybel.readstring("smi", "c1ccccc1C(=O)Cl")

    print("Random canonical labels")
    randomsmis = randomlabels(mol, 500)
    print(len(randomsmis))
    for smi in randomsmis:
        print(smi)
    print()
    print("Random output order")
    randomsmis = randomorder(mol, 500)
    print(len(randomsmis))
    for smi in randomsmis:
        print(smi)
    print()

Notes:
1. An alternative (but slower) way to generate these same SMILES would be to shuffle the atoms in the OBMol and then write it out as a SMILES string.
2. If dot disconnections are tolerated, then see Andrew Dalke's approach.
