Friday 27 April 2018

Finding Texas carbons and other unusual valencies

So, having disposed of the phrase "invalid molecule" in the previous post, let's find some :-). Step one is to define what I mean by "invalid molecule": I'm going to use this to mean any molecule that contains an atom that is not on a list of common charge states and valencies. Rather conveniently, I have a table to hand provided by Daniel Lowe which I can use for this purpose. If an element is not in the table, it's fine; but if it is, then it has to be in one of the listed charge states and valencies.

The code is shown below, but first, the results for ChEMBL 23 (you can copy the whole lot and paste into John Mayfield's CDKdepict to view). There are a couple of Texas carbons, oxygen radicals, and TEMPO-like nitrogens that should be neutral. That's not to say that everything found is dodgy - nitrogen monoxide is rumoured to be stable, for example - but they are definitely unusual and warrant further inspection.
F[Al-3](F)(F)(F)(F)F 181624
C[C@H]([C@@H](CO)NC(=O)[C@@H]1CSSC[C@@H](C(=O)N[C@H](C(=O)N[C@@H](C(=O)N[C@H](C(=O)N[C@H](C(=O)N1)C(C)O)CCCCN)CC2=CN=C3=CC=CC=C32)Cc4ccccc4)NC(=O)[C@@H](Cc5ccccc5)N)O 3349005
CC[C@@]1(C[C@@H]2C[C@@]([C]3C(=c4ccccc4=N3)CCN(C2)C1)(c5cc6c(cc5OC)N([C@@H]7[C@]68CCN9[C@H]8[C@@](C=CC9)([C@H]([C@@]7(C(=O)N)O)O)CC)C)C(=O)OC)O.OS(=O)(=O)O 3349007
CC(C)(C)OC(=O)NCCC(=O)N[C@@H](CC1=CN=C2=CC=CC=C21)C(=O)N[C@@H](CCSC)C(=O)N[C@@H](CC(=O)O)C(=O)N[C@@H](Cc3ccccc3)C(=O)N 3348969
CCCCC1(N(C(CO1)(C)C)[O-])CCCCCCCCCCCCCOP(=O)([O])OC2CC[N+](CC2)(C)C 606677
CCCCCCCCCCCC1(N(C(CO1)(C)C)[O-])CCCCCCOP(=O)([O])OC2CC[N+](CC2)(C)C 606678
CCCCCCC1(N(C(CO1)(C)C)[O-])CCCCCCCCCCCOP(=O)([O])OC2CC[N+](CC2)(C)C 606679
CCCCCCCCCCCCCC1(N(C(CO1)(C)C)[O-])CCCCOP(=O)([O])OC2CC[N+](CC2)(C)C 606742
CCCCCCCCC1(N(C(CO1)(C)C)[O-])CCCCCCCCCOP(=O)([O])OC2CC[N+](CC2)(C)C 606743
CC(C)(C)c1cc(cc(c1[O])C(C)(C)C)CCNc2c3c(ncn2)n(cn3)[C@H]4[C@@H]([C@@H]([C@H](O4)CO)O)O 1098520
CC(C)(C)c1cc(cc(c1[O])C(C)(C)C)CCNc2c3c(ncn2)n(cn3)[C@H]4[C@@H]([C@@H]([C@H](O4)C(=O)NC)O)O 1098715
CC(C)(C)c1cc(cc(c1[O])C(C)(C)C)C(=O)NCCc2ccc(cc2)Nc3c4c(ncn3)n(cn4)[C@H]5[C@@H]([C@@H]([C@H](O5)C(=O)NC)O)O 1094994
[N]=O 1200689
CCCCCCC1(N(C(CO1)(C)C)[O-])CCCCCCCCCCCOP(=O)([O])OCC[N+](C)(C)C 606744
c1cc(ccc1C(=N)N)OCCCCCOc2ccc(cc2)C(=N)N.C(CO[S](=O)=O)O.C(CO[S](=O)=O)O 1405011
OP1(=O)OP(=O)(OP(=O)(O1)O)O.[Al] 1433450
C1[C@@H]2[C@@H]([C@H](O1)[C@H]([CH]O2)O)C[C@H]3[C@@H]([C@H]([C@H]([C@H](O3)CO)OS(=O)(=O)O)O[C@@H]4[C@@H]([C@@H]5[C@H]([C@H](O4)CO5)O[C@H]6[C@@H]([C@H]([C@H]([C@H](O6)CO)O)[O])O)OS(=O)(=O)O)O 1525673
c1ccc(cc1)N(c2ccccc2)[N]c3c(cc(cc3[N+](=O)[O-])[N+](=O)[O-])[N+](=O)[O-] 1668332
[B].c1ccc(cc1)[Si](c2ccccc2)(c3ccccc3)OCCN4CC4 1985326
[B].CS(=O)(=O)O[C@H]1CN2C[C@H]([C@H]([C@H]2[C@H]1CO)O)O 1974383
[B]C(=O)NC(CC(C)C)C(=O)NC1c2ccsc2C(=O)C1O.CN(C)C 1974647
C(C(C(=O)O)S)(C(=O)O)S.[As] 1991929
[B]C(=O)NC(CC(C)C)C(=O)NCC(=O)NC1c2ccsc2C(=O)C1O.CN(C)C 1986958
[B].CN(C)C/C(=C(\c1ccc(c(c1)OC)OC)/Cl)/c2ccc(c(c2)OC)OC 2005771
CC[n+]1c2ccccc2sc1/C=C(/C)\C#C/C=C\3/N(c4ccccc4S3)CC.[F-][P+5]([F-])([F-])([F-])([F-])[F-] 1992520
CN(C)c1ccc(c(c1)[O-])N=O.CN(C)c1ccc(c(c1)[O-])N=O.[Si+4].Cl.[Cl-].[Cl-] 2146183
CC#N.C1CC[C@@H]([C@@H](C1)O)[O-].C1CC[C@@H]([C@@H](C1)O)[O-].[Cl-].[Cl-].[Te+4] 2146182
[H+].C1C[O-][Te+4][O-]1.N.[Cl-].[Cl-].[Cl-] 2146259
CCCCC1C[O-][Te+4][O-]1.N.[Cl-].[Cl-].[Cl-] 2146289
CCCCCCC1C[O-][Te+4][O-]1.N.[Cl-].[Cl-].[Cl-] 2146290
[C] 2106049
[S] 2105487
[F-].[F-].[F-].[F-].[F-].[F-].[Si+4] 2310952
CC[C@H](C)[C@@H]([C@@H](CC(=O)N1CCC[C@H]1[C@@H]([C@@H](C)C(=O)N[C@H](C)[C@H](c2ccccc2)O)OC)OC)N(C)C(=O)[C@H](C(C)C)NC(=O)[C@H](C(C)C)N(C)C(=O)OCc3ccc(cc3)NC(=O)[C@H](CCCNC(=O)N)NC(=O)[C@H](C(C)C)NC(=O)CCCCCN4C(=O)C[CH]C4=O 2364667
c1ccc(cc1)C(=O)CC(=O)c2ccccc2.c1ccc(cc1)C(=O)CC(=O)c2ccccc2.c1ccc(cc1)C(=O)CC(=O)c2ccccc2.[Si+4].Cl.Cl 2374292
c1ccc(c(=O)cc1)[O-].c1ccc(c(=O)cc1)[O-].c1ccc(c(=O)cc1)[O-].[Si+4].[Cl-] 2374293
CC(C)(C)C(=O)CC(=O)C(C)(C)C.CC(C)(C)C(=O)CC(=O)C(C)(C)C.[Si+4].Cl.Cl 2374299
c1cc2cccnc2c(c1)[O-].c1cc2cccnc2c(c1)[O-].c1cc2cccnc2c(c1)[O-].c1cc2cccnc2c(c1)[O-].[Si+4].Cl 2374305
CN1C=NNC1(=S)c2c(c3c(c(nnc3s2)c4ccccc4)c5ccccc5)O 2299271
[Al+3].[P-3] 2272784
Cc1ccc(c2c1oc-3c(c(=O)c(c(c3n2)C(=O)N[C@H]4[C@H](OC(=O)[C@@H](N(C(=O)CN(C(=O)[C@@H]5CCCN5C(=O)[C@H](NC4=O)C(C)C)C)C)C(C)C)C)NCCNC6CC([N+](C(C6)(C)C)[O-])(C)C)C)C(=O)N[C@H]7[C@H](OC(=O)[C@@H](N(C(=O)CN(C(=O)[C@@H]8CCCN8C(=O)[C@H](NC7=O)C(C)C)C)C)C(C)C)C 3228736
Cc1ccc(c2c1oc-3c(c(=O)c(c(c3n2)C(=O)N[C@H]4[C@H](OC(=O)[C@@H](N(C(=O)CN(C(=O)[C@@H]5CCCN5C(=O)[C@H](NC4=O)C(C)C)C)C)C(C)C)C)NC6CC([N+](C(C6)(C)C)[O-])(C)C)C)C(=O)N[C@H]7[C@H](OC(=O)[C@@H](N(C(=O)CN(C(=O)[C@@H]8CCCN8C(=O)[C@H](NC7=O)C(C)C)C)C)C(C)C)C 3228735
Cc1ccc(c2c1oc-3c(c(=O)c(c(c3n2)C(=O)N[C@H]4[C@H](OC(=O)[C@@H](N(C(=O)CN(C(=O)[C@@H]5CCCN5C(=O)[C@H](NC4=O)C(C)C)C)C)C(C)C)C)NCCCNC6CC([N+](C(C6)(C)C)[O-])(C)C)C)C(=O)N[C@H]7[C@H](OC(=O)[C@@H](N(C(=O)CN(C(=O)[C@@H]8CCCN8C(=O)[C@H](NC7=O)C(C)C)C)C)C(C)C)C 3228737
[NH4+].[NH4+].F[Si-2](F)(F)(F)(F)F 3182693
Cc1cn(c(=O)[nH]c1=O)C2C(C(C(O2)C(COC(=O)C)NC(=O)C(CC3=C4=CC=CC=C4N=C3)NC(=O)OC(C)(C)C)OC(=O)C)OC(=O)C 3187332
C1CCN2CC3CC(C2C1)CN4C3CCCC4=O.O[Cl+3]([O-])([O-])[O-] 3186825
F[Si-2](F)(F)(F)(F)F.[Na+].[Na+] 3184182
C[Si](C)O[Si](C)(C)O[Si](C)(C)O[Si](C)C 3184420
CN(C)[N](=NOc1cc(c(cc1[N+](=O)[O-])[N+](=O)[O-])ON=[N+](N2CCN(CC2)C(=O)c3cc(ccc3F)Cc4c5ccccc5c(=O)[nH]n4)[O-])O 3188868
CNc1ccc(cc1)C(=O)Oc2cc(c(cc2C#N)[N+](=O)[O-])ON=[N](N(C)C)O 3187972
C[Si](O[Si](C)(C)C)O[Si](C)(C)C 3189129
c1cc(cc(c1)NC(=O)C(=O)O)C2=NN=N[N]2 3188982
CC(C)N1c2c(c(ncn2)N)C(=C3C=c4cc(ccc4=N3)O)[N]1 3209955
c1ccn2c(c1)c(c(=O)n(c2=O)CCCCN3CCC(CC3)c4c[nH]c5c4F=C(C=C5)F)c6ccc(cc6)F 3397072
CC[C@@]1(C[C@@H]2C[C@@]([C]3C(=c4ccccc4=N3)CCN(C2)C1)(c5cc6c(cc5OC)N([C@@H]7[C@]68CCN9[C@H]8[C@@](C=CC9)([C@H]([C@@]7(C(=O)N)O)O)CC)C)C(=O)OC)O 3545878
c1ccc2c(c1)C(=O)O[Mg]3(O2)Oc4ccccc4C(=O)O3.O.O.O.O 3561635
[H]1C[C@@H](C[C@@H]2[C@]1([C@H]3C[C@H]([C@@]4([C@H](CC[C@@]4([C@@H]3CC2)O)C\5=CC(=O)O/C5=C\c6ccc(cc6)N(C)C)C)O)C)O[C@H]7C[C@@H]([C@@H]([C@H](O7)C)O[C@H]8C[C@@H]([C@@H]([C@H](O8)C)O[C@H]9C[C@@H]([C@@H]([C@H](O9)C)O)O)O)O 3594279
[CH]=CCc1c[nH]c2c1cccc2 3623239
[CH2]c1c[nH]c2c1cccc2 3623240
[CH]=CCc1c[nH]c2c1cc(cc2)F 3623241
c1ccc2c(c1)C(=O)O[Mg]3(O2)Oc4ccccc4C(=O)O3 3580437
C1C[O-][Te+4][O-]1 3558859
CCCCC1C[O-][Te+4][O-]1 3558860
CCCCCCC1C[O-][Te+4][O-]1 3558861
c1ccc(cc1)C(=O)CC(=O)c2ccccc2.c1ccc(cc1)C(=O)CC(=O)c2ccccc2.c1ccc(cc1)C(=O)CC(=O)c2ccccc2.[Si+4] 3559378
CC(C)(C)C(=O)CC(=O)C(C)(C)C.CC(C)(C)C(=O)CC(=O)C(C)(C)C.[Si+4] 3559379
c1cc2cccnc2c(c1)[O-].c1cc2cccnc2c(c1)[O-].c1cc2cccnc2c(c1)[O-].c1cc2cccnc2c(c1)[O-].[Si+4] 3559380
CC#N.C1CC[C@@H]([C@@H](C1)O)O.C1CC[C@@H]([C@@H](C1)O)O.[Te+4] 3559382
CN(C)c1ccc(c(c1)O)N=O.CN(C)c1ccc(c(c1)O)N=O.[Si+4] 3559383
c1ccc(c(=O)cc1)O.c1ccc(c(=O)cc1)O.c1ccc(c(=O)cc1)O.[Si+4] 3559385
import sys
import pybel
ob = pybel.ob
ob.obErrorLog.StopLogging()

import multiprocessing as mp

common_valencies = {1: {0: [1], 1: [0]},
 2: {0: [0]},
 3: {0: [1], 1: [0]},
 4: {0: [2], 1: [1], 2: [0]},
 5: {-2: [3], -1: [4], 0: [3], 1: [2], 2: [1]},
 6: {-2: [2], -1: [3], 0: [4], 1: [3], 2: [2]},
 7: {-2: [1], -1: [2], 0: [3, 5], 1: [4], 2: [3]},
 8: {-2: [0], -1: [1], 0: [2], 1: [3, 5]},
 9: {-1: [0], 0: [1], 1: [2], 2: [3, 5]},
 10: {0: [0]},
 11: {-1: [0], 0: [1], 1: [0]},
 12: {0: [2], 2: [0]},
 13: {-2: [3, 5], -1: [4], 0: [3], 1: [2], 2: [1], 3: [0]},
 14: {-2: [2], -1: [3, 5], 0: [4], 1: [3], 2: [2]},
 15: {-2: [1, 3, 5, 7], -1: [2, 4, 6], 0: [3, 5], 1: [4], 2: [3]},
 16: {-2: [0], -1: [1, 3, 5, 7], 0: [2, 4, 6], 1: [3, 5], 2: [4]},
 17: {-1: [0], 0: [1, 3, 5, 7], 1: [2, 4, 6], 2: [3, 5]},
 18: {0: [0]},
 19: {-1: [0], 0: [1], 1: [0]},
 20: {0: [2], 1: [1], 2: [0]},
 31: {-2: [3, 5], -1: [4], 0: [3], 1: [0], 2: [1], 3: [0]},
 32: {-2: [2, 4, 6], -1: [3, 5], 0: [4], 1: [3], 4: [0]},
 33: {-3: [0], -2: [1, 3, 5, 7], -1: [2, 4, 6], 0: [3, 5], 1: [4], 2: [3]},
 34: {-2: [0], -1: [1, 3, 5, 7], 0: [2, 4, 6], 1: [3, 5], 2: [4]},
 35: {-1: [0], 0: [1, 3, 5, 7], 1: [2, 4, 6], 2: [3, 5]},
 36: {0: [0, 2]},
 37: {-1: [0], 0: [1], 1: [0]},
 38: {0: [2], 1: [1], 2: [0]},
 49: {-2: [3, 5], -1: [2, 4], 0: [3], 1: [0], 2: [1], 3: [0]},
 50: {-2: [2, 4, 6], -1: [3, 5], 0: [2, 4], 1: [3], 2: [0], 4: [0]},
 51: {-2: [1, 3, 5, 7], -1: [2, 4, 6], 0: [3, 5], 1: [2, 4], 2: [3], 3: [0]},
 52: {-2: [0], -1: [1, 3, 5, 7], 0: [2, 4, 6], 1: [3, 5], 2: [2, 4]},
 53: {-1: [0], 0: [1, 3, 5, 7], 1: [2, 4, 6], 2: [3, 5]},
 54: {0: [0, 2, 4, 6, 8]},
 55: {-1: [0], 0: [1], 1: [0]},
 56: {0: [2], 1: [1], 2: [0]},
 81: {0: [1, 3]},
 82: {-2: [2, 4, 6], -1: [3, 5], 0: [2, 4], 1: [3], 2: [0]},
 83: {-2: [1, 3, 5, 7], -1: [2, 4, 6], 0: [3, 5], 1: [2, 4], 2: [3], 3: [0]},
 84: {0: [2, 4, 6]},
 85: {-1: [0], 0: [1, 3, 5, 7], 1: [2, 4, 6], 2: [3, 5]},
 86: {0: [0, 2, 4, 6, 8]},
 87: {0: [1], 1: [0]},
 88: {0: [2], 1: [1], 2: [0]}}

def IsAttachedToNitrogen(atom):
    nbr = next(ob.OBAtomAtomIter(atom))
    return nbr.GetAtomicNum() == 7

def HasCommonValence(mol):
    for atom in ob.OBMolAtomIter(mol):
        elem = atom.GetAtomicNum()
        if elem not in common_valencies:
            continue # just skip unusual elements
            # return False # unusual elem
        chg = atom.GetFormalCharge()
        data = common_valencies[elem]
        if chg not in data:
            return False # unusual charge state
        totalbonds = atom.BOSum() + atom.GetImplicitHCount()
        if totalbonds not in data[chg]:
            if not(elem==8 and chg==0 and totalbonds==1 and IsAttachedToNitrogen(atom)): # TEMPO-like
                return False # unusual valence (and not TEMPO-like)
    return True

def calculate(smi):
    mol = pybel.readstring("smi", smi).OBMol
    if not HasCommonValence(mol):
        return smi
    return None

if __name__ == "__main__":
    POOLSIZE = 6 # the number of CPUs
    CHUNKSIZE = 1000
    pool = mp.Pool(POOLSIZE)
    with open("output.txt", "w") as out:
        with open(r"D:\LargeData\ChEMBL\chembl_23.ism", "r") as inp:
            #for result in pool.imap(calculate, inp, CHUNKSIZE):
            for result in pool.imap_unordered(calculate, inp, CHUNKSIZE):
            # for result in map(calculate, inp): # no multiprocessing
                if result:
                    out.write(result)

No comments: