Thursday 29 March 2012

Reduce, recycle and reuse bond closure symbols?

Time for a new poll. I'd like to know what you think about reusing bond closure symbols in SMILES strings. Simply put, when writing SMILES there is a choice between reusing the bond closure symbols or using a new number each time. That is, the following molecule with two rings could be written as:
C1CC1c1ccccc1 or
C1CC1c2ccccc2
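
To make the two conventions concrete, here is a minimal Python sketch that rewrites the bond closure numbers in a SMILES string under either policy. The function name `renumber` and its `reuse` flag are made up for this illustration and are not taken from any toolkit. The sketch only skips bracket atoms and renumbers everything else that looks like a closure symbol, so treat it as a toy rather than a real SMILES writer.

```python
import re

# Bracket atoms are matched first so digits inside them (e.g. [13C]) are left alone.
PATTERN = re.compile(r"(\[[^\]]*\])|(%\d\d|\d)")

def renumber(smiles, reuse):
    """Rewrite bond closure numbers, either reusing freed numbers or never reusing."""
    open_map = {}    # original number -> number used in the output
    in_use = set()   # output numbers that are currently open
    next_new = 1     # next unused number for the no-reuse policy

    def replace(match):
        nonlocal next_new
        if match.group(1):                 # bracket atom: copy unchanged
            return match.group(1)
        original = int(match.group(2).lstrip("%"))
        if original in open_map:           # second occurrence: closes the bond
            new = open_map.pop(original)
            in_use.discard(new)
        else:                              # first occurrence: opens the bond
            if reuse:
                new = 1
                while new in in_use:       # lowest number not currently open
                    new += 1
            else:
                new = next_new
                next_new += 1
            if new > 99:
                raise ValueError("more than 99 bond closure numbers needed")
            open_map[original] = new
            in_use.add(new)
        return str(new) if new < 10 else f"%{new}"

    return PATTERN.sub(replace, smiles)

print(renumber("C1CC1c1ccccc1", reuse=False))  # C1CC1c2ccccc2
print(renumber("C1CC1c2ccccc2", reuse=True))   # C1CC1c1ccccc1
```

Both outputs describe the same molecule; only the labels on the closures differ.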

Here are the pros and cons of choosing not to reuse bond closure symbols (anything missing?):

Pro:
  • Easier to count up the number of rings
  • Easier for a human to find corresponding bond openings and closures
  • Slightly easier to implement
  • Reuse of bond symbols can cause confusion

Con:
  • Use of % notation for two-digit bond closures can cause confusion (people are not so familiar with it)
  • Leads to syntax like C%1145, meaning that bonds 11, 4 and 5 close here (not very intuitive)
  • Limits bond closures to 99 (but more than 99 is unlikely)
  • Not as concise as the alternative (for molecules with at least 10 bond closures)
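
The % notation in the list above can be illustrated with a small Python sketch. The helper names `format_closure` and `parse_closures` are hypothetical, and the 0 to 99 range check is my assumption about the allowed numbers; the parser expects only the run of closure symbols that follows an atom, not a whole SMILES string.

```python
import re

def format_closure(number):
    """Write one bond closure number as SMILES text."""
    if not 0 <= number <= 99:
        raise ValueError("expected a bond closure number from 0 to 99")
    # Single digits are written bare; two-digit numbers need the % prefix.
    return str(number) if number < 10 else f"%{number}"

def parse_closures(text):
    """Split a run of closure symbols into numbers.

    A '%' always consumes exactly two digits; every other digit stands alone.
    """
    return [int(token.lstrip("%")) for token in re.findall(r"%\d\d|\d", text)]

print("".join(format_closure(n) for n in (11, 4, 5)))  # %1145
print(parse_closures("%1145"))                         # [11, 4, 5]
```

Without the `%`, the same characters `1145` would read as four separate single-digit closures, which is where the confusion tends to come from.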

See if you can answer this poll before checking what software A does or what standard B suggests. What do you think should be the standard approach?

Update (17/04/2012): Poll results: 8 for reuse, 3 against

7 comments:

RM said...

My opinions:

For molecules with fewer than 9 bond closures: never reuse

For molecules with more than 9 bond closures, but where no more than 9 closures are interwoven in any one group: reuse, but with the changeover happening between groups (e.g. you might get 1-5 and then 1-6). Bonus points if there's a large linear linker under the changeover.

Cycle-of-cycle molecules where everything is intermixed due only to a few large macrocycles: Use without reuse the low numbers for the macrocycle, and then reuse the remaining ones for the disconnected subgroups, as above.

Large hairy molecules where everything is intermixed and interconnected without an easy way of separating them into sub-groups: Do whatever works. You're unlikely to understand the structure without some SMILES -> Figure program anyway, so does it really matter?

Noel O'Boyle said...

Great comment.

The key question is whether adopting a split strategy (i.e. different behaviour in different circumstances) is a good idea. I would feel that it is better to adopt a single strategy in all cases as it makes it easier for the user to deduce the behaviour from a small number of test examples; otherwise they'd have to read the manual...and no-one does that!

Andrew Dalke said...

I'm a reuse person, but I think it's partially because I get to use a heap to maintain the list of next-available-item, and show off my mad data structure skillz.

The fastest, if there is a small number of rings, is of course to not reuse. The complexity comes when you need to reuse. The heuristics for those (rare) cases become more complicated than maintaining the heap in the first place.

There's another (minor) con with reuse: I think reusing the same ring digit on the same atom is confusing, as in C1CCCC11CCCC1 .
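
As a rough illustration of the heap idea, the following Python sketch keeps released ring digits in a min-heap so that the smallest free number is always handed out next. The class name `RingDigitPool` and its methods are invented for this example, and it does not enforce the upper limit of 99; it is not the code from any particular toolkit.

```python
import heapq

class RingDigitPool:
    """Hands out the smallest free ring closure number and takes back released ones."""

    def __init__(self):
        self._free = []       # min-heap of numbers that were released
        self._next_new = 1    # smallest number that has never been handed out

    def acquire(self):
        # Prefer a released number; any number in the heap is below _next_new.
        if self._free:
            return heapq.heappop(self._free)
        number = self._next_new
        self._next_new += 1
        return number

    def release(self, number):
        heapq.heappush(self._free, number)

pool = RingDigitPool()
a, b, c = pool.acquire(), pool.acquire(), pool.acquire()   # 1, 2, 3
pool.release(b)
pool.release(a)
print(pool.acquire(), pool.acquire(), pool.acquire())      # 1 2 4
```

Never calling `release` gives the no-reuse behaviour, which is why that case needs no data structure beyond a counter.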

Noel O'Boyle said...

@Andrew: Feel free to heapify the implementation in OB :-) (from Line 2503)

Nice example with the two digits on one atom...[five minutes later]...actually that's a really nice corner case. OB seems to do as you suggest and open up a new ring digit. I need to look into exactly how it does that...

Orion said...

I believe that SMILES should be a good balance of human-readable and machine-readable. If we were only interested in the latter, probably this wouldn't even be worth discussing (and there are many other choices, besides). As such, I agree with Andrew that reuse is a good thing (and certainly that same atom reuse is evil!). Still wish it was scoped within parens, though. Would make composing substructures much more predictable. (i.e. C1C(C1CC1)CC1 and C1C(C2CC2)CC1 should mean the same thing, but they don't).
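
The difference between the two strings can be checked with a toy parser. The Python sketch below pairs up ring closures and returns them as atom-index pairs; with the hypothetical `scoped=True` option, each parenthesised branch gets its own table of open ring numbers, which is only one possible reading of the scoping idea. It assumes balanced parentheses and ring digits written directly after their atom, and covers only a small subset of SMILES.

```python
import re

# Atoms (bracket, two-letter halogens, organic subset), closures, other symbols.
TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops])|(%\d\d|\d)|([-=#:/\\().])"
)

def ring_bonds(smiles, scoped=False):
    """Return ring closure bonds as a set of (opening_atom, closing_atom) indices."""
    tables = [{}]       # stack of {ring number: index of the atom that opened it}
    bonds = set()
    atom_index = -1     # index of the most recently read atom
    pos = 0
    while pos < len(smiles):
        match = TOKEN.match(smiles, pos)
        if match is None:
            raise ValueError(f"cannot parse {smiles!r} at position {pos}")
        atom, closure, other = match.groups()
        if atom:
            atom_index += 1
        elif closure:
            number = int(closure.lstrip("%"))
            table = tables[-1]
            if number in table:
                bonds.add((table.pop(number), atom_index))
            else:
                table[number] = atom_index
        elif scoped and other == "(":
            tables.append({})               # a branch starts with no open rings
        elif scoped and other == ")":
            if tables.pop():                # anything still open cannot escape
                raise ValueError("ring closure left open inside a branch")
        pos = match.end()
    if any(tables):
        raise ValueError("unclosed ring closure")
    return bonds

print(ring_bonds("C1C(C1CC1)CC1") == ring_bonds("C1C(C2CC2)CC1"))   # False
print(ring_bonds("C1C(C1CC1)CC1", scoped=True)
      == ring_bonds("C1C(C2CC2)CC1", scoped=True))                  # True
```

Under this reading the trade-off is that a ring can no longer be closed across a branch boundary: a string such as C1CC(C1)C parses under the usual rules but is rejected when `scoped=True`.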

Noel O'Boyle said...

@Andrew, @Orion: Regarding same atom reuse, I see that OB avoids this (whether by accident or design I don't know) by listing the ring openings first, and then the ring closures. Seems counterintuitive, but then again, it does avoid this problem.
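
The effect of the ordering can be seen in a small Python sketch that writes the spiro example by hand, always taking the lowest free ring number. The function `spiro_smiles` and its `openings_first` flag are hypothetical and only mimic the behaviour described above; this is not how OB is actually implemented.

```python
def spiro_smiles(openings_first):
    """Write spiro[4.4]nonane, choosing the lowest free ring number at each step."""
    in_use = set()

    def acquire():
        number = 1
        while number in in_use:
            number += 1
        in_use.add(number)
        return number

    first = acquire()                  # ring opened on the first atom
    out = f"C{first}CCC"
    # The spiro atom both closes the first ring and opens the second.
    if openings_first:
        second = acquire()             # the first ring's number is still taken
        in_use.discard(first)
        out += f"C{second}{first}"
    else:
        in_use.discard(first)          # the number is freed...
        second = acquire()             # ...and handed straight back out
        out += f"C{first}{second}"
    out += f"CCCC{second}"
    return out

print(spiro_smiles(openings_first=False))  # C1CCCC11CCCC1
print(spiro_smiles(openings_first=True))   # C1CCCC21CCCC2
```

Handling the opening before the closure means the closing ring's number is not yet free, so the same digit cannot appear twice on one atom.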

@Orion: Interesting idea and I think I've heard you mention it before. I wonder would it be implementable.

Axel D. said...

I usually don't reuse when manually entering a SMILES notation. But my current approach in generating unique SMILES and CurlySMILES notations applies reuse - without reusing the same ring digit on the same ring atom, but in such a way that each polycyclic subsystem of a molecule starts with ring digit 1.