Wednesday 27 October 2010

Release of Open Babel 2.3

As announced by Geoff on the Open Babel mailing lists:
I am very happy to finally announce the release of Open Babel 2.3.0, the next major release of the open source chemistry toolbox.

Open Babel has been downloaded over 160,000 times and is used in over 40 projects.

This release represents a major update and should be a stable upgrade, strongly recommended for all users of Open Babel. Highlights include a completely rewritten stereochemistry engine, Spectrophore descriptor generation, 2D depiction, improved 3D coordinate generation, conformer searching, and more. Many formats are improved or added, including CIF, PDBQT, SVG, and more. Improved developer API and scripting support and many, many bug fixes are also included.

What's new? See the full release notes.

See the new user guide.

See the updated developer documentation.

To download, see:
http://sourceforge.net/projects/openbabel/files/

For more information, see the project website.

I would like to personally thank a few people for extra effort in improving this release. In alphabetical order: Chris Morley, Noel O'Boyle, Tim Vandermeersch, and Silicos NV, which donated their Spectrophore(TM) framework to Open Babel 2.3.

This is a community project and we couldn't have made this release without you. Many thanks to all the contributors to Open Babel including those of you who submitted feedback, bug reports, and code.

Cheers,
-Geoff

---
Prof. Geoffrey Hutchison
Department of Chemistry
University of Pittsburgh
tel: (412) 648-0492
email: geoffh@pitt.edu
web: http://hutchison.chem.pitt.edu/

Friday 8 October 2010

Measuring information loss in file format conversion Part III

In this (final) post on the topic (see Parts I, II), I will look at the error rate going from SDF->Canonical SMILES (using Open Babel) versus SDF->SMI->Canonical SMILES. Note that unlike InChI, there is no standard 'canonical SMILES' (although there has been some talk about this in the context of OpenSMILES) - each toolkit has its own way of creating it.
C:\>obabel Conformers_00000001.sdf -osmi -O sdf_to_smi.txt
18084 molecules converted

C:\>obabel Conformers_00000001.sdf -ocan -O sdf_to_can.txt
==============================
*** Open Babel Error in CalcCanonicalLabels
maximum time exceeded...
==============================
*** Open Babel Error in CalcCanonicalLabels
maximum time exceeded...
18084 molecules converted

C:\>obabel -ismi sdf_to_smi.txt -ocan -O smi_to_can.txt
==============================
*** Open Babel Error in CalcCanonicalLabels
maximum time exceeded...
==============================
*** Open Babel Error in CalcCanonicalLabels
maximum time exceeded...
18084 molecules converted

C:\>diff sdf_to_can.txt smi_to_can.txt
136c136
< c12=NCCN=c1ncnc2 167
---
> C12=NCCN=C1NCNC2 167
5716c5716
< N12CC[C@@H](CC1)CC2 7527
---
> N12CC[C@H](CC1)CC2 7527
6928c6928
< c12c3c(cc4c1c1c(nn2)c2c(cc1cc4)cccc2)cccc3 9107
---
> c12c3c(cc4c1c1c([nH][nH]2)c2c(cc1cc4)cccc2)cccc3 9107
16377c16377
< c12c(c(c[nH]1)C[C@@H]1N3CC[C@H](C1)CC3)cccc2 21918
---
> c12c(c(c[nH]1)C[C@@H]1N3CC[C@@H](C1)CC3)cccc2 21918
17517c17517
< c\1(=c/2\[n+](=O)cccc2)/n(cccc1)[O-] 23699
---
> C1(C2[N+](=O)CCCC2)N(CCCC1)[O-] 23699

C:\>

So, a failure rate of five out of 18084 (about 0.03%). 7527 and 21918 appear to be the same problem (the same substructure is involved), which is related to graph symmetry, while 167, 9107 and 23699 are kekulization issues. (Update 11/10/2010: 167 and 23699 fixed.)
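For anyone without diff to hand, the same comparison can be sketched in a few lines of Python. This is only an illustration using the standard library, not anything from Open Babel. The function names `read_smiles` and `changed_titles` are made up for this example, and it assumes each line has the form "SMILES, whitespace, title" with unique titles (here, PubChem CIDs).

```python
def read_smiles(lines):
    """Map title -> SMILES for lines of the form 'SMILES<whitespace>title'."""
    table = {}
    for line in lines:
        # Split once on whitespace so that a title containing spaces stays whole.
        fields = line.strip().split(None, 1)
        if len(fields) == 2:
            smiles, title = fields
            table[title] = smiles
    return table


def changed_titles(direct, roundtrip):
    """Titles present in both tables whose SMILES strings differ."""
    return [t for t in direct if t in roundtrip and direct[t] != roundtrip[t]]


# Two of the lines reported by diff above.
direct = read_smiles(["c12=NCCN=c1ncnc2 167", "N12CC[C@@H](CC1)CC2 7527"])
roundtrip = read_smiles(["C12=NCCN=C1NCNC2 167", "N12CC[C@H](CC1)CC2 7527"])
print(changed_titles(direct, roundtrip))  # prints ['167', '7527']
```

Keying on the title rather than the line number means the comparison would still work if one of the conversions dropped a molecule, which a plain line-by-line diff handles less gracefully.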

Credits: The main Open Babel developers are Geoff Hutchison, Chris Morley, Tim Vandermeersch, Craig James and myself. Code contributions from many more.

Thursday 7 October 2010

Measuring information loss in file format conversion Part II

In a comment on the previous post, Richard Hall asked about the error rate going from SDF->SMI->InChI. A very good question. Clearly, this tests the fidelity of reading and writing to and from SMILES. But there's a second less obvious effect...

But first, the results.

For the initial set of 18084 molecules, we find 108 disagreements (0.6%) with the InChIs obtained by converting using Open Babel straight from SDF->InChI. Of these, 25 have an error in the molecular formula in the InChI. These are straightforward bugs in Open Babel in determining the correct number of implicit hydrogens when reading some SMILES (Update 08/10/2010: Now fixed).

The others are more interesting disagreements: when converting from SDF -> InChI, the InChI library itself gets to decide which are the stereocenters*; when converting from SMI -> InChI, the InChI library needs to accept what Open Babel tells it. In other words, disagreements arise when the internal stereochemistry models in the two libraries disagree.

I took a look at three which appeared to disagree in different ways.

[CID 15550] Going through SMI leads to loss of stereo at a double bond in a ring system
From SDF: InChI=1S/C8H12/c1-2-4-6-8-7-5-3-1/h1-4H,5-8H2/b3-1-,4-2-
From SMI: InChI=1S/C8H12/c1-2-4-6-8-7-5-3-1/h1-4H,5-8H2

Whoops, it's those pesky double bonds in ring systems as discussed in the previous post. Might be time to look into this. The ring in question is an 8-membered ring. Is it possible to have a trans bond in an 8-membered ring?

[CID 17567] Going through SMI leads to loss of stereo at a double bond
From SDF: InChI=1S/C4H9N/c1-4(2)3-5/h3-5H,1-2H3/b5-3+
From SMI: InChI=1S/C4H9N/c1-4(2)3-5/h3-5H,1-2H3

The double bond in question is a [H]N=C bond. Open Babel doesn't think this can be a cis/trans bond; InChI thinks it can. Anyone actually know?

[CID 15456] Going through SMI leads to loss of stereo at two tetrahedral centers.
From SDF: InChI=1S/C11H22N3O3P/c1-6-17-9(15)12-18(16,13-7-10(13,2)3)14-8-11(14,4)5/h6-8H2,1-5H3,(H,12,15,16)/t13-,14-/m0/s1
From SMI: InChI=1S/C11H22N3O3P/c1-6-17-9(15)12-18(16,13-7-10(13,2)3)14-8-11(14,4)5/h6-8H2,1-5H3,(H,12,15,16)

The tetrahedral centers in question are both sp3 nitrogens where two of the (three) bonds to the nitrogen are part of the same ring. Again, there is a disagreement between Open Babel and InChI on whether such nitrogens can be tetrahedral centers.
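To sort a batch of disagreements like the three above into formula errors versus stereo disagreements, one could compare the InChI strings layer by layer. The following is a minimal Python sketch, not part of Open Babel or the InChI software. The names `split_layers` and `classify` are hypothetical, and it assumes standard InChI strings in which the formula is the first layer after the version prefix and the stereo layers begin with b, t, m or s.

```python
# Prefix letters that mark stereo layers in an InChI string:
# /b (double bond), /t (tetrahedral), /m and /s (stereo type flags).
STEREO_PREFIXES = ("b", "t", "m", "s")


def split_layers(inchi):
    """Split an InChI into (formula, non-stereo layers, stereo layers)."""
    parts = inchi.strip().split("/")
    formula = parts[1] if len(parts) > 1 else ""
    rest = parts[2:]
    stereo = [p for p in rest if p.startswith(STEREO_PREFIXES)]
    other = [p for p in rest if not p.startswith(STEREO_PREFIXES)]
    return formula, other, stereo


def classify(inchi_a, inchi_b):
    """Name the first kind of layer in which two InChIs differ."""
    if inchi_a.strip() == inchi_b.strip():
        return "same"
    formula_a, other_a, stereo_a = split_layers(inchi_a)
    formula_b, other_b, stereo_b = split_layers(inchi_b)
    if formula_a != formula_b:
        return "formula"
    if other_a != other_b:
        return "other"
    return "stereo"


# CID 17567 from the text: only the /b layer is lost via SMILES.
from_sdf = "InChI=1S/C4H9N/c1-4(2)3-5/h3-5H,1-2H3/b5-3+"
from_smi = "InChI=1S/C4H9N/c1-4(2)3-5/h3-5H,1-2H3"
print(classify(from_sdf, from_smi))  # prints "stereo"
```

This deliberately ignores finer distinctions, such as stereo sublayers inside an isotopic layer, so it is only a first-pass triage.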

The good news about these results is that we're almost down to the level where the only disagreements we see are disagreements on stereocenters rather than plain buggy bugs.

The bad news is that it's not clear what to do about these disagreements. Setting aside the "Open Babel is wrong - no, InChI is wrong!" discussion, another cheminformatics library will produce different InChIs yet again depending on how it defines stereochemical centers. If we could all agree on what constitutes a reasonable stereocenter there wouldn't be any problem. Alternatively, we could just follow whatever InChI says is a stereocenter even if we don't agree...?

* I apologise for use of the American spelling throughout. It's a symptom of preparing a paper for an ACS journal. Normal service will resume shortly.

Tuesday 5 October 2010

Measuring information loss in file format conversion

If you've followed my posts on parsing stereochemistry in SMILES, you'll have realised that every conversion between different chemical formats has the very real possibility of losing or confusing information. There are several ways to identify such problems. One way is to compare results of a particular conversion to an independent standard.

Here I'll calculate the error rate of conversion from SDF to InChI using Open Babel, compared to doing the same conversion using the official InChI binary. Note that what we are actually calculating is the error rate of conversion from SDF -> Open Babel's internal chemical model -> InChI; it's not that Open Babel just hands the InChI code the raw SDF file.

This test uses the Open Babel 2.3 development code. The test file is the first file in PubChem3D, which contains 18084 3D structures.

First, run the InChI binary (I'm using Windows):
inchi-1.exe Conformers_00000001.sdf /AuxNone 2> errors.txt

Next, convert from SDF to InChI with obabel.exe (InChI format options described here):
obabel Conformers_00000001.sdf -oinchi -xw -O ob_results.txt

Clean up the InChI output (I have Cygwin installed on Windows):
grep "^InChI=" Conformers_00000001.sdf.txt > official_results.txt

Finally, compare the results:
C:\> diff official_results.txt ob_results.txt

C:\>

Too easy, huh? Let's try the first 10 files in PubChem3D instead: 166735 molecules.

Ah...now we have something. I found 15 disagreements on the InChI. Hmmmm...but 13 of these involve molecules with isotopes of Br...one quick bug fix later (SVN r4134), I have 2 errors left: molecules 144031 and 144132. These both have multiple double bonds in ring systems, and I think there may be a difference in opinion between Open Babel and InChI over the cutoff for the size of ring in which the stereochemistry of double bonds should be considered...but that's a problem for another day.
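Disagreements like these can be picked out by molecule number with a short script. Here is a minimal Python sketch that uses only the standard library. The function names `inchi_lines` and `disagreements` are hypothetical, and it assumes both tools write exactly one InChI per molecule, in the same order as the input file.

```python
def inchi_lines(lines):
    """Keep only lines starting with 'InChI=' (the same filter as grep "^InChI=")."""
    return [line.strip() for line in lines if line.startswith("InChI=")]


def disagreements(reference, candidate):
    """Return (molecule_number, reference, candidate) for each mismatch.

    Molecule numbers are 1-based positions in the input file.
    """
    if len(reference) != len(candidate):
        raise ValueError(
            "different number of InChIs: %d vs %d" % (len(reference), len(candidate))
        )
    return [
        (number, ref, cand)
        for number, (ref, cand) in enumerate(zip(reference, candidate), start=1)
        if ref != cand
    ]
```

With the files produced in the steps above, something like `disagreements(inchi_lines(open("official_results.txt")), inchi_lines(open("ob_results.txt")))` should list each molecule on which the two programs disagree. Refusing to compare lists of different lengths is a deliberate choice: if one program fails on a molecule, every later line would otherwise be reported as a mismatch.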

So how does the current release compare to this? Not so well, not so well at all. We started reimplementing stereochemistry in Open Babel about 1.5 years ago, and it's only now we're getting such good performance. In short, if stereochemistry in InChI is important for your application, you should wait for the 2.3 release (or run the development code).