Thursday 1 May 2008

Shortened SMILES for URLs - Why a big SMILES need not mean a long face

Rajarshi has put together a RESTful web service that returns a 2D depiction of a molecule (using the CDK) given a SMILES string. However, it turns out that the SMILES string can be at most 255 characters long (this appears to be an OS or mod_python limit). That should be enough, sez you, but in a random sample of 100 PubChem molecules, 9 have SMILES strings longer than the limit (these include explicit hydrogens, I should point out).

Rajarshi and I worked out a solution: compress the SMILES with bzip2 and then base64-encode it (so that it can be used in a URL). Here are the lengths of the original SMILES strings, the bzip2-compressed strings, and finally the base64-encoded strings:
SMILES  bzip2  base64
   280    101     136
   316    113     152
   432    110     148
   282    106     144
   320     92     124
   372    115     156
   452    143     192
   326     96     128
   268     94     128
In all but one case, the final string is less than half the size of the original string (the exception, 282 to 144, is only just over half). The actual strings themselves all begin with the same few letters (don't ask me why), which means that the same RESTful URL can be used to handle both the SMILES strings and their compressed cousins. For example, sildenafil can be viewed at both this URL (containing the SMILES) and this URL (containing the encoded form).
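As a quick sanity check on the table above: base64 turns every 3 input bytes into 4 output characters (padding the last group), so the third column should follow directly from the second. A small sketch that confirms this for the nine rows; the name base64_length is just for illustration:

```python
import base64

# (bzip2 length, base64 length) pairs copied from the table above
rows = [(101, 136), (113, 152), (110, 148), (106, 144), (92, 124),
        (115, 156), (143, 192), (96, 128), (94, 128)]


def base64_length(n_bytes):
    """Length of padded base64 output for n_bytes of input."""
    return (n_bytes + 2) // 3 * 4


for compressed, encoded in rows:
    # Compare the formula against the table and against a real encoding
    # of a dummy payload of the same size.
    assert base64_length(compressed) == encoded
    assert len(base64.urlsafe_b64encode(bytes(compressed))) == encoded
```

So the base64 step costs about a third on top of the compressed size, which is why the compression has to do rather better than 50% for the final string to come in under half.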

Notes: (1) The Python functions bz2.compress() and base64.urlsafe_b64encode() are used. (2) Bzip2 was used instead of gzip or zip simply because it has an easier-to-use API. (3) A SMILES string needs to have a certain length before the procedure will actually result in any compression. In the sildenafil example, the encoded string is in fact longer than the original (108 vs. 61).
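For the curious, here is a minimal sketch of the round trip using the two functions from note (1). The function names encode_smiles and decode_smiles are made up for this example, and it is written for a current Python 3 rather than being the code behind the web service:

```python
import base64
import bz2


def encode_smiles(smiles):
    """Compress a SMILES string with bzip2, then make it URL-safe with base64."""
    compressed = bz2.compress(smiles.encode("ascii"))
    return base64.urlsafe_b64encode(compressed).decode("ascii")


def decode_smiles(token):
    """Reverse encode_smiles(): base64-decode, then decompress."""
    compressed = base64.urlsafe_b64decode(token.encode("ascii"))
    return bz2.decompress(compressed).decode("ascii")


short = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
long_chain = "C" * 300           # an artificial, highly repetitive SMILES

for smiles in (short, long_chain):
    token = encode_smiles(smiles)
    assert decode_smiles(token) == smiles
    print(len(smiles), "->", len(token))
```

The short string comes out longer than it went in, as in note (3), while the long one shrinks. The artificial chain is far more repetitive than a real molecule, so don't read anything into its ratio.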

9 comments:

Andrew Dalke said...

Try zlib instead of gzip. I've found it to be smaller, as you might remember from my essay last year.

Rajarshi said...

The reason for the constancy of the first 4 (actually it's 8 or 9) characters is that bz2 will add a constant header. So the base64 encoded version also has a constant header (i.e., the first N characters), which allows me to identify it as a base64 encoded bz2 stream.

I had initially considered gzip, but it wasn't clear to me how many characters I need to check (sometimes 2, sometimes 4, etc.). I think this is because if you use zlib directly, it does not add the header part, but I may be wrong.
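To illustrate the constant-header check described in the comment above, here is a rough sketch. Every bzip2 stream starts with the bytes "BZh", and three bytes map onto exactly four base64 characters, so the encoded form always starts with "Qlpo". This sketch checks only those four characters, whereas the real service may check more; the function name to_smiles is hypothetical, and it assumes no plain SMILES string starts with "Qlpo":

```python
import base64
import bz2

# The first three bytes of any bzip2 stream, base64-encoded ("Qlpo").
BZ2_B64_PREFIX = base64.urlsafe_b64encode(b"BZh").decode("ascii")


def to_smiles(text):
    """Return a SMILES string from either a plain or an encoded input."""
    if text.startswith(BZ2_B64_PREFIX):
        # Looks like base64-encoded bzip2: decode, then decompress.
        raw = base64.urlsafe_b64decode(text)
        return bz2.decompress(raw).decode("ascii")
    # Otherwise assume it is already a SMILES string.
    return text


plain = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
encoded = base64.urlsafe_b64encode(bz2.compress(plain.encode("ascii"))).decode("ascii")

assert to_smiles(plain) == plain
assert to_smiles(encoded) == plain
```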

Egon Willighagen said...

So, what about the atom count limit with this approach?

Noel O'Boyle said...

@Andrew: Just reread your essay at http://www.dalkescientific.com/writings/diary/archive/2007/07/26/gzip_for_molecular_similarity.html

Sounds like gzip/zlib would be a better choice. We could prepend "zz" to the string to signify an encoded SMILES. However...

@Egon: ...the overall problem with 255 chars remains. It seems that in Rajarshi's setup, the URLs map onto directories in the file system, and this causes an OS error. One way around this would be to use Django instead. I'm not aware of any size limits on Django URLs.

This is much more work of course, and given that the current system probably works 95% of the time, and there probably aren't a massive number of users, it may not be worth the time/effort trade-off.

@Rajarshi: Any chance of a link to the REST page from the front page of ChemBioGrid? I need to use Google to find it.
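Going back to the "zz" idea at the start of this comment, a minimal sketch of how it might look with zlib. Nothing here is implemented in the service: the marker, the function names and the compression level are all just for illustration, and it assumes that no plain SMILES string starts with "zz":

```python
import base64
import zlib

MARKER = "zz"  # hypothetical prefix flagging an encoded SMILES


def encode_smiles(smiles):
    """zlib-compress, base64-encode, and prepend the marker."""
    compressed = zlib.compress(smiles.encode("ascii"), 9)
    return MARKER + base64.urlsafe_b64encode(compressed).decode("ascii")


def decode_smiles(text):
    """Decode marked input; pass anything else through as a plain SMILES."""
    if text.startswith(MARKER):
        compressed = base64.urlsafe_b64decode(text[len(MARKER):])
        return zlib.decompress(compressed).decode("ascii")
    return text


smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin
assert decode_smiles(encode_smiles(smiles)) == smiles
assert decode_smiles(smiles) == smiles
```

An explicit marker means there is no need to work out how many header characters zlib leaves constant.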

Noel O'Boyle said...

The link for the essay is actually this

Anonymous said...

@rajarshi, @baoilleach You could also optionally allow the SMILES in a text/plain entity on the request.

Noel O'Boyle said...

@Jim: POST it, do you mean? I think he wanted to use the REST approach. But it might be good to have it as a fallback option for the user.

Anonymous said...

@baoilleach I meant GET. AFAIK it's valid to have a message body on any HTTP message, and I think you retain the benefits of conditional GET.

Completely agree that it would be a fallback; whether dealing with HTTP is easier than dealing with zipping and encoding depends on the library you're using.

Andrew Dalke said...

"The GET method means retrieve whatever information (in the form of an entity) is identified by the Request-URI."

I interpret that to mean that anything in the body is ignored, when it comes to determining what to return. Though the headers can affect things.