Thursday 10 July 2008

Pipeline Python - Generate a workflow

Workflow packages such as Pipeline Pilot, Taverna and KNIME allow the user to graphically create a pipeline to process molecular data. A downside of these packages is that the units of the workflow, the nodes, process data sequentially. That is, no data gets to Node 2 until Node 1 has finished processing all of it. Correction (thanks Egon): The previous line is plain incorrect. Both KNIME and Taverna2, at least, pass on partially processed data as soon as it's available.

Wouldn't it be nicer if they worked more like Unix pipes? That is, as soon as some data comes out of Node 1 it gets passed on to the next node, and so on. This would have three advantages: (1) you get the first result sooner, (2) you don't use up loads of memory storing all of the intermediate results, and (3) you can run things in parallel, e.g. Node 2 could start processing the data from Node 1 immediately, perhaps even on a different computer.

Luckily, there is a neat feature in Python called a generator that allows you to create a pipeline that processes data in parallel. Generators are functions that return a sequence of values. However, unlike just returning a list of values, they only calculate and return the next item in the sequence when requested. One reason this is useful is because the sequence of items could be very large, or even infinite in length. (For a more serious introduction, see David Beazley's talk at PyCon'08, which is the inspiration for this blog post.)
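As a quick illustration (a toy example, not part of the pipeline below), here's a generator that counts down from n. The function body doesn't run at all when the generator is created; each value is computed only when requested:

def countdown(n):
    # Execution pauses at 'yield' until the next value is requested
    while n > 0:
        yield n
        n -= 1

c = countdown(3)     # nothing has been computed yet
print c.next()       # prints 3
print c.next()       # prints 2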

Let's create a pipeline for processing an SDF file that has three nodes: (1) a filter node that looks for the word "ZINC00" in the title of the molecule, (2) a filter node for Tanimoto similarity to a target molecule, (3) an output node that returns the molecule title. (The full program is presented at the end of this post.)
# Pipeline Python!
pipeline = createpipeline((titlematches, "ZINC00"),
                          (similarto, targetmol, 0.50),
                          (moltotitle,))

# Create an input source
dataset = pybel.readfile("sdf", inputfile)

# Feed the pipeline    
results = pipeline(dataset)
The variable 'results' is a generator, so nothing actually happens until we request the values returned by the generator...
# Print out each answer as it comes
for title in results:
    print title
The titles of the molecules found will appear on the screen one by one as they are found, just like in a Unix pipe. Note how easy it is to combine nodes into a pipeline.
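To show how a new node slots in, here's a hypothetical molecular-weight filter (molwtbelow is my own example, not part of the full program below; it uses Pybel's molwt attribute). Extending the pipeline is just a matter of adding one more tuple:

# A hypothetical extra node: keep only molecules lighter than a cutoff
def molwtbelow(mols, cutoff):
    return (mol for mol in mols if mol.molwt < cutoff)

pipeline = createpipeline((titlematches, "ZINC00"),
                          (molwtbelow, 500),
                          (similarto, targetmol, 0.50),
                          (moltotitle,))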

Here's the full program:
import re
import os
import itertools

# from cinfony import pybel
import pybel

def createpipeline(*filters):
    # Each filter is a tuple: (function, additional arguments...)
    def pipeline(dataset):
        piped_data = dataset
        for filter in filters:
            piped_data = filter[0](piped_data, *filter[1:])
        return piped_data
    return pipeline

def titlematches(mols, patt):
    # Yield only those molecules whose title matches the pattern
    p = re.compile(patt)
    return (mol for mol in mols if p.search(mol.title))

def similarto(mols, target, cutoff=0.7):
    # Yield only those molecules within a Tanimoto cutoff of the target
    target_fp = target.calcfp()
    return (mol for mol in mols if (mol.calcfp() | target_fp) >= cutoff)

def moltotitle(mols):
    # Convert a stream of molecules into a stream of titles
    return (mol.title for mol in mols)

if __name__ == "__main__":
    inputfile = os.path.join("..", "face-off", "timing", "3_p0.0.sdf")
    dataset = pybel.readfile("sdf", inputfile)
    findtargetmol = createpipeline((titlematches, "ZINC00002647"),)
    targetmol = findtargetmol(dataset).next()

    # Pipeline Python!
    pipeline = createpipeline((titlematches, "ZINC00"),
                              (similarto, targetmol, 0.50),
                              (moltotitle,))

    # Create an input source
    dataset = pybel.readfile("sdf", inputfile)

    # Feed the pipeline
    results = pipeline(dataset)

    # Print out each answer as it comes through the pipeline
    for title in results:
        print title
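Because 'results' is a generator, you only pay for what you consume. If you just want the first few hits, something like itertools.islice (my addition, not part of the program above) will pull exactly that many molecules through the pipeline and leave the rest of the file unread:

# Pull just the first five titles through the pipeline;
# later molecules are never parsed or fingerprinted
for title in itertools.islice(pipeline(pybel.readfile("sdf", inputfile)), 5):
    print title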


So, if in future someone tells you that Python generators can be used to make a workflow, don't say "I never node that".

Image: Pipeline by Travis S. (CC BY-NC 2.0)

7 comments:

Egon Willighagen said...

Noel, I always said "workflow environments" are basically scripting languages.

You mention: "A downside of these packages is that the units of the workflow, the nodes, process data sequentially."

This is incorrect for KNIME, and for Taverna2 too; no idea about PP. But both KNIME and T2 pass on molecules as soon as they come out of the first node, and will not wait until all data is processed.

Noel O'Boyle said...

I stand corrected. This was my impression after playing with KNIME and after discussing Taverna with Christoph. What's all the business with the red/green lights then in KNIME?

I'll put in an update above.

Rich Apodaca said...

Noel and Egon, what kind of scripting language support do Taverna and KNIME have? Could I create protocols in those environments in Ruby or Python, for example?

A quick search found nothing encouraging.

Noel O'Boyle said...

I know that KNIME has a Jython node donated by Tripos. Presumably JRuby would be the way to go? Note that, in case you are misled, KNIME is not open source although its web site says it is.

Perhaps others can comment on Taverna...

The Travelling Bard said...

There is no scripting support in Taverna at the moment (except for Beanshell). However, there is nothing to stop someone writing a processor which uses JRuby or Jython etc. Not sure about Perl, but you might be able to get away with using the "Execute Command Line app" processor to execute perl with the script? There is also the JDK6 javax.script stuff, but Taverna uses JDK5 (we need to support legacy users on e.g. older Macs) so we can't use that yet.

Unknown said...

Noel wrote: "after discussing Taverna with Christoph".
We even talked about Taverna in a Taverna. Isn't that ironic ...

Sorry for being random ...

Anonymous said...

IIRC even Pipeline Pilot passes on molecules (and it's FAST, or at least used to be).

The part about PP that I really dug was the ability to take existing scripts (whether they were in Perl, in the CHARMM scripting language or whatever) and convert them into workflows. Would be great if Taverna had the same feature.