# Noel O'Blog

## Saturday, 21 May 2016

Thanks again to George Papadatos, Mark Forster, et al. for the invitation to present at the 2016 reboot of MIOSS (Molecular Informatics in Open Source Software). You may remember the earlier one back in 2011.

I gave a presentation on the Open Babel project. While I have been largely quiescent over the past two years, there has been much happening besides. Rather than cover features, I tried to present some interesting stats or views of the overall project: for example, the background of some of the current developers, which file formats were recently added, and so forth. A few of my slides discuss industry involvement in the project, because the MIOSS workshop is 50% industry and 50% open source developers.

For info on other presentations from the meeting, see the Lanyrd page (I'll update here when the rest of the talks become available) or search #MIOSS2016 on Twitter.

## Tuesday, 19 April 2016

### Static static code analyser

I've just been refactoring/rewriting some C++ code and noticed that I've left some static functions around that are unused. "static" in this context means functions that are only accessible within the same file (it's a good idea to put static in front of every function up until the point you need to call it from some other file). However, if nothing in the same file calls them, then they are dead code. Unfortunately, neither cppcheck nor MSVC's static code analyser will flag static functions that are never called, so I've written my own Python script to do so. It's fairly basic but worked for me.

*Update (19/04): Doh! Adding --enable=all to cppcheck will find most (but not all) of these cases.*

```python
import os
import sys
import glob

def FindStaticFunctions(fname):
    """Collect the names of functions defined as 'static' in a C++ file."""
    fns = []
    for line in open(fname):
        tmp = line.strip()
        if not tmp.startswith("static"):
            continue
        if tmp.endswith(";"):  # skip prototypes
            continue
        broken = tmp.split()
        for i, x in enumerate(broken):
            if "(" in x:
                name = x[:x.find("(")].replace("*", "")
                fns.append(name)
                break
            elif i < len(broken) - 1 and broken[i+1][0] == '(':
                fns.append(x.replace("*", ""))
                break
    return fns

def FindMissingReferences(fname, staticfns):
    """Report static functions whose names appear only once (their definition)."""
    lines = open(fname).readlines()
    for staticfn in staticfns:
        refs = 0
        for line in lines:
            if line.find(staticfn) >= 0:
                refs += 1
        if refs <= 1:
            print("(%s) %s: %d refs" % (os.path.basename(fname), staticfn, refs))

if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("You need to specify a directory")
    dirname = sys.argv[1]
    if not os.path.isdir(dirname):
        sys.exit("%s is not a directory name" % dirname)
    for cppfile in glob.glob(os.path.join(dirname, "*.cpp")):
        staticfns = FindStaticFunctions(cppfile)
        FindMissingReferences(cppfile, staticfns)
```
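To illustrate the heuristic the script relies on (this is a condensed re-implementation for illustration, with a made-up toy source, not the script itself): a static function whose name appears only once in its file - its own definition - gets reported as potentially dead.

```python
import re

# Toy C++ source: Doubler is static and called; Unused is static and never called
source = """\
static int Unused(int x) { return x * 2; }
static int Doubler(int x) { return x + x; }
int PublicApi(int x) { return Doubler(x); }
"""

# Names of functions defined as static (one-line definitions only, for the demo)
static_fns = re.findall(r"^static\s+\w+\s*\*?\s*(\w+)\s*\(", source, re.M)

for fn in static_fns:
    refs = source.count(fn)
    if refs <= 1:  # the only occurrence is the definition itself
        print("%s: %d refs" % (fn, refs))
```

Only `Unused` is reported; `Doubler` occurs twice (its definition plus the call in `PublicApi`), so it survives the `refs <= 1` test.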

## Wednesday, 13 January 2016

### Do BEDROC results correlate with AUC?

Following up on a post over at NM where I was perhaps overly harsh on AUC, it is natural to instead consider early recognition metrics such as EF, BEDROC and so forth as potentially better measures of performance in virtual screens. Here I want to pour cold water on the idea that results for AUC and BEDROC are highly correlated and so we might as well just use the more straightforward of the two (i.e. AUC).


In a 2010 review of current trends by Geppert, Vogt and Bajorath [1], they interpret the "What do we know and when do we know it?" paper by Nicholls [2] as presenting "evidence for a strong correlation between AUC and BEDROC [3], suggesting AUC as a sufficient measure for virtual screening performance." Now, Nicholls doesn't quite say that, but I can see that Figures 8 and 9 in that paper could be interpreted that way. For the Warren study [4] data shown, the Pearson R² between the two is 0.901. It is unfortunate that the correlation is only shown for the averaged data (Figure 9) - it would be nice to see the Spearman correlation for Figure 8. But either way, it's clear that they are highly correlated.

So a large AUC tends to imply a large BEDROC (and v.v.), and so forth. But the thing is, that's not quite what we are interested in. The real question is whether the relative ordering of methods obtained with AUC is the same as that obtained with BEDROC - if it is, then we might as well use AUC. If not, we'll have to exercise those little grey cells (I've been reading a lot of Agatha Christie recently) and figure out which is best. In other words, it could be that methods A and B both have a high AUC, and so have a high BEDROC, but does A>B in AUC space imply A>B in BEDROC space?

As it happens I have the results to hand (as one does) for AUC (*) and BEDROC(20) for the Riniker and Landrum dataset [5] (88 proteins, 28 fingerprints, 50 repetitions). This is a fingerprint-based virtual screen as opposed to a docking study, but the conclusions should be the same. The 4400 Spearman correlations between the AUC and BEDROC results are shown here as a histogram (bins of 0.05):

The median value of the Spearman correlation is 0.713. This is not that high (remember that the square is 0.51), indicating that while there is a moderate correlation between the orderings obtained by BEDROC and AUC, they are not highly correlated.
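For concreteness, here is how the two metrics can be computed from the ranks of the actives - a minimal sketch using the mean-rank shortcut for AUC and the closed-form BEDROC expression from Truchon and Bayly [3]; the function names and the toy rank lists are my own:

```python
import math

def auc_from_ranks(active_ranks, n_total):
    """AUC from the ranks of the actives (rank 1 = best)."""
    n = len(active_ranks)
    n_inactive = n_total - n
    # Mann-Whitney U: number of correctly ordered (active, inactive) pairs
    u = sum(n_total - r for r in active_ranks) - n * (n - 1) / 2.0
    return u / (n * n_inactive)

def bedroc(active_ranks, n_total, alpha=20.0):
    """BEDROC (Truchon and Bayly closed form), rank 1 = best."""
    n = len(active_ranks)
    ra = n / n_total
    # RIE: exponential-weighted sum over its expectation for random ranking
    s = sum(math.exp(-alpha * r / n_total) for r in active_ranks)
    rand = n * (1.0 - math.exp(-alpha)) / (n_total * (math.exp(alpha / n_total) - 1.0))
    rie = s / rand
    # Map RIE onto the [0, 1] BEDROC scale
    return (rie * ra * math.sinh(alpha / 2.0)
            / (math.cosh(alpha / 2.0) - math.cosh(alpha / 2.0 - alpha * ra))
            + 1.0 / (1.0 - math.exp(alpha * (1.0 - ra))))

early = [1, 2, 3, 4, 5]              # all actives recognised immediately
spread = [100, 300, 500, 700, 900]   # actives scattered through the list
print(auc_from_ranks(early, 1000), bedroc(early, 1000))
print(auc_from_ranks(spread, 1000), bedroc(spread, 1000))
```

With all five actives in the top five ranks, BEDROC(20) is essentially 1; spreading them evenly drops it near zero even though that ordering still has an AUC of about 0.5 - exactly the early-recognition behaviour the metric is designed to reward.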

One more nail in the AUC's coffin I hope. Grab your hammer and join me!

**References:**

[1] Current Trends in Ligand-Based Virtual Screening: Molecular Representations, Data Mining Methods, New Application Areas, and Performance Evaluation. Hanna Geppert, Martin Vogt, and Jürgen Bajorath. *J. Chem. Inf. Model.* **2010**, *50*, 205–216.

[2] What do we know and when do we know it? Anthony Nicholls. *J. Comput. Aided Mol. Des.* **2008**, *22*, 239–255.

[3] Evaluating Virtual Screening Methods: Good and Bad Metrics for the “Early Recognition” Problem. Jean-François Truchon and Christopher I. Bayly. *J. Chem. Inf. Model.* **2007**, *47*, 488–508.

[4] A critical assessment of docking programs and scoring functions. Warren GL, Andrews CW, Capelli AM, Clarke B, LaLonde J, Lambert MH, Lindvall M, Nevins N, Semus SF, Senger S, Tedesco G, Wall ID, Woolven JM, Peishoff CE, Head MS. *J. Med. Chem.* **2006**, *49*, 5912–5931.

**Footnotes:**

\* I'm lazy so I just used the mean rank of the actives instead of integrating ROC curves. You get an identical ordering (see Eq. 12 in the BEDROC paper [3] for the relationship). And yes, I did check.

[5] Open-source platform to benchmark fingerprints for ligand-based virtual screening. Sereina Riniker and Gregory A. Landrum. *J. Cheminf.* **2013**, *5*, 26.

## Wednesday, 2 December 2015

### Coin-tossing power

I've been doing some coin tossing. Well, actually it's cheminformatics, but the problem reduces to coin tossing. Given that I toss a coin 1000 times, how many heads must I observe to believe that it is a biased coin? My usual ramshackle approach would be to estimate the standard deviation of the random distribution (have my computer do some coin tossing) and then say something about being two sigmas away from the mean. I've done that just to be sure - the sigma is just under 16.

```python
import random
from collections import defaultdict

import matplotlib.pyplot as plt
import numpy as np
import scipy.optimize

random.seed(1)

# Toss a coin 1000 times, repeat 100000 times, and record the head counts
values = [0, 1]
ans = defaultdict(int)
for N in range(100000):
    tot = 0
    for i in range(1000):
        tot += random.choice(values)
    ans[tot] += 1

start, end = 350, 650
x = np.arange(start, end)
y = []
for i in range(start, end):
    if i in ans:
        y.append(ans[i])
    else:
        y.append(y[-1] if i > start else 0)

mean = 500.
sigma = 10.

def gaus(x, a, sigma):
    return a * np.exp(-(x - mean)**2 / (2 * sigma**2))

# Fit a Gaussian to the simulated distribution to estimate sigma
popt, pcov = scipy.optimize.curve_fit(gaus, x, y, p0=[1.0, sigma])
print(popt)

plt.plot(x, y, 'b+:', label='data')
plt.plot(x, gaus(x, *popt), 'ro:', label='fit')
plt.legend()
plt.show()
```

PMC points out that a slightly easier way of getting the same value would be to use an equation: variance = n(1-p)p where n = 1000 and p is 0.5. The standard deviation is then the square root of this (i.e. sqrt of 250), 15.8.

However, the proper way of handling the original question is a power calculation. Typically power calculations are associated with calculating the necessary sample size. Here I have a fixed sample size (n=1000) and I want to figure out, given that the alternative hypothesis is true (biased coin), what number of heads I must observe to be 99.9% sure (power=0.999) that it would show up as significant at the 0.1% level (alpha=0.001). The y = 0.5 below refers to the proportion of heads expected under the null hypothesis (unbiased coin). Since n=1000, we can use the normal approximation to the binomial distribution.

```python
import math
import scipy.stats

def qnorm(p):
    return scipy.stats.norm.ppf(p)

def solvequadratic(a, b, c):
    disc = math.sqrt((b**2) - (4 * (a * c)))
    return [(-b + disc) / (2. * a), (-b - disc) / (2. * a)]

if __name__ == "__main__":
    alpha = 0.001  # significance level
    gamma = 0.999  # power
    qgamma = qnorm(gamma)
    qalpha = qnorm(1. - (alpha / 2.))
    p = (qalpha + qgamma)**2
    n = 1000
    y = 0.5  # proportion of heads under the null hypothesis
    a = 2 * n + p
    b = (2 * p - 4 * n) * y - 2 * p  # note that if y is 0.5, then b = -2*n-p (i.e. -a)
    c = -2 * p * y + (2 * n + p) * y * y
    print(solvequadratic(a, b, c))
```

...and the answer? 642. Thanks to ALC for help with this, though all errors are my own.

**Notes:** *(added 04/12/2015)*

Both the power and the significance level are conditional probabilities:

- The significance level (alpha) is the probability of rejecting the null hypothesis given that it is true, ie. prob(rejecting null hypothesis | null hypothesis is correct).
- The power (gamma) is the probability of detecting a difference, if indeed the difference does exist, ie. prob(rejecting null hypothesis | null hypothesis is incorrect).
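As a sanity check on the simulation approach in the opening paragraph, a few lines of Python reproduce both numbers: sqrt(n·p·(1-p)) gives sigma ≈ 15.8, and roughly 5% of fair-coin experiments land more than two sigma from 500 (the trial count and seed here are my own choices for a quick sketch):

```python
import math
import random

random.seed(1)
n, trials = 1000, 2000
sigma = math.sqrt(n * 0.5 * 0.5)  # sd of the head count for a fair coin

# Count experiments where the head count strays more than 2 sigma from n/2
extreme = 0
for _ in range(trials):
    heads = sum(random.randint(0, 1) for _ in range(n))
    if abs(heads - n / 2) > 2 * sigma:
        extreme += 1

print("sigma = %.1f, fraction beyond 2 sigma = %.3f" % (sigma, extreme / trials))
```

The fraction comes out near the familiar ~5% two-sided tail area, consistent with treating two sigma as the rough cut-off for suspicion.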

## Thursday, 27 August 2015

### The whole of cheminformatics best practice in 15 minutes

Here's the talk I co-presented with Rajarshi at the recent Boston ACS in a session about best practices in cheminformatics. It was entitled "So I have an SD File...What do I do next?" and went something like this...

## Friday, 10 April 2015

### A reason for optimism - improvements in solar cells

Back in the day, I did some work in the area of dye-sensitised solar cells, and more recently on organic solar cells. So I like to keep an eye on how things are going. Since now and then I hear negative comments about renewable energy - "we're never going to get X% of our energy from it" - I thought I'd post this graph here as it shows that real progress continues to be made and we should be a bit more optimistic (click image for original):

This graph, "courtesy of the National Renewable Energy Laboratory, Golden, CO", contains efficiency measurements of test solar cells as measured in a standardised test by an independent test lab (e.g. the NREL in the US). Of course, there is quite a distance between an efficient test solar cell (a couple of square cm in size) and a viable product for the market, but improving efficiency is important (lifetime, processability and cost are other considerations).

It should also be noted that it may not make much sense to compare efficiencies of different cell types as the price/efficiency ratio may be much different (e.g. semiconductor-grade silicon crystals are more expensive than an organic polymer). Also, efficiency isn't even the full story, e.g. silicon solar cells are top of the pile, but if we all switched to this technology tomorrow, there would likely be a net increase in electricity usage (i.e. when you consider the energy cost of making them). Similarly any technology that relies on rare elements is not going to save the world (but may make a lot of money for some mobile phone makers).

Anyhoo, I like to look at this graph every so often to remind myself that things are improving, and that in the long run, we're going to make it. Or at least as far as efficient solar cells can take us.

For more info on the graph, check out the NREL's photovoltaic page.


## Tuesday, 7 April 2015

### See you at EuroSciPyDataCon?

I've been checking up the various Python conferences this year. Using www.pycon.org as a starting point, I see that for me (based in Cambridge, UK) the possibilities include:


- PyData Berlin: Berlin (surprise!) May 2015
- EuroPython: Bilbao July 2015
- EuroSciPy: Cambridge August 2015
- PyCon UK: Coventry Sept 2015

While it would be nice to swan around Bilbao and Berlin, I'll probably aim to attend or speak at EuroSciPy, which will be in my backyard. I'm kind of reluctant to speak though; I've never been to a conference that wasn't a straight-up scientific meeting, so I'm not quite sure what the audience would expect. Who's interested in chemistry and Python that's not already using chemistry and Python?

Leave a comment if you're going to be there too!
