Thursday 10 October 2013

QM up for testing - The Quantum Chemistry Speed Test

Ever wondered which quantum chemistry package is fastest? No? Well, you're not alone - I can't find a comparison of the speeds of different QM packages anywhere on the webs. Enter the Quantum Chemistry Speed Test...

Over a series of weeks (weeks which may be spaced months or years apart depending on the ebb and flow of life), I will carry out the same calculation using a variety of packages. The calculation will be a geometry optimisation of a small organic molecule on a single CPU, without any attempt at tuning.

And you can play along at home. I'll be making all of the input and output files available for your viewing pleasure. Commercial software is unlikely to feature in this comparison (as I don't have access to any) but don't let that stop you. Note that, for the usual licensing reason, you should avoid publishing Gaussian timings.
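As a rough sketch of how such timings might be collected - the "qm_binary" executable and input file name below are hypothetical placeholders, not any particular package's command line - a minimal wall-clock harness in Python could look like this:

    import statistics
    import subprocess
    import time

    def time_qm_job(cmd, runs=3):
        """Run a QM package from its command line and return the median wall time."""
        timings = []
        for _ in range(runs):
            start = time.perf_counter()
            subprocess.run(cmd, check=True, capture_output=True)
            timings.append(time.perf_counter() - start)
        return statistics.median(timings)

    # Placeholder binary and input file - substitute whichever package is under test.
    print(time_qm_job(["qm_binary", "molecule_opt.inp"]))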

The reason I'm interested in this is that the focus of many QM packages these days seems to be on massively parallel, super-accurate calculations. But what I'd really like (and I think most users would be with me in this) is faster speeds for standard calculations. For example, in a project with Geoff Hutchison a few years back I carried out 1000s of single-CPU calculations using Gaussian on an 8-CPU-per-node cluster. If I had instead run each calculation in parallel across a node, the overall throughput would have been much lower (see Amdahl's Law) and those CPU hours would not have stretched so far.
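To put a rough number on the Amdahl's Law point, here is a back-of-the-envelope sketch; the 80% parallelisable fraction is purely an illustrative assumption, not a measured value:

    def amdahl_speedup(parallel_fraction, n_cores):
        """Classic Amdahl's Law: speedup of a single job spread over n_cores."""
        return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_cores)

    p, n = 0.8, 8  # assumed parallelisable fraction, cores per node
    s = amdahl_speedup(p, n)
    print(f"Per-job speedup on {n} cores: {s:.2f}x")           # ~3.33x
    print(f"Node throughput vs {n} serial jobs: {s / n:.0%}")  # ~42%

Under that assumption, eight cores buy only about a 3.3x per-job speedup, so a node running one 8-core job does roughly 42% of the work of the same node running eight serial jobs.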

Maybe if there were more of these speed tests, it would encourage developers to bring more performance to single-CPU calculations. Well, probably not, but it'll be fun to find out.

Image credit: Paul Townsend on Flickr

10 comments:

Unknown said...

Please include a calculation which does the following (it is possible to do all these in one job)

1. optimisation
2. frequencies (with vibrational circular dichroism optional)
3. NMR Shift calculation.

My example of such a calculation is at this digital repository.

Noel O'Boyle said...

I have received other suggestions on things I should check (memory usage, compilers and so forth).

All of these are great ideas, but increase the amount of work involved. Since I've been thinking about this for quite some time but have always put it off, I'm going to keep the barrier as low as possible so that I actually follow through.

I do hope however that others use this test as a jumping-off point for their own comparisons. A better job can certainly be done, but I'm going to keep it simple.

Anonymous said...

The last time I checked, Gaussian, Inc. still has a clause in their license which prohibits publishing benchmarking results. This may not be the case anymore, but I know they have historically tried to prevent this from happening.

jle said...

It might be useful to develop a number of standard tests (optimization, frequencies, …) that should be expected to work in the various programs. This will create a baseline and (hopefully) encourage the authors to take them and tune them for their codes. It would also be useful if the authors or others could do a bit of profiling to indicate where the bottlenecks are limiting overall performance. This would suggest future work.

Similarly, performance with various compilers/compiler options would also be useful. When I last did this in '07 I saw 20-40% gains switching from GCC to the Intel compilers. I did not investigate hand-tuned libraries (from compiler vendors or online), but they seem to have an impact in QM and should be used. I assume the disk vs. SSD vs. in-core methods argument is a settled matter, along with disk striping/partition schemes.

Finally, it would be quite interesting to see whether the codes have adapted to the shift from vector to cache architectures over the last 10-20 years. I guess this is also a test of the compilers' ability to generate code to take advantage of the change. It would be useful to know whether it still pays to hand-tune compiler options or whether they're smart enough to handle "just compile everything with -O3".

Maybe some of the commercial vendors might want to get involved for marketing reasons?

Noel O'Boyle said...

@Anonymous: I refer to this problem with Gaussian in the blog post.

@jle: The standard tests are a great idea. It's what I think Henry was suggesting also.

There's definitely an academic paper in here for someone willing to do this comparison properly, someone with the resources and time to investigate all aspects of the optimisation matrix. As I said, I'm just going to skim off a couple of blog posts, and even there the effort required is an investment of several hours per program.

Regarding commercial software, if someone contacts me and gives me no-strings-attached access, I'll be happy to put it through its paces.

Geoff Hutchison said...

I know several commercial vendors who would give you trial licenses. NB, I'm not affiliated with any of these companies, but:

Q-Chem:
http://www.q-chem.com/qchem-website/demo4.html

ADF:
https://www.scm.com/trial

Turbomole:
http://www.cosmologic.de/index.php?cosName=main_qChemistry

In any case, I think this is a great idea and hopefully a fairly successful first go will convince other vendors to participate.

Alexander Genaev said...

I have done a few tests of some DFT programs and GAMESS for comparison. The results are on a Russian forum but the summary table is in English.

Noel O'Boyle said...

Nice.

Alexander, that's exactly the sort of information I am interested in. That's the first time I've seen a list of timing comparisons.

Jeff Hammond said...

If you're going to test single core performance, you should run one job on every core at the same time. When you run a job on one core but let the others idle, you are monopolizing the memory controller and overestimating the throughput potential.

A more reasonable test is to run one job per node and allow the code to use as many cores as it can. This is far more representative of high throughput usage.
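As a sketch of what the one-job-per-core setup might look like on Linux - the "qm_binary" and input file names are hypothetical placeholders - each serial job can be pinned to its own core:

    import os
    import subprocess

    def run_pinned(cmd, core):
        """Start one serial job and restrict it to a single core (Linux only)."""
        proc = subprocess.Popen(cmd)
        os.sched_setaffinity(proc.pid, {core})
        return proc

    # Fill all 8 cores of a node with independent serial jobs, so they
    # contend for the memory controller as they would in production use.
    jobs = [run_pinned(["qm_binary", f"job{i}.inp"], i) for i in range(8)]
    for job in jobs:
        job.wait()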

Furthermore, some processors cannot fully utilize a core when running in serial. For example, Blue Gene/Q and Intel Xeon Phi require 2-4 threads per core to saturate instruction throughput. Do you wish to penalize such architectures or does your definition of "single core" allow for multithreaded use of a single physical core via the four virtual cores (which is how the OS sees SMT/HT)?

Finally, your methodology excludes codes like MADNESS that literally cannot run in serial because it spawns Pthreads for task scheduling. I suppose one could hack the code to pin all the threads to a single core, but that's abusing the code and leads to inaccurate results.

Noel O'Boyle said...

Good points all, but see newer posts for the conclusion to the series.