
October 04, 2006


John Irwin

Barry, we're already all way too busy, but this is a noble cause that would surely benefit us all. Running a community-based docking comparison is as much a sociological problem as a scientific one: to nucleate a community-supported effort that will set the ground rules, run the tests, and interpret the results in a way that everyone can agree is fair. A wiki-based forum would allow for peer review of every aspect of the comparison, from compiling the data, to how the programs are run, to how the results are judged. Everyone can be heard and all reasonable objections aired, offering an outcome that will be useful to all of us. One downside: unmoderated, such an exercise could quickly degenerate into chaos, so firm moderation is a must. An interesting experiment!

Joerg Kurt Wegner

Dear Barry,

1. There is absolutely no question about the need for proper benchmarking. To be truly competitive, I agree that it is necessary to make the data, and perhaps clear technical workflow descriptions, available.

2. I would even go a step further than just opening a 'docking' challenge. The missing 'statistical significance' mentioned above is my biggest worry, and the question is how we can most efficiently move forward in optimizing things. Docking is a summation of several steps, and maybe too many steps at the same time. So, I would rather recommend splitting all parts of the docking process into bits and pieces, then identifying the parts with the highest failure risk and focusing on them. The process can be chopped at least into:
2.1. atom typing
2.2. ligand preparation (ionic forms, tautomers, ...)
2.3. ligand conformer generation
2.4. protein preparation (protonation, residue orientation, ...)
2.5. ligand placement (top-down, bottom-up, fragment based, group based, ...)
2.6. energy calculation (force field type, grid type, algorithm, ...)
2.7. constraint handling (global and local optimization strategy? process to escape local minima?)
2.8. scoring (single-objective, multi-objective, consensus, ...)
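The modular decomposition Joerg proposes can be sketched in code: each stage behind a common interface, so "expert" implementations could be swapped and benchmarked independently. This is a minimal illustration, not any vendor's actual API; all names and the toy stand-in stages are hypothetical.

```python
# Sketch: each docking stage as a swappable module with a common interface.
from typing import Callable, Dict, List

# A stage is a function from a state dict to an updated state dict.
Stage = Callable[[Dict], Dict]

def make_pipeline(stages: List[Stage]) -> Stage:
    """Compose independent stages (atom typing, ligand prep, ...) in order."""
    def run(state: Dict) -> Dict:
        for stage in stages:
            state = stage(state)
        return state
    return run

# Toy stand-ins for two of the steps listed above.
def atom_typing(state: Dict) -> Dict:
    state["atom_types"] = ["C.3"] * state.get("n_atoms", 0)
    return state

def scoring(state: Dict) -> Dict:
    # Placeholder score: one unit per typed atom.
    state["score"] = float(len(state.get("atom_types", [])))
    return state

pipeline = make_pipeline([atom_typing, scoring])
result = pipeline({"n_atoms": 5})
print(result["score"])  # 5.0
```

With such an interface, question 4 below (can suppliers' expert modules be pipelined?) becomes a matter of agreeing on the shared state passed between stages.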

3. I would like to identify the best program on the market for each of those steps, since I do not believe that there is one single program that is equally good on all 'targets'; that would contradict the no-free-lunch theorem.

4. Then the next question for me is: even if I know that, do software suppliers support pipelining single expert modules? If not, why not, and what can be done to change that?

5. Finally, if this pipelining of expert modules did hypothetically exist, would there be a method to predict which modules should be combined for which target? If not, what is needed to develop this kind of prediction method?

Very kind regards, Joerg Kurt Wegner

Christoph Helma

Dear Barry,

I think that the availability of high-quality public datasets is crucial
for the comparison of existing algorithms/programs and for the
identification of methodological problems (which may lead to new
developments that provide real improvements). Facilities to comment on
and discuss models and (maybe even more importantly) individual
predictions would certainly help with the analysis of current shortcomings
and the development of new ideas and algorithms.

Speaking from a (Q)SAR perspective (I am not a docking expert), I would
keep in mind that there is always an (intentional or unintentional)
temptation to overfit a particular test set by tuning parameters until
the model gives good results just by chance. Keeping activity values
secret would help in this respect, but it prevents the analysis of poor
predictions. A pragmatic solution could be to provide multiple test sets
(having test data for 40 targets would go exactly in this direction)
and to think carefully about procedures that ensure that none of the
test set information has been used for model development (maybe more
important for (Q)SAR than for docking techniques).
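The multi-test-set idea can be made concrete: a method that looks good on one set by luck will rarely look good across many independent sets, so reporting the mean and spread over all of them is more honest than a single number. A minimal sketch, with entirely synthetic data and an invented placeholder metric:

```python
# Sketch: judge a method on many independent test sets, not one,
# so a lucky result on a single set is not mistaken for real skill.
import random
import statistics

random.seed(0)

def score_on_testset(testset):
    """Placeholder metric: fraction of actives ranked in the top half."""
    ranked = sorted(testset, key=lambda m: m["predicted"], reverse=True)
    top = ranked[: len(ranked) // 2]
    actives = sum(1 for m in testset if m["active"]) or 1
    return sum(1 for m in top if m["active"]) / actives

# Forty small synthetic "targets", echoing the 40-target suggestion above.
testsets = [
    [{"active": random.random() < 0.1, "predicted": random.random()}
     for _ in range(100)]
    for _ in range(40)
]

scores = [score_on_testset(ts) for ts in testsets]
print(f"mean: {statistics.mean(scores):.2f}  "
      f"spread: {statistics.stdev(scores):.2f}")
```

Since the predictions here are random, the scores hover around chance level; a real method would need to beat that consistently across the sets, not just on one.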

It is also important to remember that validation results are only valid
for the validation dataset and cannot be generalized to real world
applications (e.g. "our in-house library", "drug-like molecules", the
"chemical universe") unless you have drawn a representative sample.
What can be generalized are the results within the applicability domain
(AD) of the model (a forthcoming paper will provide some empirical
evidence). It is therefore important to provide correct AD definitions
for the involved algorithms (I suspect that most of the algorithms for
the docking steps mentioned by Joerg have limited applicability domains)
and to consider only predictions within the applicability domain for
validation purposes. If you are interested in some concrete examples you
can visit the validation pages at
http://www.predictive-toxicology.org/lazar/ (comments are very welcome).
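The AD-restricted validation Christoph describes can be sketched simply: predictions for queries too far from the training data are excluded from scoring rather than counted as failures. The distance-based AD below is an assumption for illustration only, not the lazar method itself.

```python
# Sketch: validate only within a simple distance-based applicability domain.
def within_ad(query, training_set, radius=1.0):
    """Query is in-domain if it has at least one training neighbour nearby."""
    return any(abs(query - t) <= radius for t in training_set)

training = [0.0, 1.0, 2.0, 3.0]     # 1-D toy descriptor values
queries = [0.5, 2.5, 9.0]           # 9.0 lies outside the AD

in_domain = [q for q in queries if within_ad(q, training)]
print(in_domain)  # [0.5, 2.5]
```

Only the in-domain predictions would then enter the validation statistics; the out-of-domain query is reported as "no reliable prediction" rather than as an error.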

Finally I would suggest not to reinvent the wheel, but to use and/or
collaborate with existing resources like PubChem, DSSTox, ChemDB, ...

Best regards,

Jean-Claude Bradley

Your post on community approaches to docking is of great interest to us. We are carrying out an open source/open notebook science project involving the synthesis of diketopiperazines as new anti-malarial agents. We have started to use docking software to plan our next synthetic targets. However, because our expertise does not lie in docking we would appreciate feedback from the docking community as we make this work public. Here is where we stand:

Sebastian Rohrer

Hi all,

The topic is indeed of great relevance.
Although everybody reports great enrichment factors for their own tools, it is clear that without standardized test datasets it is impossible to make an objective comparison between different VS approaches.
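The enrichment factors Sebastian mentions are a standard VS metric: the fraction of actives recovered in the top x% of a ranked list, divided by x%. A minimal sketch of the calculation, with an invented ranked list for illustration:

```python
# Sketch: enrichment factor at a given fraction of a ranked library.
def enrichment_factor(active_ranks, n_total, n_actives, top_frac=0.05):
    """active_ranks: 1-based ranks at which the known actives appear."""
    cutoff = max(1, int(n_total * top_frac))
    hits = sum(1 for r in active_ranks if r <= cutoff)
    return (hits / n_actives) / top_frac

# 10 actives in a library of 1000; 4 of them rank in the top 50 (5%).
ef = enrichment_factor([3, 10, 25, 48, 120, 300, 450, 600, 800, 950],
                       n_total=1000, n_actives=10)
print(ef)  # 8.0
```

Random ranking gives an expected EF of 1.0, which is exactly why a shared benchmark matters: an EF of 8 is only meaningful relative to an agreed dataset and decoy set.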

A community based project for compiling standardized benchmark datasets would be of great value and I would be happy to contribute.

However, with all the focus on docking, I want to remind you that there have been major successes in VS using ligand-based approaches. So let's not forget ligand-based VS in these efforts!


Peter Willett

Dear Barry,

I have no doubt that such a community-wide comparison would facilitate the development of the field by highlighting approaches of general applicability. I base this statement on my experience of a domain where the shared test-bed approach has really spurred R&D, specifically the field of text search engines, where the annual Text REtrieval Conference (TREC) organised by NIST has long played a central role in the development of the subject (see http://trec.nist.gov/). Each year, TREC provides a large dataset on which participants in the competition can carry out searches for pre-defined queries for which the relevant documents (i.e., the true positives) are withheld from the participants. The searches are then evaluated using common performance metrics, and there is an annual conference to discuss the results. A similar common dataset/evaluation procedure was used for several years in the natural language processing community – the MUC conferences (see http://www.cs.mu.oz.au/acl/C/C96/C96-1079.pdf) – and common datasets play an important role in QSAR and ligand-based virtual screening (the steroid and MDDR datasets, although in these cases the positives are known).
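The TREC-style setup Peter describes is easy to sketch: participants submit a ranked list, and the organisers score it against a withheld set of true positives with a common metric (average precision is used here as one standard choice; the document IDs are invented).

```python
# Sketch: score a submitted ranking against withheld true positives.
def average_precision(ranked_ids, relevant):
    """Mean of precision values at each rank where a relevant item appears."""
    hits, precisions = 0, []
    for i, doc in enumerate(ranked_ids, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / len(relevant) if relevant else 0.0

# The relevant set stays with the organisers; only the score is returned.
submission = ["d3", "d1", "d7", "d2", "d5"]
withheld_positives = {"d1", "d2"}
print(round(average_precision(submission, withheld_positives), 2))  # 0.5
```

Because the positives never leave the organisers, participants cannot tune against them, which is the same concern Christoph raises about keeping activity values secret in the docking setting.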

Peter Willett



