Scientific Scope¶

BenchAudit is designed for scientific auditing of molecular property and drug-target interaction (DTI) benchmarks. The focus is dataset quality, similarity structure, and leakage risk rather than deployment workflows.

Research Questions¶

The implemented analyses target the following questions:

How much exact overlap exists between training/validation and test data?
How similar are held-out compounds to training compounds?
Do identical molecules receive inconsistent labels across splits?
How often do highly similar molecules show strong label disagreement (activity cliffs)?
For DTI datasets, do targets leak across splits at sequence or structure level?

Benchmark Families¶

BenchAudit supports four modalities:

tabular: local CSV/TSV/Parquet benchmark files.
tdc: Therapeutics Data Commons datasets via pytdc.
polaris: Polaris benchmarks via polaris-lib.
dti: ligand-target datasets with SMILES and amino-acid sequences.

The PDBbind v2016 and v2020 DTI regression splits included in the example configs are fixed splits from Koh et al. (Nature Machine Intelligence, 2024; https://doi.org/10.1038/s42256-024-00847-1) and the associated PSICHIC repository (https://github.com/huankoh/PSICHIC).

Primary Outputs¶

Each run produces standardized artifacts intended for analysis and reproducibility:

summary.json: top-level hygiene, similarity, conflict, and cliff statistics.
records.csv: row-level standardized records used by analysis.
conflicts.jsonl: detailed conflict events.
cliffs.jsonl: detailed activity-cliff events.
sequence_alignments.jsonl / structure_alignments.jsonl (DTI when available).
performance.json when baseline benchmarking is enabled.

Terminology and Criteria¶

BenchAudit uses explicit criteria:

Duplicate: repeated cleaned SMILES (or repeated normalized target sequence in DTI).
Contamination: shared entities across train/valid and test.
Similar pair: pair that passes the configured consensus similarity threshold.
Conflict: classification labels differ for identical cleaned SMILES (or for DTI cross-split pair checks), while regression conflicts use a 3-sigma threshold estimated from train/valid labels.
Activity cliff: similar molecules with divergent labels under task-specific rules.

The similarity consensus combines molecular fingerprint similarity, scaffold fingerprint similarity, and normalized SMILES string similarity.