Pipeline Overview

BenchAudit runs a configurable audit pipeline over molecular property and drug-target interaction (DTI) benchmarks.

Execution flow

run.py loads one or more YAML configs.
utils.build_loader() selects a loader (tabular, TDC, Polaris, or DTI).
Loaders return standardized split dataframes.
If info.clean_benchmark is enabled, benchmark cleaning removes invalid molecules, label-conflicting samples, and exact contaminants from non-reference splits.
utils.build_analyzer() selects a SMILES or DTI analyzer.
The analyzer computes hygiene, similarity, label-conflict, and activity-cliff diagnostics.
utils.ResultWriter persists artifacts to a run directory.
Optional baseline benchmarking writes performance.json.
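
The ordering of the steps above, and one plausible reading of the cleaning step, can be sketched in plain Python. This is a minimal illustration, not BenchAudit's code: `clean_splits`, `run_pipeline`, the row layout, the `"train"` reference split name, and the nested `info` / `clean_benchmark` config keys are all assumptions made for the example. It also assumes that invalid and label-conflicting samples are dropped from every split, while exact contaminants are dropped only from non-reference splits.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]        # one sample, e.g. {"smiles": "CCO", "label": 1}
Splits = Dict[str, List[Row]]  # split name -> rows


def clean_splits(splits: Splits, is_valid: Callable[[str], bool],
                 reference: str = "train") -> Splits:
    """Toy cleaning step (assumed semantics, hypothetical helper)."""
    # 1) Drop rows whose molecule fails the caller-supplied validity check.
    valid = {name: [r for r in rows if is_valid(r["smiles"])]
             for name, rows in splits.items()}

    # 2) Collect molecules that carry more than one distinct label anywhere.
    labels: Dict[object, set] = {}
    for rows in valid.values():
        for r in rows:
            labels.setdefault(r["smiles"], set()).add(r["label"])
    conflicting = {s for s, seen in labels.items() if len(seen) > 1}

    # 3) Molecules in the reference split count as contaminants elsewhere.
    ref_smiles = {r["smiles"] for r in valid.get(reference, [])}

    cleaned: Splits = {}
    for name, rows in valid.items():
        kept = [r for r in rows if r["smiles"] not in conflicting]
        if name != reference:
            kept = [r for r in kept if r["smiles"] not in ref_smiles]
        cleaned[name] = kept
    return cleaned


def run_pipeline(config: dict, splits: Splits,
                 is_valid: Callable[[str], bool],
                 analyze: Callable[[Splits], object]) -> object:
    """Order of operations only: optional cleaning, then analysis."""
    if config.get("info", {}).get("clean_benchmark", False):
        splits = clean_splits(splits, is_valid)
    return analyze(splits)
```

Because cleaning runs before the analyzer, the diagnostics and any baseline numbers describe the curated benchmark rather than the raw one whenever the flag is on.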

Supported modalities

tabular: CSV/TSV/Parquet inputs with configurable column mapping
tdc: Therapeutics Data Commons via pytdc
polaris: Polaris benchmarks via polaris-lib
dti: tabular DTI data with ligand SMILES and target sequences
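
To illustrate how a modality name might select a loader, and what "configurable column mapping" could mean for tabular inputs, here is a small sketch. It is not the behavior of `utils.build_loader()`: `pick_loader`, `load_tabular`, the `LOADERS` registry, and the standardized column names `smiles` and `label` are hypothetical names chosen for the example, and only the tabular case is implemented.

```python
import csv
import io
from typing import Callable, Dict, List

Row = Dict[str, str]


def load_tabular(text: str, column_map: Dict[str, str],
                 delimiter: str = ",") -> List[Row]:
    """Read delimited text and rename source columns to standard names.

    column_map maps standard name -> source column name (assumed convention).
    """
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [{std: row[src] for std, src in column_map.items()}
            for row in reader]


# Hypothetical registry: modality name -> loader callable.
LOADERS: Dict[str, Callable[..., List[Row]]] = {"tabular": load_tabular}


def pick_loader(modality: str) -> Callable[..., List[Row]]:
    """Look up a loader by modality and fail loudly on unknown names."""
    try:
        return LOADERS[modality]
    except KeyError:
        raise ValueError(
            f"unsupported modality {modality!r}; expected one of {sorted(LOADERS)}"
        ) from None


raw = "mol,y\nCCO,1\nCCN,0\n"
rows = pick_loader("tabular")(raw, {"smiles": "mol", "label": "y"})
```

Keeping the mapping in configuration rather than code means a dataset with differently named columns needs a config change only, which is the property the standardized split dataframes rely on.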

Core diagnostics

SMILES / molecular analyses:

split hygiene and contamination
nearest-neighbor similarity summaries
label conflicts on identical cleaned SMILES
activity cliffs on similar molecules with divergent labels
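
The similarity and activity-cliff diagnostics can be illustrated with a small, library-free sketch. It assumes each molecule is already represented as a set of integer fingerprint bits and uses Tanimoto similarity. The function names, the thresholds, and the numeric-label assumption are illustrative choices, not BenchAudit's actual criteria (see Scientific Scope for those).

```python
from typing import Dict, List, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def nearest_neighbor_similarity(query: Dict[str, Set[int]],
                                reference: Dict[str, Set[int]]) -> Dict[str, float]:
    """For each query molecule, its highest similarity to any reference molecule."""
    return {
        name: max((tanimoto(fp, ref) for ref in reference.values()), default=0.0)
        for name, fp in query.items()
    }


def activity_cliffs(fps: Dict[str, Set[int]], labels: Dict[str, float],
                    sim_threshold: float = 0.7,
                    min_label_gap: float = 1.0) -> List[Tuple[str, str, float]]:
    """Pairs that are structurally similar but differ strongly in label.

    Both thresholds are hypothetical defaults for this sketch.
    """
    names = sorted(fps)
    cliffs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            sim = tanimoto(fps[a], fps[b])
            if sim >= sim_threshold and abs(labels[a] - labels[b]) >= min_label_gap:
                cliffs.append((a, b, sim))
    return cliffs
```

The pairwise loop is quadratic in the number of molecules, which is fine for an illustration but would need indexing or batching on large benchmarks.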

DTI-specific additions:

target-sequence overlap and duplication summaries
sequence-alignment nearest-neighbor diagnostics
optional Foldseek structure-level leakage checks
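
As a rough illustration of the first two DTI additions, the sketch below summarizes exact target-sequence overlap between two splits and finds the closest training sequence for a query. The function names and summary keys are hypothetical. `difflib.SequenceMatcher` is used only as a crude standard-library stand-in for a real sequence aligner, so its scores are not alignment identities.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import Dict, Iterable, Optional, Sequence, Tuple


def sequence_overlap(train_seqs: Sequence[str],
                     test_seqs: Sequence[str]) -> Dict[str, float]:
    """Exact-match overlap and duplication summary for target sequences."""
    train_set, test_set = set(train_seqs), set(test_seqs)
    shared = train_set & test_set
    return {
        "n_shared_targets": len(shared),
        "frac_test_targets_seen_in_train":
            len(shared) / len(test_set) if test_set else 0.0,
        # Rows beyond the first occurrence of each distinct test sequence.
        "n_duplicate_test_rows": sum(c - 1 for c in Counter(test_seqs).values()),
    }


def nearest_train_sequence(query: str,
                           train_seqs: Iterable[str]) -> Tuple[Optional[str], float]:
    """Closest training sequence by SequenceMatcher ratio (not a true alignment)."""
    best_seq, best_score = None, 0.0
    for seq in train_seqs:
        # autojunk=False disables the popularity heuristic for long sequences.
        score = SequenceMatcher(None, query, seq, autojunk=False).ratio()
        if score > best_score:
            best_seq, best_score = seq, score
    return best_seq, best_score
```

Exact overlap catches identical targets only. Near-identical sequences, or dissimilar sequences with similar structures, are the cases the alignment-based and Foldseek-based checks are meant to cover.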

Design notes

The code path is intentionally config-driven and dataset-agnostic:

column inference with explicit override hooks
optional cleaning / canonicalization
opt-in benchmark curation before analysis and baselines
shared output schema for automated downstream analysis
deterministic output directory resolution for CI and reproducibility
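
One common way to obtain deterministic output directories is to derive the path from a hash of the canonicalized config. The sketch below shows that idea under stated assumptions: `resolve_run_dir`, the directory layout, and the 12-character digest are hypothetical, and BenchAudit's actual resolution scheme may differ.

```python
import hashlib
import json
from pathlib import Path


def resolve_run_dir(root: str, benchmark_name: str, config: dict) -> Path:
    """Map (benchmark, config) to a stable directory path.

    The config is serialized with sorted keys so that key order in the
    YAML file does not change the resulting directory.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return Path(root) / benchmark_name / digest


run_a = resolve_run_dir("runs", "demo", {"type": "tabular", "info": {"clean_benchmark": True}})
run_b = resolve_run_dir("runs", "demo", {"info": {"clean_benchmark": True}, "type": "tabular"})
run_c = resolve_run_dir("runs", "demo", {"type": "tabular", "info": {"clean_benchmark": False}})
```

With this scheme, rerunning the same config in CI writes to the same location, while any config change lands in a fresh directory instead of overwriting earlier artifacts.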

See also

Scientific Scope for definitions, criteria, and benchmark scope.
Methods for implementation-focused methodology details.