Pipeline Overview
BenchAudit runs a configurable audit pipeline over molecular property and drug-target interaction (DTI) benchmarks.
Execution flow
run.py loads one or more YAML configs.
utils.build_loader() selects a loader (tabular, TDC, Polaris, or DTI).
Loaders return standardized split dataframes.
If info.clean_benchmark is enabled, benchmark cleaning removes invalid molecules, label-conflicting samples, and exact contaminants from non-reference splits.
utils.build_analyzer() selects a SMILES or DTI analyzer.
The analyzer computes hygiene, similarity, conflict, and cliff diagnostics.
utils.ResultWriter persists artifacts to a run directory.
Optional baseline benchmarking writes performance.json.
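The cleaning step in the flow above can be illustrated with a minimal, standard-library sketch. It is not BenchAudit's implementation: the function name clean_benchmark, the dict-of-lists data layout, the is_valid callback, and the choice of "train" as the reference split are all assumptions made for illustration. It also assumes that label-conflicting samples are dropped from every split, while exact contaminants are dropped only from non-reference splits.

```python
from collections import defaultdict


def clean_benchmark(splits, is_valid, reference="train"):
    """Toy cleaning over {split_name: [(smiles, label), ...]} (hypothetical layout)."""
    # 1. Drop molecules that fail the validity check.
    valid = {
        name: [(s, y) for s, y in rows if is_valid(s)]
        for name, rows in splits.items()
    }

    # 2. Collect every label seen per SMILES across all splits.
    labels = defaultdict(set)
    for rows in valid.values():
        for s, y in rows:
            labels[s].add(y)
    conflicting = {s for s, ys in labels.items() if len(ys) > 1}

    # 3. SMILES present in the reference split count as contaminants elsewhere.
    reference_smiles = {s for s, _ in valid.get(reference, [])}

    cleaned = {}
    for name, rows in valid.items():
        kept = [(s, y) for s, y in rows if s not in conflicting]
        if name != reference:
            kept = [(s, y) for s, y in kept if s not in reference_smiles]
        cleaned[name] = kept
    return cleaned


# Stand-in validity check: a real pipeline would parse the SMILES instead.
example_splits = {
    "train": [("CCO", 1), ("CCN", 0), ("", 1)],
    "test": [("CCO", 1), ("CCN", 1), ("c1ccccc1", 0)],
}
cleaned = clean_benchmark(example_splits, is_valid=lambda s: bool(s))
```

In this toy input the empty SMILES is dropped as invalid, CCN is dropped everywhere because its labels disagree, and CCO is dropped from the test split because it also appears in the reference split.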
Supported modalities
tabular: CSV/TSV/Parquet inputs with configurable column mapping
tdc: Therapeutics Data Commons via pytdc
polaris: Polaris benchmarks via polaris-lib
dti: tabular DTI data with ligand SMILES + target sequences
Core diagnostics
SMILES / molecular analyses:
split hygiene and contamination
nearest-neighbor similarity summaries
label conflicts on identical cleaned SMILES
activity cliffs on similar molecules with divergent labels
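As a rough illustration of the label-conflict and activity-cliff diagnostics listed above, the sketch below uses plain Python sets as stand-ins for molecular fingerprints. The function names, the Tanimoto threshold of 0.7, and the minimum label gap of 1.0 are assumptions chosen for the example, not values taken from BenchAudit.

```python
from collections import defaultdict
from itertools import combinations


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def label_conflicts(records):
    """Return SMILES strings that appear with more than one distinct label."""
    seen = defaultdict(set)
    for smiles, label in records:
        seen[smiles].add(label)
    return sorted(s for s, ys in seen.items() if len(ys) > 1)


def activity_cliffs(fingerprints, labels, sim_threshold=0.7, min_gap=1.0):
    """Return (id_a, id_b) pairs that are similar but differ strongly in label."""
    cliffs = []
    for a, b in combinations(sorted(fingerprints), 2):
        similar = tanimoto(fingerprints[a], fingerprints[b]) >= sim_threshold
        divergent = abs(labels[a] - labels[b]) >= min_gap
        if similar and divergent:
            cliffs.append((a, b))
    return cliffs


# Toy data: m1 and m2 share most bits but have very different labels.
toy_fingerprints = {"m1": {1, 2, 3, 4}, "m2": {1, 2, 3, 4, 5}, "m3": {7, 8, 9}}
toy_labels = {"m1": 5.0, "m2": 7.5, "m3": 5.1}

conflicts = label_conflicts([("CCO", 1), ("CCO", 0), ("CCN", 1)])
cliffs = activity_cliffs(toy_fingerprints, toy_labels)
```

The pairwise loop is quadratic in the number of molecules, which is fine for a sketch but would likely need a nearest-neighbor index or batching on a full benchmark.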
DTI-specific additions:
target-sequence overlap and duplication summaries
sequence-alignment nearest-neighbor diagnostics
optional Foldseek structure-level leakage checks
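A target-sequence overlap and duplication summary of the kind listed above could look roughly like the following. This is a hedged sketch that only covers exact string matches between two splits; the function name and the returned keys are hypothetical, and alignment-based or Foldseek-based checks are not shown.

```python
from collections import Counter


def sequence_overlap_summary(train_targets, test_targets):
    """Summarize exact target-sequence overlap between two splits."""
    train_unique = set(train_targets)
    test_counts = Counter(test_targets)
    shared = train_unique & set(test_counts)

    # Number of test rows whose target sequence also occurs in train.
    rows_with_seen_target = sum(test_counts[s] for s in shared)
    return {
        "n_test_unique": len(test_counts),
        "n_shared": len(shared),
        "frac_test_rows_with_seen_target": (
            rows_with_seen_target / len(test_targets) if test_targets else 0.0
        ),
        # Sequences that occur more than once inside the test split.
        "n_duplicated_in_test": sum(1 for c in test_counts.values() if c > 1),
    }


# Toy sequences, far shorter than real protein targets.
summary = sequence_overlap_summary(
    train_targets=["MKT", "MVL", "MKT"],
    test_targets=["MKT", "MAA", "MAA", "MGG"],
)
```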
Design notes
The code path is intentionally config-driven and dataset-agnostic:
column inference with explicit override hooks
optional cleaning / canonicalization
opt-in benchmark curation before analysis and baselines
shared output schema for automated downstream analysis
deterministic output directory resolution for CI and reproducibility
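One common way to get deterministic output directory resolution is to derive the directory name from a hash of the configuration. The sketch below shows that idea only; how BenchAudit actually resolves run directories is not stated here, and resolve_run_dir, the "runs" root, and the "name" config key are hypothetical.

```python
import hashlib
import json
from pathlib import Path


def resolve_run_dir(config, root="runs"):
    """Map a config dict to a stable directory path (hypothetical scheme)."""
    # Serialize with sorted keys so key order does not change the hash.
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    name = config.get("name", "run")
    return Path(root) / f"{name}-{digest}"


# The same settings in a different key order resolve to the same directory.
dir_a = resolve_run_dir({"name": "demo", "info": {"clean_benchmark": True}})
dir_b = resolve_run_dir({"info": {"clean_benchmark": True}, "name": "demo"})
dir_c = resolve_run_dir({"name": "demo", "info": {"clean_benchmark": False}})
```

Because the path depends only on the config contents, a CI job that reruns the same config writes to the same location, while any changed setting yields a different directory.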
See also
Scientific Scope for definitions, criteria, and benchmark scope.
Methods for implementation-focused methodology details.