Pipeline Overview
=================

BenchAudit runs a configurable audit pipeline over molecular property and DTI benchmarks.

Execution flow
--------------

1. ``run.py`` loads one or more YAML configs.
2. ``utils.build_loader()`` selects a loader (tabular, TDC, Polaris, or DTI).
3. ``utils.build_analyzer()`` selects a SMILES or DTI analyzer.
4. The analyzer computes hygiene, similarity, conflict, and cliff diagnostics.
5. ``utils.ResultWriter`` persists artifacts to a run directory.
6. Optional baseline benchmarking writes ``performance.json``.

Supported modalities
--------------------

* ``tabular``: CSV/TSV/Parquet inputs with configurable column mapping
* ``tdc``: Therapeutics Data Commons via ``pytdc``
* ``polaris``: Polaris benchmarks via ``polaris-lib``
* ``dti``: tabular DTI data with ligand SMILES + target sequences

Core diagnostics
----------------

SMILES / molecular analyses:

* split hygiene and contamination
* nearest-neighbor similarity summaries
* label conflicts on identical cleaned SMILES
* activity cliffs on similar molecules with divergent labels

DTI-specific additions:

* target-sequence overlap and duplication summaries
* sequence-alignment nearest-neighbor diagnostics
* optional Foldseek structure-level leakage checks

Design notes
------------

The code path is intentionally config-driven and dataset-agnostic:

* column inference with explicit override hooks
* optional cleaning / canonicalization
* shared output schema for automated downstream analysis
* deterministic output directory resolution for CI and reproducibility

See also
--------

* :doc:`scientific_scope` for definitions, criteria, and benchmark scope.
* :doc:`methods` for implementation-focused methodology details.
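
Illustrative sketch
-------------------

To make two of the core diagnostics concrete, the following is a minimal sketch of a label-conflict check and a split-contamination check on SMILES strings. It is not BenchAudit's implementation: the function names ``find_label_conflicts`` and ``find_split_overlap`` are hypothetical, and the sketch assumes the SMILES have already been cleaned, so that identical molecules compare equal as strings.

.. code-block:: python

   from collections import defaultdict
   from typing import Dict, Hashable, Iterable, List, Set, Tuple


   def find_label_conflicts(
       records: Iterable[Tuple[str, Hashable]],
   ) -> Dict[str, List[Hashable]]:
       """Return {smiles: sorted labels} for SMILES seen with more than one label.

       Hypothetical helper; assumes each SMILES string is already cleaned.
       """
       labels_by_smiles = defaultdict(set)
       for smiles, label in records:
           labels_by_smiles[smiles].add(label)
       return {
           smiles: sorted(labels)
           for smiles, labels in labels_by_smiles.items()
           if len(labels) > 1
       }


   def find_split_overlap(train: Iterable[str], test: Iterable[str]) -> Set[str]:
       """Return cleaned SMILES that occur in both splits (exact-match contamination)."""
       return set(train) & set(test)


   # "CCO" carries two different labels, so it is reported as a conflict.
   records = [("CCO", 1), ("CCO", 0), ("c1ccccc1", 1)]
   print(find_label_conflicts(records))  # {'CCO': [0, 1]}

   # "CCN" appears in both the training and the test split.
   print(find_split_overlap(["CCO", "CCN"], ["CCN", "CCC"]))  # {'CCN'}

Exact string matching only catches duplicates after cleaning. The similarity and activity-cliff diagnostics listed above compare non-identical molecules and are not covered by this sketch.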