Scientific Scope ================ BenchAudit is designed for scientific auditing of molecular property and drug-target interaction (DTI) benchmarks. The focus is dataset quality, similarity structure, and leakage risk rather than deployment workflows. Research Questions ------------------ The implemented analyses target the following questions: * How much exact overlap exists between training/validation and test data? * How similar are held-out compounds to training compounds? * Do identical molecules receive inconsistent labels across splits? * How often do highly similar molecules show strong label disagreement (activity cliffs)? * For DTI datasets, do targets leak across splits at sequence or structure level? Benchmark Families ------------------ BenchAudit supports four modalities: * ``tabular``: local CSV/TSV/Parquet benchmark files. * ``tdc``: Therapeutics Data Commons datasets via ``pytdc``. * ``polaris``: Polaris benchmarks via ``polaris-lib``. * ``dti``: ligand-target datasets with SMILES and amino-acid sequences. Primary Outputs --------------- Each run produces standardized artifacts intended for analysis and reproducibility: * ``summary.json``: top-level hygiene, similarity, conflict, and cliff statistics. * ``records.csv``: row-level standardized records used by analysis. * ``conflicts.jsonl``: detailed conflict events. * ``cliffs.jsonl``: detailed activity-cliff events. * ``sequence_alignments.jsonl`` / ``structure_alignments.jsonl`` (DTI when available). * ``performance.json`` when baseline benchmarking is enabled. Terminology and Criteria ------------------------ BenchAudit uses explicit criteria: * Duplicate: repeated cleaned SMILES (or repeated normalized target sequence in DTI). * Contamination: shared entities across train/valid and test. * Similar pair: pair that passes the configured consensus similarity threshold. * Conflict: classification labels differ for identical cleaned SMILES (or for DTI cross-split pair checks), while regression conflicts use a 3-sigma threshold estimated from train/valid labels. * Activity cliff: similar molecules with divergent labels under task-specific rules. The similarity consensus combines molecular fingerprint similarity, scaffold fingerprint similarity, and normalized SMILES string similarity.