Analysis and Baselines

Analysis engine (utils.analysis)

class utils.analysis.AnalyzerConfig(task_type, typ, sim_threshold=0.9, fp_radius=2, fp_nbits=2048, smiles_col=None, label_col=None, id_col=None, label_cols=None, sequence_col=None, target_id_col=None, name=None, unique_sequences_jsonl=None, foldseek_m8_path=None)

Bases: object

Minimal, YAML-friendly config for SMILES analysis.

Parameters:
  • task_type (Literal['classification', 'regression'])

  • typ (Literal['tdc', 'tabular', 'polaris'])

  • sim_threshold (float)

  • fp_radius (int)

  • fp_nbits (int)

  • smiles_col (str | None)

  • label_col (str | None)

  • id_col (str | None)

  • label_cols (List[str] | None)

  • sequence_col (str | None)

  • target_id_col (str | None)

  • name (str | None)

  • unique_sequences_jsonl (str | None)

  • foldseek_m8_path (str | None)

task_type: Literal['classification', 'regression']
typ: Literal['tdc', 'tabular', 'polaris']
sim_threshold: float = 0.9
fp_radius: int = 2
fp_nbits: int = 2048
smiles_col: str | None = None
label_col: str | None = None
id_col: str | None = None
label_cols: List[str] | None = None
sequence_col: str | None = None
target_id_col: str | None = None
name: str | None = None
unique_sequences_jsonl: str | None = None
foldseek_m8_path: str | None = None
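
Since the config is described as YAML-friendly, a parsed YAML mapping can presumably be passed to the constructor as keyword arguments. The sketch below illustrates that pattern with a local stand-in dataclass that mirrors only a subset of the documented fields; it does not import utils.analysis, and the helper name, column names, and dataset name are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, Literal, Optional


@dataclass
class AnalyzerConfigSketch:
    """Stand-in mirroring a subset of the documented AnalyzerConfig fields."""
    task_type: Literal["classification", "regression"]
    typ: Literal["tdc", "tabular", "polaris"]
    sim_threshold: float = 0.9
    fp_radius: int = 2
    fp_nbits: int = 2048
    smiles_col: Optional[str] = None
    label_col: Optional[str] = None
    name: Optional[str] = None


def config_from_mapping(raw: Dict[str, Any]) -> AnalyzerConfigSketch:
    """Build the config from a dict such as one produced by a YAML parser."""
    known = {f.name for f in fields(AnalyzerConfigSketch)}
    unknown = set(raw) - known
    if unknown:
        # Fail early on misspelled keys instead of silently ignoring them.
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    return AnalyzerConfigSketch(**raw)


# Hypothetical mapping, e.g. the result of parsing a YAML file.
raw_cfg = {
    "task_type": "regression",
    "typ": "tabular",
    "smiles_col": "smiles",
    "label_col": "pIC50",
    "name": "example_dataset",
}
cfg = config_from_mapping(raw_cfg)
print(cfg.sim_threshold, cfg.fp_radius, cfg.fp_nbits)
```

Fields omitted from the mapping fall back to the documented defaults (sim_threshold=0.9, fp_radius=2, fp_nbits=2048).
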
class utils.analysis.AnalysisResult(summary: Dict[str, Any], per_record_df: pd.DataFrame, conflicts_rows: List[Dict[str, Any]], cliffs_rows: List[Dict[str, Any]], sequence_alignment_rows: Optional[List[Dict[str, Any]]] = None, structure_alignment_rows: Optional[List[Dict[str, Any]]] = None)

Bases: object

Parameters:
  • summary (Dict[str, Any])

  • per_record_df (pandas.DataFrame)

  • conflicts_rows (List[Dict[str, Any]])

  • cliffs_rows (List[Dict[str, Any]])

  • sequence_alignment_rows (List[Dict[str, Any]] | None)

  • structure_alignment_rows (List[Dict[str, Any]] | None)

summary: Dict[str, Any]
per_record_df: pandas.DataFrame
conflicts_rows: List[Dict[str, Any]]
cliffs_rows: List[Dict[str, Any]]
sequence_alignment_rows: List[Dict[str, Any]] | None = None
structure_alignment_rows: List[Dict[str, Any]] | None = None
utils.analysis.morgan_fps(smiles_list, radius, n_bits)

Compute Morgan/ECFP fingerprints. Returns None for invalid SMILES.

Parameters:
  • smiles_list (List[str])

  • radius (int)

  • n_bits (int)

Return type:

List[rdkit.DataStructs.ExplicitBitVect | None]

utils.analysis.scaffold_fps(smiles_list, radius, n_bits)
Parameters:
  • smiles_list (List[str])

  • radius (int)

  • n_bits (int)

Return type:

List[rdkit.DataStructs.ExplicitBitVect | None]

class utils.analysis.StretcherAlignment(score: float, identity_pct: float, similarity_pct: float, length: int, gaps_pct: float, n_gaps: int, aligned_query: str, aligned_subject: str, query_start: int, query_end: int, subject_start: int, subject_end: int)

Bases: object

Parameters:
  • score (float)

  • identity_pct (float)

  • similarity_pct (float)

  • length (int)

  • gaps_pct (float)

  • n_gaps (int)

  • aligned_query (str)

  • aligned_subject (str)

  • query_start (int)

  • query_end (int)

  • subject_start (int)

  • subject_end (int)

score: float
identity_pct: float
similarity_pct: float
length: int
gaps_pct: float
n_gaps: int
aligned_query: str
aligned_subject: str
query_start: int
query_end: int
subject_start: int
subject_end: int
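
Several of these fields can be related to one another through the two aligned strings. The sketch below recomputes length, n_gaps, identity_pct, and gaps_pct from aligned_query and aligned_subject, assuming EMBOSS-style definitions (counts taken over alignment columns, "-" as the gap character). Whether this module uses exactly these definitions is an assumption; similarity_pct is left out because it depends on a substitution matrix. The function name is hypothetical.

```python
from typing import Dict


def alignment_stats(aligned_query: str, aligned_subject: str, gap: str = "-") -> Dict[str, float]:
    """Derive column-based statistics from two gapped, equal-length strings."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned strings must have equal length")
    length = len(aligned_query)
    if length == 0:
        raise ValueError("alignment is empty")

    pairs = list(zip(aligned_query, aligned_subject))
    # A column is identical when both residues match and neither is a gap.
    n_identical = sum(q == s and q != gap for q, s in pairs)
    # A column counts as a gap when either side holds the gap character.
    n_gaps = sum(q == gap or s == gap for q, s in pairs)

    return {
        "length": length,
        "n_gaps": n_gaps,
        "identity_pct": 100.0 * n_identical / length,
        "gaps_pct": 100.0 * n_gaps / length,
    }


stats = alignment_stats("MKT-AYIA", "MKTWAY-A")
print(stats)  # 8 columns, 6 identical, 2 gapped -> identity 75.0, gaps 25.0
```
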
class utils.analysis.PSAStretcherAligner

Bases: object

Thin wrapper around psa.stretcher with caching.

align(query_seq, subject_seq)
Parameters:
  • query_seq (str)

  • subject_seq (str)

Return type:

StretcherAlignment
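
The caching behaviour mentioned above can be pictured as memoizing on the (query_seq, subject_seq) pair, so that repeated comparisons of the same sequences skip the expensive alignment call. The sketch below is a generic illustration of that pattern, not the class's actual implementation: CachedAligner and toy_identity are hypothetical names, and toy_identity is a cheap ungapped stand-in for a real aligner that returns a single float rather than a StretcherAlignment.

```python
from functools import lru_cache
from typing import Callable


class CachedAligner:
    """Memoize an alignment function on the (query, subject) pair."""

    def __init__(self, align_fn: Callable[[str, str], float], maxsize: int = 1024) -> None:
        self.calls = 0  # number of times the underlying aligner actually ran

        def _counted(query_seq: str, subject_seq: str) -> float:
            self.calls += 1
            return align_fn(query_seq, subject_seq)

        # Strings are hashable, so the pair can serve directly as the cache key.
        self._cached = lru_cache(maxsize=maxsize)(_counted)

    def align(self, query_seq: str, subject_seq: str) -> float:
        return self._cached(query_seq, subject_seq)


def toy_identity(query_seq: str, subject_seq: str) -> float:
    """Hypothetical stand-in aligner: ungapped positional identity in percent."""
    n = max(len(query_seq), len(subject_seq))
    if n == 0:
        return 0.0
    matches = sum(a == b for a, b in zip(query_seq, subject_seq))
    return 100.0 * matches / n


aligner = CachedAligner(toy_identity)
first = aligner.align("MKTAYIA", "MKTAYLA")
second = aligner.align("MKTAYIA", "MKTAYLA")  # served from the cache
print(first == second, aligner.calls)
```

Note that a cache keyed this way treats (a, b) and (b, a) as different entries; whether the real wrapper normalizes the order is not stated here.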

class utils.analysis.SMILESAnalyzer(cfg, logger=None)

Bases: object

Simplified, modular SMILES analyzer (MoleculeACE-style similarity).

Parameters:
  • cfg

  • logger

run(splits_raw)
Parameters:
  • splits_raw (Dict[str, pandas.DataFrame])

Return type:

AnalysisResult
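
To make the role of sim_threshold concrete, the sketch below flags activity-cliff candidates in the MoleculeACE spirit: pairs whose fingerprint Tanimoto similarity reaches the threshold while their labels differ strongly. It is an illustration under stated assumptions, not the analyzer's code. Fingerprints are represented as plain sets of on-bit indices (a stand-in for RDKit bit vectors), None marks an invalid SMILES as in morgan_fps, and the min_delta criterion and its value are hypothetical.

```python
from itertools import combinations
from typing import List, Optional, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def find_cliffs(
    fps: List[Optional[Set[int]]],
    labels: List[float],
    sim_threshold: float = 0.9,
    min_delta: float = 1.0,  # hypothetical label gap, e.g. one log unit
) -> List[Tuple[int, int, float]]:
    """Return (i, j, similarity) for similar pairs with strongly differing labels."""
    cliffs = []
    for i, j in combinations(range(len(fps)), 2):
        if fps[i] is None or fps[j] is None:
            continue  # skip records whose SMILES could not be parsed
        sim = tanimoto(fps[i], fps[j])
        if sim >= sim_threshold and abs(labels[i] - labels[j]) >= min_delta:
            cliffs.append((i, j, sim))
    return cliffs


# Toy data: records 0 and 1 share 19 of 20 bits but differ by 1.6 in label.
fps = [set(range(20)), set(range(19)), None, {100, 101}]
labels = [7.5, 5.9, 6.0, 7.4]
cliffs = find_cliffs(fps, labels)
print(cliffs)
```

The all-pairs loop is quadratic in the number of records, which is fine for a sketch but would need bulk similarity routines on larger datasets.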

class utils.analysis.DTIAnalyzer(cfg, logger=None)

Bases: object

Drug–target interaction analysis: combines molecular + sequence hygiene.

Parameters:
  • cfg

  • logger

run(splits_raw)
Parameters:
  • splits_raw (Dict[str, pandas.DataFrame])

Return type:

AnalysisResult

Baselines (utils.baselines)

class utils.baselines.BaselineParams(seed: int = 0, fp_radius: int = 2, fp_nbits: int = 2048, mlp_hidden: tuple = (256, 128), mlp_max_iter: int = 300, rf_estimators: int = 500, lgbm_estimators: int = 800, lgbm_lr: float = 0.05, lgbm_leaves: int = 31, lgbm_subsample: float = 0.8, lgbm_colsample: float = 0.8)

Bases: object

Parameters:
  • seed (int)

  • fp_radius (int)

  • fp_nbits (int)

  • mlp_hidden (tuple)

  • mlp_max_iter (int)

  • rf_estimators (int)

  • lgbm_estimators (int)

  • lgbm_lr (float)

  • lgbm_leaves (int)

  • lgbm_subsample (float)

  • lgbm_colsample (float)

seed: int = 0
fp_radius: int = 2
fp_nbits: int = 2048
mlp_hidden: tuple = (256, 128)
mlp_max_iter: int = 300
rf_estimators: int = 500
lgbm_estimators: int = 800
lgbm_lr: float = 0.05
lgbm_leaves: int = 31
lgbm_subsample: float = 0.8
lgbm_colsample: float = 0.8
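
The field prefixes suggest three model families: a random forest (rf_), an MLP (mlp_), and LightGBM (lgbm_). The sketch below shows one plausible way such parameters could be translated into estimator keyword arguments. The keyword names are the usual ones in scikit-learn and LightGBM's scikit-learn interface, but the mapping itself is an assumption rather than this module's documented behaviour. BaselineParamsSketch is a local stand-in that omits the fingerprint fields, and no estimator library is imported.

```python
from dataclasses import dataclass
from typing import Any, Dict, Tuple


@dataclass
class BaselineParamsSketch:
    """Stand-in mirroring the model-related BaselineParams fields."""
    seed: int = 0
    mlp_hidden: Tuple[int, ...] = (256, 128)
    mlp_max_iter: int = 300
    rf_estimators: int = 500
    lgbm_estimators: int = 800
    lgbm_lr: float = 0.05
    lgbm_leaves: int = 31
    lgbm_subsample: float = 0.8
    lgbm_colsample: float = 0.8


def estimator_kwargs(p: BaselineParamsSketch) -> Dict[str, Dict[str, Any]]:
    """Assumed mapping from parameter fields to per-model keyword arguments."""
    return {
        "rf": {"n_estimators": p.rf_estimators, "random_state": p.seed},
        "mlp": {
            "hidden_layer_sizes": p.mlp_hidden,
            "max_iter": p.mlp_max_iter,
            "random_state": p.seed,
        },
        "lgbm": {
            "n_estimators": p.lgbm_estimators,
            "learning_rate": p.lgbm_lr,
            "num_leaves": p.lgbm_leaves,
            "subsample": p.lgbm_subsample,
            "colsample_bytree": p.lgbm_colsample,
            "random_state": p.seed,
        },
    }


kwargs = estimator_kwargs(BaselineParamsSketch(seed=42))
print(kwargs["lgbm"])
```

Sharing one seed across all models keeps a baseline run reproducible from a single setting.
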
utils.baselines.eval_baselines_generic(cfg, splits)

Train on train only, evaluate on test using standard metrics.

Parameters:
  • cfg (Dict[str, Any])

  • splits (Dict[str, pandas.DataFrame])

Return type:

Dict[str, Any]

utils.baselines.eval_baselines_polaris(cfg)

Train on train only, get predictions for test, and evaluate via Polaris’ API.

Parameters:
  • cfg (Dict[str, Any])

Return type:

Dict[str, Any]

utils.baselines.run_baselines(cfg, splits=None)

Public entry point. Uses the Polaris path when cfg['type'] == 'polaris'; otherwise uses the generic path.

Parameters:
  • cfg (Dict[str, Any])

  • splits (Dict[str, pandas.DataFrame] | None)

Return type:

Dict[str, Any]
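
The dispatch rule described for run_baselines can be sketched as follows. This is a minimal illustration with injected stand-in evaluators, not the module's code: dispatch_baselines and the two lambdas are hypothetical, and raising ValueError when splits is missing on the generic path is an assumption about how that case might be handled.

```python
from typing import Any, Callable, Dict, Optional

Evaluator = Callable[..., Dict[str, Any]]


def dispatch_baselines(
    cfg: Dict[str, Any],
    splits: Optional[Dict[str, Any]] = None,
    *,
    polaris_fn: Evaluator,
    generic_fn: Evaluator,
) -> Dict[str, Any]:
    """Route to the Polaris evaluator when cfg['type'] == 'polaris', else generic."""
    if cfg.get("type") == "polaris":
        # The Polaris path takes only the config, matching eval_baselines_polaris(cfg).
        return polaris_fn(cfg)
    if splits is None:
        # Assumed behaviour: the generic path cannot run without train/test splits.
        raise ValueError("splits are required for non-Polaris configs")
    return generic_fn(cfg, splits)


# Hypothetical stand-ins that only report which path was taken.
polaris_stub = lambda cfg: {"path": "polaris"}
generic_stub = lambda cfg, splits: {"path": "generic", "splits": sorted(splits)}

r1 = dispatch_baselines({"type": "polaris"}, polaris_fn=polaris_stub, generic_fn=generic_stub)
r2 = dispatch_baselines(
    {"type": "tdc"},
    {"train": [], "test": []},
    polaris_fn=polaris_stub,
    generic_fn=generic_stub,
)
print(r1, r2)
```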