Analysis and Baselines
Analysis engine (utils.analysis)
- class utils.analysis.AnalyzerConfig(task_type, typ, sim_threshold=0.9, fp_radius=2, fp_nbits=2048, smiles_col=None, label_col=None, id_col=None, label_cols=None, sequence_col=None, target_id_col=None, name=None, unique_sequences_jsonl=None, foldseek_m8_path=None)
Bases: object
Minimal, YAML-friendly config for SMILES analysis.
- Parameters:
task_type (Literal['classification', 'regression'])
typ (Literal['tdc', 'tabular', 'polaris'])
sim_threshold (float)
fp_radius (int)
fp_nbits (int)
smiles_col (str | None)
label_col (str | None)
id_col (str | None)
label_cols (List[str] | None)
sequence_col (str | None)
target_id_col (str | None)
name (str | None)
unique_sequences_jsonl (str | None)
foldseek_m8_path (str | None)
- task_type: Literal['classification', 'regression']
- typ: Literal['tdc', 'tabular', 'polaris']
- sim_threshold: float = 0.9
- fp_radius: int = 2
- fp_nbits: int = 2048
- smiles_col: str | None = None
- label_col: str | None = None
- id_col: str | None = None
- label_cols: List[str] | None = None
- sequence_col: str | None = None
- target_id_col: str | None = None
- name: str | None = None
- unique_sequences_jsonl: str | None = None
- foldseek_m8_path: str | None = None
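Because the config is described as YAML-friendly, a short sketch of building it from a plain mapping may help. The code below does not import the real class: `AnalyzerConfigSketch` is a stand-in dataclass that mirrors the documented fields and defaults, and `config_from_mapping` and the column names are hypothetical. The real `AnalyzerConfig` may validate its input differently.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict, List, Literal, Optional


@dataclass
class AnalyzerConfigSketch:
    """Stand-in mirroring the documented AnalyzerConfig fields (not the real class)."""

    task_type: Literal["classification", "regression"]
    typ: Literal["tdc", "tabular", "polaris"]
    sim_threshold: float = 0.9
    fp_radius: int = 2
    fp_nbits: int = 2048
    smiles_col: Optional[str] = None
    label_col: Optional[str] = None
    id_col: Optional[str] = None
    label_cols: Optional[List[str]] = None
    sequence_col: Optional[str] = None
    target_id_col: Optional[str] = None
    name: Optional[str] = None
    unique_sequences_jsonl: Optional[str] = None
    foldseek_m8_path: Optional[str] = None


def config_from_mapping(raw: Dict[str, Any]) -> AnalyzerConfigSketch:
    """Build the config from a dict, rejecting keys the dataclass does not define."""
    known = {f.name for f in fields(AnalyzerConfigSketch)}
    unknown = sorted(set(raw) - known)
    if unknown:
        raise KeyError(f"unknown config keys: {unknown}")
    return AnalyzerConfigSketch(**raw)


# Hypothetical values, shaped like the result of parsing a YAML file.
raw_cfg = {
    "task_type": "regression",
    "typ": "tabular",
    "smiles_col": "smiles",
    "label_col": "pIC50",
}
cfg = config_from_mapping(raw_cfg)
```

Only `task_type` and `typ` are required; every other field falls back to the documented default. Rejecting unknown keys is a design choice of this sketch that catches typos in a config file early.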
- class utils.analysis.AnalysisResult(summary: Dict[str, Any], per_record_df: pd.DataFrame, conflicts_rows: List[Dict[str, Any]], cliffs_rows: List[Dict[str, Any]], sequence_alignment_rows: Optional[List[Dict[str, Any]]] = None, structure_alignment_rows: Optional[List[Dict[str, Any]]] = None)
Bases: object
- Parameters:
summary (Dict[str, Any])
per_record_df (pandas.DataFrame)
conflicts_rows (List[Dict[str, Any]])
cliffs_rows (List[Dict[str, Any]])
sequence_alignment_rows (List[Dict[str, Any]] | None)
structure_alignment_rows (List[Dict[str, Any]] | None)
- summary: Dict[str, Any]
- per_record_df: pandas.DataFrame
- conflicts_rows: List[Dict[str, Any]]
- cliffs_rows: List[Dict[str, Any]]
- sequence_alignment_rows: List[Dict[str, Any]] | None = None
- structure_alignment_rows: List[Dict[str, Any]] | None = None
- utils.analysis.morgan_fps(smiles_list, radius, n_bits)
Compute Morgan/ECFP fingerprints. Returns None for invalid SMILES.
- Parameters:
smiles_list (List[str])
radius (int)
n_bits (int)
- Return type:
List[rdkit.DataStructs.ExplicitBitVect | None]
- utils.analysis.scaffold_fps(smiles_list, radius, n_bits)
- Parameters:
smiles_list (List[str])
radius (int)
n_bits (int)
- Return type:
List[rdkit.DataStructs.ExplicitBitVect | None]
- class utils.analysis.StretcherAlignment(score: float, identity_pct: float, similarity_pct: float, length: int, gaps_pct: float, n_gaps: int, aligned_query: str, aligned_subject: str, query_start: int, query_end: int, subject_start: int, subject_end: int)
Bases: object
- Parameters:
score (float)
identity_pct (float)
similarity_pct (float)
length (int)
gaps_pct (float)
n_gaps (int)
aligned_query (str)
aligned_subject (str)
query_start (int)
query_end (int)
subject_start (int)
subject_end (int)
- score: float
- identity_pct: float
- similarity_pct: float
- length: int
- gaps_pct: float
- n_gaps: int
- aligned_query: str
- aligned_subject: str
- query_start: int
- query_end: int
- subject_start: int
- subject_end: int
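To make the percentage fields concrete, here is a minimal sketch of how `length`, `n_gaps`, `identity_pct`, and `gaps_pct` could be derived from the two gapped strings. It assumes that percentages are taken over the full alignment length and that `-` is the gap character. `alignment_stats` is a hypothetical helper, not part of `utils.analysis`, and the library's exact definitions may differ.

```python
from typing import Dict


def alignment_stats(aligned_query: str, aligned_subject: str, gap: str = "-") -> Dict[str, float]:
    """Derive simple alignment statistics from two equal-length gapped strings."""
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned strings must have equal length")
    length = len(aligned_query)
    if length == 0:
        raise ValueError("empty alignment")

    columns = list(zip(aligned_query, aligned_subject))
    # A column counts as a gap if either sequence has the gap character there.
    n_gaps = sum(1 for q, s in columns if q == gap or s == gap)
    # A column counts as identical if both residues match and neither is a gap.
    n_identical = sum(1 for q, s in columns if q == s and q != gap)

    return {
        "length": length,
        "n_gaps": n_gaps,
        "identity_pct": 100.0 * n_identical / length,  # assumption: denominator is alignment length
        "gaps_pct": 100.0 * n_gaps / length,
    }


stats = alignment_stats("MKT-AYIA", "MKTSAY-A")
```

For this toy pair the alignment has 8 columns, 2 of which contain a gap and 6 of which are identical, so the sketch yields 75% identity and 25% gaps.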
- class utils.analysis.PSAStretcherAligner
Bases: object
Thin wrapper around psa.stretcher with caching.
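The caching idea can be sketched as follows. This is not the actual wrapper: `CachedPairAligner` and `toy_align` are hypothetical names, and `toy_align` is a stand-in scorer used so the example runs without `psa` or EMBOSS. The sketch assumes the cache is keyed on the ordered (query, subject) pair.

```python
from functools import lru_cache
from typing import Callable


class CachedPairAligner:
    """Hypothetical caching wrapper; the real PSAStretcherAligner may differ."""

    def __init__(self, align_fn: Callable[[str, str], float], maxsize: int = 4096) -> None:
        self.calls = 0  # how many times the underlying aligner actually ran

        @lru_cache(maxsize=maxsize)
        def _cached(query: str, subject: str) -> float:
            self.calls += 1
            return align_fn(query, subject)

        self._cached = _cached

    def align(self, query: str, subject: str) -> float:
        """Return the alignment score, reusing a cached result when available."""
        return self._cached(query, subject)


def toy_align(query: str, subject: str) -> float:
    """Stand-in scorer: fraction of matching positions over the longer length."""
    longest = max(len(query), len(subject))
    if longest == 0:
        return 0.0
    matches = sum(1 for q, s in zip(query, subject) if q == s)
    return matches / longest


aligner = CachedPairAligner(toy_align)
first = aligner.align("MKTAYIA", "MKTSYIA")
second = aligner.align("MKTAYIA", "MKTSYIA")  # served from the cache
```

Caching pays off when the same sequence pair recurs across many records, since each pairwise alignment otherwise repeats an expensive external call.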
- class utils.analysis.SMILESAnalyzer(cfg, logger=None)
Bases: object
Simplified, modular SMILES analyzer (MoleculeACE-style similarity).
- Parameters:
cfg (AnalyzerConfig)
logger (Optional[logging.Logger])
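A similarity check of this kind can be illustrated with Tanimoto similarity over fingerprints. The sketch below represents each fingerprint as a set of on-bit indices so it runs without RDKit. `tanimoto` and `similar_pairs` are hypothetical helpers, and whether `SMILESAnalyzer` applies `sim_threshold` exactly this way (Tanimoto, inclusive `>=` comparison, all-pairs scan) is an assumption.

```python
from itertools import combinations
from typing import FrozenSet, List, Optional, Sequence, Tuple


def tanimoto(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Tanimoto similarity: shared on-bits divided by on-bits in either fingerprint."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similar_pairs(
    fps: Sequence[Optional[FrozenSet[int]]],
    sim_threshold: float = 0.9,
) -> List[Tuple[int, int, float]]:
    """Return (i, j, similarity) for every pair at or above the threshold."""
    pairs = []
    for i, j in combinations(range(len(fps)), 2):
        if fps[i] is None or fps[j] is None:
            continue  # skip entries whose SMILES could not be parsed
        sim = tanimoto(fps[i], fps[j])
        if sim >= sim_threshold:
            pairs.append((i, j, sim))
    return pairs


# Toy fingerprints; None stands in for an invalid SMILES.
toy_fps = [
    frozenset(range(0, 20)),   # 20 on-bits
    frozenset(range(1, 20)),   # shares 19 of 20 bits with the first entry
    frozenset(range(15, 40)),  # mostly disjoint from the first two
    None,
]
pairs = similar_pairs(toy_fps, sim_threshold=0.9)
```

Only the first two toy fingerprints clear the 0.9 threshold (Tanimoto 19/20). The all-pairs loop is quadratic in the number of records, which is acceptable for a sketch but would need batching on large datasets.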
- class utils.analysis.DTIAnalyzer(cfg, logger=None)
Bases: object
Drug–target interaction analysis: combines molecular + sequence hygiene.
- Parameters:
cfg (AnalyzerConfig)
logger (Optional[logging.Logger])
Baselines (utils.baselines)
- class utils.baselines.BaselineParams(seed: int = 0, fp_radius: int = 2, fp_nbits: int = 2048, mlp_hidden: tuple = (256, 128), mlp_max_iter: int = 300, rf_estimators: int = 500, lgbm_estimators: int = 800, lgbm_lr: float = 0.05, lgbm_leaves: int = 31, lgbm_subsample: float = 0.8, lgbm_colsample: float = 0.8)
Bases: object
- Parameters:
seed (int)
fp_radius (int)
fp_nbits (int)
mlp_hidden (tuple)
mlp_max_iter (int)
rf_estimators (int)
lgbm_estimators (int)
lgbm_lr (float)
lgbm_leaves (int)
lgbm_subsample (float)
lgbm_colsample (float)
- seed: int = 0
- fp_radius: int = 2
- fp_nbits: int = 2048
- mlp_hidden: tuple = (256, 128)
- mlp_max_iter: int = 300
- rf_estimators: int = 500
- lgbm_estimators: int = 800
- lgbm_lr: float = 0.05
- lgbm_leaves: int = 31
- lgbm_subsample: float = 0.8
- lgbm_colsample: float = 0.8
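The field names suggest three model families (MLP, random forest, LightGBM). The sketch below shows one plausible way the parameters could translate into estimator keyword arguments. `BaselineParamsSketch` and `estimator_kwargs` are hypothetical, and the mapping itself is an assumption. The keyword names are the usual scikit-learn and LightGBM ones, but only dictionaries are built here, so the code runs without either library.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class BaselineParamsSketch:
    """Stand-in mirroring the documented BaselineParams defaults (not the real class)."""

    seed: int = 0
    fp_radius: int = 2
    fp_nbits: int = 2048
    mlp_hidden: tuple = (256, 128)
    mlp_max_iter: int = 300
    rf_estimators: int = 500
    lgbm_estimators: int = 800
    lgbm_lr: float = 0.05
    lgbm_leaves: int = 31
    lgbm_subsample: float = 0.8
    lgbm_colsample: float = 0.8


def estimator_kwargs(p: BaselineParamsSketch) -> Dict[str, Dict[str, Any]]:
    """Assumed mapping from baseline parameters to per-model keyword arguments."""
    return {
        "rf": {"n_estimators": p.rf_estimators, "random_state": p.seed},
        "mlp": {
            "hidden_layer_sizes": p.mlp_hidden,
            "max_iter": p.mlp_max_iter,
            "random_state": p.seed,
        },
        "lgbm": {
            "n_estimators": p.lgbm_estimators,
            "learning_rate": p.lgbm_lr,
            "num_leaves": p.lgbm_leaves,
            "subsample": p.lgbm_subsample,
            "colsample_bytree": p.lgbm_colsample,
            "random_state": p.seed,
        },
    }


kwargs = estimator_kwargs(BaselineParamsSketch(seed=7))
```

Passing the same `seed` to every model keeps the baselines reproducible from a single setting.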
- utils.baselines.eval_baselines_generic(cfg, splits)
Train each baseline on the train split only, then evaluate it on the test split using standard metrics.
- Parameters:
cfg (Dict[str, Any])
splits (Dict[str, pandas.DataFrame])
- Return type:
Dict[str, Any]
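The train-only protocol can be sketched with a deliberately trivial model. The example below is not `eval_baselines_generic`: it uses plain lists instead of DataFrames, a mean predictor instead of the real baselines, and assumes the split keys are `"train"` and `"test"`. `eval_mean_baseline` is a hypothetical name.

```python
import math
from typing import Dict, List


def eval_mean_baseline(splits: Dict[str, List[float]]) -> Dict[str, float]:
    """Fit a mean predictor on the train labels and score it on the test labels."""
    train_y, test_y = splits["train"], splits["test"]
    if not train_y or not test_y:
        raise ValueError("both splits must be non-empty")

    # "Fit": only train labels are used, so no test information leaks in.
    prediction = sum(train_y) / len(train_y)

    # "Evaluate": compare the constant prediction against held-out labels.
    errors = [y - prediction for y in test_y]
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mae = sum(abs(e) for e in errors) / len(errors)
    return {"rmse": rmse, "mae": mae}


toy_splits = {"train": [1.0, 2.0, 3.0], "test": [2.0, 4.0]}
metrics = eval_mean_baseline(toy_splits)
```

The point of the sketch is the data flow: anything learned (here, a single mean) comes from the train split, and the metrics are computed only on rows the model never saw.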