Loaders and Data Preparation

Loaders (utils.loader)

Dataset loaders for tabular, TDC, Polaris, and DTI benchmark inputs.

class utils.loader.BaseLoader(cfg)

Bases: object

Base class for config-normalized dataset loaders.

Parameters:

cfg (Dict[str, Any])

get_splits()

Return dataframes keyed by canonical split name.

Return type:

Dict[str, pandas.DataFrame]

class utils.loader.TDCLoader(cfg)

Bases: BaseLoader

Loader for Therapeutics Data Commons single-prediction datasets.

Parameters:

cfg (Dict[str, Any])

get_splits()

Load TDC train, validation, and test splits as standardized frames.

Return type:

Dict[str, pandas.DataFrame]

class utils.loader.TabularLoader(cfg)

Bases: BaseLoader

Loader for local CSV, TSV, or Parquet files with configurable columns.

Parameters:

cfg (Dict[str, Any])

DEFAULT_SMILES_COLS = ['smiles', 'SMILES', 'drug', 'Drug']
DEFAULT_LABEL_COLS = ['label_raw', 'label', 'Label', 'y', 'Y']
DEFAULT_ID_COLS = ['id', 'ID', 'compound_id', 'compoundID']
DEFAULT_SEQUENCE_COLS = ['sequence_aa', 'sequence', 'Sequence', 'protein_sequence', 'ProteinSequence', 'target_sequence', 'TargetSequence', 'AASequence']
DEFAULT_TARGET_ID_COLS = ['target_id', 'target', 'TargetID', 'protein_id', 'ProteinID']
get_splits()

Load either explicit split files or a single file with a split column.

Return type:

Dict[str, pandas.DataFrame]
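
The DEFAULT_*_COLS lists read as ordered candidate names for each logical column. The sketch below shows one plausible resolution rule (the first candidate present in the file wins) and how a single file could be partitioned by a split column. The helper names resolve_column and group_by_split, the "split" column name, the first-match rule, and the tiny in-memory CSV are all assumptions for illustration; the loader's actual logic is not documented here.

```python
import csv
import io
from collections import defaultdict

# Candidate names copied from TabularLoader.DEFAULT_SMILES_COLS / DEFAULT_LABEL_COLS.
DEFAULT_SMILES_COLS = ["smiles", "SMILES", "drug", "Drug"]
DEFAULT_LABEL_COLS = ["label_raw", "label", "Label", "y", "Y"]


def resolve_column(columns, candidates):
    """Return the first candidate present in `columns` (assumed rule), else None."""
    present = set(columns)
    for name in candidates:
        if name in present:
            return name
    return None


def group_by_split(rows, split_col="split"):
    """Group row dicts by the value of a split column (hypothetical column name)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[split_col]].append(row)
    return dict(groups)


# Made-up in-memory CSV standing in for a local file.
raw = "Drug,Y,split\nCCO,0.1,train\nCCN,0.7,train\nc1ccccc1,1.2,test\n"
reader = csv.DictReader(io.StringIO(raw))
rows = list(reader)

smiles_col = resolve_column(reader.fieldnames, DEFAULT_SMILES_COLS)
label_col = resolve_column(reader.fieldnames, DEFAULT_LABEL_COLS)
splits = group_by_split(rows)
```

An ordered candidate list keeps the common case configuration-free while still letting an explicit column name in cfg take precedence, if the loader supports one.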

class utils.loader.PolarisLoader(cfg)

Bases: BaseLoader

Minimal Polaris loader. Expects cfg = {"type": "polaris", "name": "<vendor/benchmark-id>"}. Returns only the 'train' and 'test' splits, each with the columns smiles_clean, label_raw, and id.

Parameters:

cfg (Dict[str, Any])

get_splits()

Load a Polaris benchmark into standardized train and test frames.

Return type:

Dict[str, pandas.DataFrame]
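
The config shape described above can be written out and checked before a loader is constructed. This is a minimal sketch: check_polaris_cfg is a hypothetical helper, the "contains a slash" check is an assumption drawn from the "<vendor/benchmark-id>" placeholder, and the benchmark name is deliberately left as that placeholder.

```python
from typing import Any, Dict


def check_polaris_cfg(cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Validate the two keys the PolarisLoader docstring mentions (hypothetical helper)."""
    if cfg.get("type") != "polaris":
        raise ValueError("cfg['type'] must be 'polaris'")
    name = cfg.get("name")
    # Assumption: a benchmark id has a vendor part and an id part separated by '/'.
    if not isinstance(name, str) or "/" not in name:
        raise ValueError("cfg['name'] should look like '<vendor/benchmark-id>'")
    return cfg


# The benchmark id is left as the placeholder from the docstring.
cfg = check_polaris_cfg({"type": "polaris", "name": "<vendor/benchmark-id>"})

# Split keys and columns the docstring says get_splits() returns.
EXPECTED_SPLITS = {"train", "test"}
EXPECTED_COLUMNS = ["smiles_clean", "label_raw", "id"]
```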

class utils.loader.DTILoader(cfg)

Bases: TabularLoader

DTI loader built on TabularLoader with sensible defaults.

Parameters:

cfg (Dict[str, Any])

Cleaning (utils.cleaner)

SMILES cleaning and REOS alert utilities.

class utils.cleaner.REOS(active_rules=None)

Bases: object

REOS (Rapid Elimination Of Swill), adapted to fit our needs, with more information recorded per rule.

Walters, Ajay, Murcko, “Recognizing molecules with druglike properties”, Curr. Opin. Chem. Biol., 3 (1999), 384-387

https://doi.org/10.1016/S1367-5931(99)80058-1

Parameters:

active_rules (List[str] | None)

set_output_smarts(output_smarts)

Set whether SMARTS are returned.

Parameters:

output_smarts – True or False

Returns:

None

parse_smarts()

Parse the SMARTS strings in the rules file to molecule objects and check for validity

Returns:

True if all SMARTS are parsed, False otherwise

read_rules(rules_file, active_rules=None)

Read a rules file

Parameters:
  • rules_file – name of the rules file

  • active_rules – list of active rule sets, all rule sets are used if this is None

Returns:

None

set_active_rule_sets(active_rules=None)

Set the active rule set(s)

Parameters:

active_rules – list of active rule sets

Returns:

None

set_min_priority(min_priority)

Set the minimum priority for rules to be included in the active rule set.

Parameters:

min_priority (int) – The minimum priority for rules to be included.

Return type:

None

get_available_rule_sets()

Get the available rule sets in rule_df

Returns:

a list of available rule sets

get_active_rule_sets()

Get the active rule sets in active_rule_df

Returns:

a list of active rule sets

drop_rule(description)

Drops a rule from the active rule set based on its description.

Parameters:

description (str) – the description of the rule to be dropped

Return type:

None
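
The setters above narrow the active rule set in three ways: by rule set name, by minimum priority, and by dropping a rule by description. The sketch below models that filtering on plain dictionaries. The rule rows are made up, the "priority" field name is assumed, and "keep rules whose priority is at least min_priority" is one reading of the set_min_priority description rather than confirmed behavior. select_active is a hypothetical helper, not part of the class.

```python
from typing import Dict, Iterable, List, Optional

# Made-up rule rows for illustration; real rules are read from the rules file.
# rule_set_name and description mirror the fields process_mol returns;
# "priority" is an assumed field name.
RULES = [
    {"rule_set_name": "SetA", "description": "rule one", "priority": 1},
    {"rule_set_name": "SetA", "description": "rule two", "priority": 3},
    {"rule_set_name": "SetB", "description": "rule three", "priority": 2},
]


def select_active(
    rules: List[Dict],
    active_rules: Optional[Iterable[str]] = None,
    min_priority: Optional[int] = None,
    dropped: Iterable[str] = (),
) -> List[Dict]:
    """Filter rule rows the way the REOS setters are described (assumed semantics)."""
    active = None if active_rules is None else set(active_rules)
    dropped = set(dropped)
    selected = []
    for rule in rules:
        if active is not None and rule["rule_set_name"] not in active:
            continue  # rule set not active; None means all rule sets are used
        if min_priority is not None and rule["priority"] < min_priority:
            continue  # below the minimum priority (direction is an assumption)
        if rule["description"] in dropped:
            continue  # removed by description, as drop_rule is documented to do
        selected.append(rule)
    return selected


active = select_active(RULES, active_rules=["SetA"], min_priority=2)
remaining = select_active(RULES, dropped=["rule three"])
```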

get_rule_file_location()

Get the path to the rules file as a Path

Returns:

Path for rules file

process_mol(mol, detailed=False)

Match a molecule against the active rule set.

Parameters:
  • mol – input RDKit molecule

  • detailed (bool) – if True, returns additional info regarding all failed rules. If False (default), returns only the first failed rule or "ok".

Returns:

  • If detailed is False: a tuple (rule_set_name, description) for the first failed rule, or ("ok", "ok") if no rule fails. If output_smarts is True, the tuple is (rule_set_name, description, smarts), or ("ok", "ok", "ok") if no rule fails.

  • If detailed is True: a flattened tuple.
    • If self.output_smarts is False: (rule_set_name, description, num_failed, failed_rules)

    • If self.output_smarts is True: (rule_set_name, description, smarts, num_failed, failed_rules)

process_smiles(smiles, detailed=False)

Convert SMILES to an RDKit molecule and call process_mol.

Parameters:
  • smiles – input SMILES string

  • detailed (bool) – if True, returns additional detailed info from process_mol.

Returns:

the result from process_mol, or None if the SMILES cannot be parsed.
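
Because the tuple length depends on both output_smarts and detailed, and process_smiles can also return None, callers may find it easier to normalize the result into a labeled dictionary. The helper below is a sketch that follows the shapes documented above. unpack_reos_result is a hypothetical name, and the example tuples are written by hand rather than produced by the class. The shape returned when detailed is True and no rule fails is not documented, so the helper raises on any unexpected length.

```python
from typing import Any, Dict, Optional, Sequence


def unpack_reos_result(
    result: Optional[Sequence[Any]],
    output_smarts: bool = False,
    detailed: bool = False,
) -> Optional[Dict[str, Any]]:
    """Label the fields of a process_mol / process_smiles return value."""
    if result is None:
        return None  # process_smiles could not parse the SMILES
    keys = ["rule_set_name", "description"]
    if output_smarts:
        keys.append("smarts")
    if detailed:
        keys += ["num_failed", "failed_rules"]
    if len(result) != len(keys):
        raise ValueError(f"expected {len(keys)} fields, got {len(result)}")
    return dict(zip(keys, result))


def passed(unpacked: Dict[str, Any]) -> bool:
    """A molecule passes when the documented "ok" marker is returned."""
    return unpacked["rule_set_name"] == "ok"


# Hand-written tuples in the documented shapes (not real REOS output).
clean = unpack_reos_result(("ok", "ok"))
flagged = unpack_reos_result(
    ("SetA", "rule two", 2, ["rule two", "rule three"]), detailed=True
)
unparsed = unpack_reos_result(None)
```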

pandas_smiles(smiles_list, detailed=False)

Process a list of SMILES strings and return a DataFrame with the results.

Parameters:
  • smiles_list (List[str]) – list of SMILES strings

  • detailed (bool) – if True, the DataFrame includes two extra columns: 'num_failed' (the number of violated rules) and 'failed_rules' (the list of rules that failed).

Returns:

pandas DataFrame with the results.

Return type:

pandas.DataFrame

class utils.cleaner.SMILESCleaner(smiles)

Bases: object

Clean, canonicalize, deduplicate, standardize, and annotate SMILES strings.

Parameters:

smiles (List[str])

canonicalize(smiles)
Parameters:

smiles (str)

Return type:

str

canonicalize_all()
Return type:

None

deduplicate_all()
Return type:

None
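
Deduplication is only meaningful after canonicalization, since the same molecule can be written as many different SMILES strings. The sketch below shows that ordering with a pluggable canonicalizer. The deduplicate function and the keep-first-occurrence policy are assumptions, not the class's documented behavior. The toy canonicalizer only trims whitespace so the example runs without RDKit; with RDKit installed, a real canonicalizer could be built from Chem.MolFromSmiles and Chem.MolToSmiles.

```python
from typing import Callable, Dict, List


def deduplicate(smiles: List[str], canonicalize: Callable[[str], str]) -> List[str]:
    """Return canonical forms in first-seen order, dropping repeats."""
    seen: Dict[str, None] = {}
    for s in smiles:
        # Dicts preserve insertion order, so the first occurrence fixes the position.
        seen.setdefault(canonicalize(s), None)
    return list(seen)


def toy_canonicalize(s: str) -> str:
    """Stand-in only: trims whitespace. It does NOT canonicalize chemistry."""
    return s.strip()


unique = deduplicate(["CCO", " CCO ", "CCN", "CCO"], toy_canonicalize)
```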

static smiles_to_mol(smiles)
Parameters:

smiles (str)

Return type:

rdkit.Chem.rdchem.Mol

static mol_to_molblock(mol)
Parameters:

mol (rdkit.Chem.rdchem.Mol)

Return type:

rdkit.Chem.rdchem.Mol

static molblock_to_mol(molblock)
Parameters:

molblock (rdkit.Chem.rdchem.Mol)

Return type:

rdkit.Chem.rdchem.Mol

static mol_to_smiles(mol)
Parameters:

mol (rdkit.Chem.rdchem.Mol)

Return type:

str

static mol_to_inchi(mol)
Parameters:

mol (rdkit.Chem.rdchem.Mol)

Return type:

str

static standardize(molblock)
Parameters:

molblock (rdkit.Chem.rdchem.Mol)

Return type:

rdkit.Chem.rdchem.Mol

static standardize_mol(mol)

Adapted from https://www.blopig.com/blog/2022/05/molecular-standardization/

Standardize the RDKit molecule, select its parent molecule, uncharge it, then enumerate all the tautomers.

Parameters:

mol (rdkit.Chem.rdchem.Mol)

Return type:

rdkit.Chem.rdchem.Mol

static structure_check(molblock)

The checker assesses the quality of a structure. It highlights specific features or comments in the structure that may need to be revised. Together with a description of each issue, the checker returns a penalty score (from 0 to 9) that reflects the seriousness of the issue: the higher the score, the more critical the issue.

Parameters:

molblock (rdkit.Chem.rdchem.Mol)

Return type:

int
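
The description mentions a penalty per issue, while the return type is a single int. One plausible reading is that the method reports the most severe penalty found. The sketch below shows that reduction under the assumption that issues arrive as (score, description) pairs. worst_penalty is a hypothetical helper, the issue entries are placeholders, and returning 0 for a structure with no issues is also an assumption.

```python
from typing import Iterable, Tuple


def worst_penalty(issues: Iterable[Tuple[int, str]]) -> int:
    """Return the highest penalty among (score, description) pairs, or 0 if none."""
    return max((score for score, _ in issues), default=0)


# Placeholder issues, not real checker output.
issues = [(2, "example issue A"), (6, "example issue B")]
score = worst_penalty(issues)
```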

standardize_all()
Return type:

None

reos_filter_all()
Return type:

None

get_valid()
Return type:

pandas.DataFrame

get_data()
Return type:

pandas.DataFrame

Split helpers (utils.splitting)

Split-index helpers for random and scaffold-based dataset partitioning.

utils.splitting.random_split_indices(n_items, frac_train, frac_valid, frac_test, seed=123)

Return shuffled train, validation, and test indices for n_items.

Parameters:
  • n_items (int)

  • frac_train (float)

  • frac_valid (float)

  • frac_test (float)

  • seed (int | None)

Return type:

Tuple[List[int], List[int], List[int]]
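
A seeded random split can be sketched as shuffling the index range with a dedicated random generator and cutting it at the fraction boundaries. This is not the library's implementation: random_split_sketch is a hypothetical name, and both the truncating rounding and the rule that the test split takes the remainder are assumptions.

```python
import random
from typing import List, Optional, Tuple


def random_split_sketch(
    n_items: int,
    frac_train: float,
    frac_valid: float,
    frac_test: float,
    seed: Optional[int] = 123,
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle range(n_items) reproducibly and cut it into three index lists."""
    if abs(frac_train + frac_valid + frac_test - 1.0) > 1e-8:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_items))
    # A local Random instance keeps the global random state untouched.
    random.Random(seed).shuffle(indices)
    n_train = int(frac_train * n_items)  # truncation is an assumed convention
    n_valid = int(frac_valid * n_items)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # remainder goes to test
    return train, valid, test


train, valid, test = random_split_sketch(10, 0.8, 0.1, 0.1, seed=123)
```

Passing the same seed yields the same partition, which is what makes a split reproducible across runs.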

utils.splitting.scaffold_split_indices(smiles_list, frac_train, frac_valid, frac_test, seed=123)

DeepChem-style scaffold split: group by Bemis-Murcko scaffold, then size-sort.

Parameters:
  • smiles_list (Iterable[str])

  • frac_train (float)

  • frac_valid (float)

  • frac_test (float)

  • seed (int | None)

Return type:

Tuple[List[int], List[int], List[int]]
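
The group-then-size-sort idea can be sketched without a chemistry toolkit by taking precomputed scaffold keys as input. In practice the keys would be Bemis-Murcko scaffold SMILES, for example from RDKit's MurckoScaffold.MurckoScaffoldSmiles. The function name scaffold_split_sketch, the letter keys in the example, and the tie-breaking rule are assumptions. The fill order follows the DeepChem convention of assigning the largest groups to train first, and the sketch does not model whatever role seed plays in the library.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def scaffold_split_sketch(
    scaffolds: Sequence[str],
    frac_train: float,
    frac_valid: float,
    frac_test: float,
) -> Tuple[List[int], List[int], List[int]]:
    """Assign whole scaffold groups to train, then valid, then test."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    # Largest groups first; ties broken by first index so the result is deterministic.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    n = len(scaffolds)
    train_cutoff = frac_train * n
    valid_cutoff = (frac_train + frac_valid) * n
    train: List[int] = []
    valid: List[int] = []
    test: List[int] = []  # frac_test is implicit: test receives whatever is left
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Letter keys stand in for scaffold SMILES.
keys = ["A", "B", "A", "C", "B", "A", "D", "B", "A", "C"]
train, valid, test = scaffold_split_sketch(keys, 0.5, 0.25, 0.25)
```

Because groups are never divided, no scaffold appears in more than one split, and the realized split sizes only approximate the requested fractions.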

utils.splitting.split_indices(smiles_list, method, fracs, seed=123)

Dispatch to a supported split strategy and return split indices.

Parameters:
  • smiles_list (Iterable[str])

  • method (str)

  • fracs – the train, validation, and test fractions

  • seed (int | None)

Return type:

Tuple[List[int], List[int], List[int]]
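
Dispatching on a method name is commonly done with a lookup table of callables that share one signature. The sketch below shows that pattern only. The accepted method strings of the real function are not listed here, so the single registered strategy, "contiguous", is a made-up stand-in, and unpacking fracs as a (train, valid, test) triple is an assumption based on the sibling functions' parameters.

```python
from typing import Callable, Dict, Iterable, List, Optional, Sequence, Tuple

Split = Tuple[List[int], List[int], List[int]]


def contiguous_split(
    items: Sequence[str],
    frac_train: float,
    frac_valid: float,
    frac_test: float,
    seed: Optional[int] = None,
) -> Split:
    """Stand-in strategy: unshuffled contiguous blocks, only to make the sketch runnable."""
    n = len(items)
    n_train = int(frac_train * n)
    n_valid = int(frac_valid * n)
    idx = list(range(n))
    return idx[:n_train], idx[n_train:n_train + n_valid], idx[n_train + n_valid:]


# Hypothetical registry; the real method names are not documented here.
STRATEGIES: Dict[str, Callable[..., Split]] = {"contiguous": contiguous_split}


def split_indices_sketch(
    smiles_list: Iterable[str],
    method: str,
    fracs: Sequence[float],
    seed: Optional[int] = 123,
) -> Split:
    """Look up the strategy by name and forward the unpacked fractions."""
    items = list(smiles_list)  # materialize once so len() works on any iterable
    try:
        strategy = STRATEGIES[method]
    except KeyError:
        raise ValueError(
            f"unknown split method {method!r}; expected one of {sorted(STRATEGIES)}"
        ) from None
    frac_train, frac_valid, frac_test = fracs
    return strategy(items, frac_train, frac_valid, frac_test, seed=seed)


result = split_indices_sketch(["C"] * 8, "contiguous", (0.5, 0.25, 0.25))
```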