Loaders and Data Preparation

Loaders (utils.loader)

class utils.loader.BaseLoader(cfg)[source]

Bases: object

Parameters:

cfg (Dict[str, Any])

get_splits()[source]
Return type:

Dict[str, pandas.DataFrame]

class utils.loader.TDCLoader(cfg)[source]

Bases: BaseLoader

Parameters:

cfg (Dict[str, Any])

get_splits()[source]
Return type:

Dict[str, pandas.DataFrame]

class utils.loader.TabularLoader(cfg)[source]

Bases: BaseLoader

Parameters:

cfg (Dict[str, Any])

DEFAULT_SMILES_COLS = ['smiles', 'SMILES', 'drug', 'Drug']
DEFAULT_LABEL_COLS = ['label_raw', 'label', 'Label', 'y', 'Y']
DEFAULT_ID_COLS = ['id', 'ID', 'compound_id', 'compoundID']
DEFAULT_SEQUENCE_COLS = ['sequence_aa', 'sequence', 'Sequence', 'protein_sequence', 'ProteinSequence', 'target_sequence', 'TargetSequence', 'AASequence']
DEFAULT_TARGET_ID_COLS = ['target_id', 'target', 'TargetID', 'protein_id', 'ProteinID']
get_splits()[source]
Return type:

Dict[str, pandas.DataFrame]
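
The DEFAULT_*_COLS lists read like ordered candidate names for each logical column. How TabularLoader actually resolves them is not documented here, so the following is only a sketch of one plausible approach: take the first candidate that appears in the file header. The helper first_present is hypothetical and not part of utils.loader.

```python
from typing import Iterable, Optional, Sequence

# Candidate lists copied from the class attributes documented above.
DEFAULT_SMILES_COLS = ["smiles", "SMILES", "drug", "Drug"]
DEFAULT_LABEL_COLS = ["label_raw", "label", "Label", "y", "Y"]


def first_present(columns: Iterable[str], candidates: Sequence[str]) -> Optional[str]:
    """Hypothetical helper: return the first candidate found in columns, else None."""
    available = set(columns)
    for name in candidates:
        if name in available:  # exact, case-sensitive match
            return name
    return None


header = ["ID", "Drug", "Y"]  # made-up header for illustration
smiles_col = first_present(header, DEFAULT_SMILES_COLS)
label_col = first_present(header, DEFAULT_LABEL_COLS)
print(smiles_col, label_col)
```

Matching is case-sensitive in this sketch, which would explain why the lists carry both 'smiles' and 'SMILES'; whether the loader itself normalizes case is an open assumption.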

class utils.loader.PolarisLoader(cfg)[source]

Bases: BaseLoader

Minimal Polaris loader. Expects cfg = {"type": "polaris", "name": "<vendor/benchmark-id>"}. Returns only {'train', 'test'} with columns: smiles_clean, label_raw, id.

Parameters:

cfg (Dict[str, Any])

get_splits()[source]
Return type:

Dict[str, pandas.DataFrame]
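
As a sketch of the contract described for PolarisLoader, the snippet below writes out the documented cfg shape and checks it before use. The helper check_polaris_cfg and the constant names are hypothetical; the loader may validate its config differently, or not at all.

```python
from typing import Any, Dict

# Keys the docstring says the config must carry.
REQUIRED_KEYS = ("type", "name")
# Splits and columns the docstring says get_splits() returns.
EXPECTED_SPLITS = ("train", "test")
EXPECTED_COLUMNS = ("smiles_clean", "label_raw", "id")


def check_polaris_cfg(cfg: Dict[str, Any]) -> None:
    """Hypothetical helper: raise ValueError if cfg does not match the documented shape."""
    missing = [key for key in REQUIRED_KEYS if key not in cfg]
    if missing:
        raise ValueError(f"missing config keys: {missing}")
    if cfg["type"] != "polaris":
        raise ValueError(f"expected type 'polaris', got {cfg['type']!r}")


# The name is the placeholder from the docstring, not a real benchmark id.
cfg = {"type": "polaris", "name": "<vendor/benchmark-id>"}
check_polaris_cfg(cfg)
```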

class utils.loader.DTILoader(cfg)[source]

Bases: TabularLoader

DTI loader built on TabularLoader with sensible defaults.

Parameters:

cfg (Dict[str, Any])

Cleaning (utils.cleaner)

class utils.cleaner.REOS(active_rules=None)[source]

Bases: object

REOS (Rapid Elimination Of Swill), adapted here to carry additional information about each rule.

Walters, Ajay, Murcko, "Recognizing molecules with drug-like properties"

Curr. Opin. Chem. Biol., 3 (1999), 384-387

https://doi.org/10.1016/S1367-5931(99)80058-1

Parameters:

active_rules (List[str] | None)

set_output_smarts(output_smarts)[source]

Determine whether SMARTS are returned.

Parameters:

output_smarts – True or False

Returns:

None

parse_smarts()[source]

Parse the SMARTS strings in the rules file to molecule objects and check for validity

Returns:

True if all SMARTS are parsed, False otherwise

read_rules(rules_file, active_rules=None)[source]

Read a rules file

Parameters:
  • rules_file – name of the rules file

  • active_rules – list of active rule sets, all rule sets are used if this is None

Returns:

None

set_active_rule_sets(active_rules=None)[source]

Set the active rule set(s)

Parameters:

active_rules – list of active rule sets

Returns:

None

set_min_priority(min_priority)[source]

Set the minimum priority for rules to be included in the active rule set.

Parameters:

min_priority (int) – The minimum priority for rules to be included.

Returns:

None

Return type:

None

get_available_rule_sets()[source]

Get the available rule sets in rule_df

Returns:

a list of available rule sets

get_active_rule_sets()[source]

Get the active rule sets in active_rule_df

Returns:

a list of active rule sets

drop_rule(description)[source]

Drops a rule from the active rule set based on its description.

Parameters:

description (str) – The description of the rule to be dropped.

Returns:

None

Return type:

None

get_rule_file_location()[source]

Get the path to the rules file as a Path

Returns:

Path for rules file

process_mol(mol, detailed=False)[source]

Match a molecule against the active rule set.

Parameters:
  • mol – input RDKit molecule

  • detailed (bool) – if True, returns additional info regarding all failed rules. If False (default), returns only the first failed rule or "ok".

Returns:

  • If detailed is False:

    returns a tuple (rule_set_name, description), or (rule_set_name, description, smarts) if output_smarts is True. If no rule fails, returns ("ok", "ok"), or ("ok", "ok", "ok") if output_smarts is True.

  • If detailed is True:
    returns a flattened tuple:
    • If self.output_smarts is False: (rule_set_name, description, num_failed, failed_rules)

    • If self.output_smarts is True: (rule_set_name, description, smarts, num_failed, failed_rules)
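
Because the tuple shape depends on both detailed and output_smarts, callers may find it easier to label the fields once. The sketch below does that with a hypothetical helper, unpack_reos_result, which is not part of utils.cleaner. It assumes that a passing molecule in detailed mode uses the same field layout as a failing one, which the description above does not state explicitly.

```python
from typing import Any, Dict, List, Tuple


def unpack_reos_result(
    result: Tuple[Any, ...], output_smarts: bool = False, detailed: bool = False
) -> Dict[str, Any]:
    """Hypothetical helper: label the fields of a process_mol result tuple."""
    keys: List[str] = ["rule_set_name", "description"]
    if output_smarts:
        keys.append("smarts")
    if detailed:
        keys += ["num_failed", "failed_rules"]
    if len(result) != len(keys):
        raise ValueError(f"expected {len(keys)} fields, got {len(result)}")
    record = dict(zip(keys, result))
    # Per the docs, a molecule that fails no rule is reported as "ok".
    record["passed"] = record["rule_set_name"] == "ok"
    return record


# Placeholder values for illustration, not real rule names.
print(unpack_reos_result(("ok", "ok")))
print(
    unpack_reos_result(
        ("example_set", "example rule", 2, ["example rule", "other rule"]),
        detailed=True,
    )
)
```

process_smiles returns None for unparseable SMILES, so a caller would check for None before passing its result to a helper like this.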

process_smiles(smiles, detailed=False)[source]

Convert SMILES to an RDKit molecule and call process_mol.

Parameters:
  • smiles – input SMILES string

  • detailed (bool) – if True, returns additional detailed info from process_mol.

Returns:

the result from process_mol, or None if the SMILES cannot be parsed.

pandas_smiles(smiles_list, detailed=False)[source]

Process a list of SMILES strings and return a DataFrame with the results.

Parameters:
  • smiles_list (List[str]) – list of SMILES strings

  • detailed (bool) – if True, the DataFrame includes two extra columns: 'num_failed' (the number of violated rules) and 'failed_rules' (the list of rules that failed).

Returns:

pandas DataFrame with the results.

Return type:

pandas.DataFrame

class utils.cleaner.SMILESCleaner(smiles)[source]

Bases: object

Parameters:

smiles (List[str])

canonicalize(smiles)[source]
Parameters:

smiles (str)

Return type:

str

canonicalize_all()[source]
Return type:

None

deduplicate_all()[source]
Return type:

None

static smiles_to_mol(smiles)[source]
Parameters:

smiles (str)

Return type:

rdkit.Chem.rdchem.Mol

static mol_to_molblock(mol)[source]
Parameters:

mol (rdkit.Chem.rdchem.Mol)

Return type:

rdkit.Chem.rdchem.Mol

static molblock_to_mol(molblock)[source]
Parameters:

molblock (rdkit.Chem.rdchem.Mol)

Return type:

rdkit.Chem.rdchem.Mol

static mol_to_smiles(mol)[source]
Parameters:

mol (rdkit.Chem.rdchem.Mol)

Return type:

str

static mol_to_inchi(mol)[source]
Parameters:

mol (rdkit.Chem.rdchem.Mol)

Return type:

str

static standardize(molblock)[source]
Parameters:

molblock (rdkit.Chem.rdchem.Mol)

Return type:

rdkit.Chem.rdchem.Mol

static standardize_mol(mol)[source]

Adapted from https://www.blopig.com/blog/2022/05/molecular-standardization/

Standardize the RDKit molecule, select its parent molecule, uncharge it, then enumerate all the tautomers. If verbose is true, an explanation of the steps and structures of the molecule as it is standardized will be output.

Parameters:

mol (rdkit.Chem.rdchem.Mol)

Return type:

rdkit.Chem.rdchem.Mol

static structure_check(molblock)[source]

The checker assesses the quality of a structure. It highlights specific features or issues in the structure that may need to be revised. Together with a description of each issue, the checker returns a penalty score from 0 to 9 that reflects the seriousness of the issue (the higher the score, the more critical the issue).

Parameters:

molblock (rdkit.Chem.rdchem.Mol)

Return type:

int

standardize_all()[source]
Return type:

None

reos_filter_all()[source]
Return type:

None

get_valid()[source]
Return type:

pandas.DataFrame

get_data()[source]
Return type:

pandas.DataFrame

Split helpers (utils.splitting)

utils.splitting.random_split_indices(n_items, frac_train, frac_valid, frac_test, seed=123)[source]
Parameters:
  • n_items (int)

  • frac_train (float)

  • frac_valid (float)

  • frac_test (float)

  • seed (int | None)

Return type:

Tuple[List[int], List[int], List[int]]
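
A seeded random split with this signature can be sketched as follows. This is an illustration under stated assumptions, not the implementation in utils.splitting: it assumes the fractions must sum to 1, that train and valid sizes are rounded down, and that the test split takes the remainder.

```python
import random
from typing import List, Optional, Tuple


def random_split_sketch(
    n_items: int,
    frac_train: float,
    frac_valid: float,
    frac_test: float,
    seed: Optional[int] = 123,
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle range(n_items) with a private RNG and cut it into three index lists."""
    if abs(frac_train + frac_valid + frac_test - 1.0) > 1e-8:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_items))
    # A dedicated Random instance keeps the global RNG state untouched.
    random.Random(seed).shuffle(indices)
    n_train = int(frac_train * n_items)  # rounded down (assumption)
    n_valid = int(frac_valid * n_items)  # rounded down (assumption)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # remainder goes to test (assumption)
    return train, valid, test


train, valid, test = random_split_sketch(10, 0.8, 0.1, 0.1, seed=123)
print(len(train), len(valid), len(test))
```

With a fixed integer seed the same call yields the same partition; passing seed=None seeds the generator from system entropy in this sketch.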

utils.splitting.scaffold_split_indices(smiles_list, frac_train, frac_valid, frac_test, seed=123)[source]

DeepChem-style scaffold split: group by Bemis-Murcko scaffold, then size-sort.

Parameters:
  • smiles_list (Iterable[str])

  • frac_train (float)

  • frac_valid (float)

  • frac_test (float)

  • seed (int | None)

Return type:

Tuple[List[int], List[int], List[int]]
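
The grouping and size-sorting step can be sketched without RDKit by taking precomputed scaffold keys as input. In the real function the keys would be Bemis-Murcko scaffolds derived from smiles_list; here they are plain strings, and the assignment rule follows the usual DeepChem-style cutoffs. This is an assumption about the behavior, not a copy of utils.splitting, and the role of seed is not covered.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def scaffold_split_sketch(
    scaffolds: Sequence[str], frac_train: float, frac_valid: float, frac_test: float
) -> Tuple[List[int], List[int], List[int]]:
    """Assign whole scaffold groups to train/valid/test, largest groups first."""
    if abs(frac_train + frac_valid + frac_test - 1.0) > 1e-8:
        raise ValueError("fractions must sum to 1")
    groups: Dict[str, List[int]] = defaultdict(list)
    for idx, key in enumerate(scaffolds):
        groups[key].append(idx)
    # Largest group first; ties broken by first index so the order is deterministic.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cutoff = frac_train * n
    valid_cutoff = (frac_train + frac_valid) * n
    train: List[int] = []
    valid: List[int] = []
    test: List[int] = []
    for group in ordered:
        if len(train) + len(group) > train_cutoff:
            if len(train) + len(valid) + len(group) > valid_cutoff:
                test.extend(group)
            else:
                valid.extend(group)
        else:
            train.extend(group)
    return train, valid, test


# Made-up scaffold keys: group sizes are A=4, B=3, C=2, D=1.
keys = ["A", "A", "A", "A", "B", "B", "B", "C", "C", "D"]
train, valid, test = scaffold_split_sketch(keys, 0.5, 0.25, 0.25)
print(train, valid, test)
```

No scaffold group is ever divided between splits, so the realized split sizes only approximate the requested fractions.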

utils.splitting.split_indices(smiles_list, method, fracs, seed=123)[source]
Parameters:
  • smiles_list (Iterable[str])

  • method (str)

  • fracs

  • seed (int | None)

Return type:

Tuple[List[int], List[int], List[int]]
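
split_indices appears to dispatch on method to one of the helpers above. The accepted method strings and the shape of fracs are not documented here, so the sketch below only illustrates the dispatch pattern: it assumes fracs is a (train, valid, test) triple and uses a stand-in splitter registered under a made-up name, "contiguous". All names in the block are hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Split = Tuple[List[int], List[int], List[int]]


def contiguous_split(
    items: Sequence[str], frac_train: float, frac_valid: float, frac_test: float
) -> Split:
    """Stand-in splitter for the demo: slices indices in order, without shuffling."""
    n = len(items)
    n_train = int(frac_train * n)
    n_valid = int(frac_valid * n)
    idx = list(range(n))
    return idx[:n_train], idx[n_train:n_train + n_valid], idx[n_train + n_valid:]


# Registry mapping a method name to a splitter (names are assumptions).
SPLITTERS: Dict[str, Callable[..., Split]] = {"contiguous": contiguous_split}


def split_indices_sketch(
    items: Sequence[str], method: str, fracs: Sequence[float]
) -> Split:
    """Validate fracs, then dispatch to the splitter registered under method."""
    if len(fracs) != 3:
        raise ValueError("fracs must hold three values: train, valid, test")
    if abs(sum(fracs) - 1.0) > 1e-8:
        raise ValueError("fracs must sum to 1")
    if method not in SPLITTERS:
        raise ValueError(f"unknown method {method!r}; expected one of {sorted(SPLITTERS)}")
    return SPLITTERS[method](items, *fracs)


print(split_indices_sketch(["a", "b", "c", "d"], "contiguous", (0.5, 0.25, 0.25)))
```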