Loaders and Data Preparation
Loaders (utils.loader)
- class utils.loader.TDCLoader(cfg)
Bases: BaseLoader
- Parameters:
cfg (Dict[str, Any])
- class utils.loader.TabularLoader(cfg)
Bases: BaseLoader
- Parameters:
cfg (Dict[str, Any])
- DEFAULT_SMILES_COLS = ['smiles', 'SMILES', 'drug', 'Drug']
- DEFAULT_LABEL_COLS = ['label_raw', 'label', 'Label', 'y', 'Y']
- DEFAULT_ID_COLS = ['id', 'ID', 'compound_id', 'compoundID']
- DEFAULT_SEQUENCE_COLS = ['sequence_aa', 'sequence', 'Sequence', 'protein_sequence', 'ProteinSequence', 'target_sequence', 'TargetSequence', 'AASequence']
- DEFAULT_TARGET_ID_COLS = ['target_id', 'target', 'TargetID', 'protein_id', 'ProteinID']
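The DEFAULT_*_COLS lists read as ordered candidate names for each logical column. How TabularLoader resolves them is not documented on this page, so the sketch below only assumes a first-match-wins lookup in list order. `resolve_column` is a hypothetical helper for illustration, not part of utils.loader.

```python
from typing import Iterable, List, Optional

# Candidate names copied from TabularLoader.DEFAULT_SMILES_COLS above.
DEFAULT_SMILES_COLS: List[str] = ["smiles", "SMILES", "drug", "Drug"]


def resolve_column(columns: Iterable[str], candidates: List[str]) -> Optional[str]:
    """Return the first candidate name present in `columns`, or None.

    Hypothetical helper: assumes candidates are tried in list order,
    which may differ from what TabularLoader actually does.
    """
    available = set(columns)
    for name in candidates:
        if name in available:
            return name
    return None


# Header row of an imaginary CSV file.
header = ["compound_id", "Drug", "Y"]
smiles_col = resolve_column(header, DEFAULT_SMILES_COLS)
print(smiles_col)
```

Under this assumption, a file that contains both "smiles" and "Drug" would resolve to "smiles", because it appears earlier in the candidate list.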
- class utils.loader.PolarisLoader(cfg)
Bases: BaseLoader
Minimal Polaris loader. Expects cfg = {"type": "polaris", "name": "<vendor/benchmark-id>"}. Returns only the {'train', 'test'} splits, with columns: smiles_clean, label_raw, id.
- Parameters:
cfg (Dict[str, Any])
- class utils.loader.DTILoader(cfg)
Bases: TabularLoader
DTI loader built on TabularLoader with sensible defaults.
- Parameters:
cfg (Dict[str, Any])
Cleaning (utils.cleaner)
- class utils.cleaner.REOS(active_rules=None)
Bases: object
REOS - Rapid Elimination Of Swill. Adapted to fit our needs with more rule info.
Walters, Ajay, Murcko, "Recognizing molecules with druglike properties"
Curr. Opin. Chem. Bio., 3 (1999), 384-387
https://doi.org/10.1016/S1367-5931(99)80058-1
- Parameters:
active_rules (List[str] | None)
- set_output_smarts(output_smarts)
Determine whether SMARTS are returned.
- Parameters:
output_smarts – True or False
- Returns:
None
- parse_smarts()
Parse the SMARTS strings in the rules file into molecule objects and check them for validity.
- Returns:
True if all SMARTS are parsed, False otherwise
- read_rules(rules_file, active_rules=None)
Read a rules file.
- Parameters:
rules_file – name of the rules file
active_rules – list of active rule sets; all rule sets are used if this is None
- Returns:
None
- set_active_rule_sets(active_rules=None)
Set the active rule set(s).
- Parameters:
active_rules – list of active rule sets
- Returns:
None
- set_min_priority(min_priority)
Set the minimum priority for rules to be included in the active rule set.
- Parameters:
min_priority (int) – The minimum priority for rules to be included.
- Return type:
None
- get_available_rule_sets()
Get the available rule sets in rule_df.
- Returns:
a list of available rule sets
- get_active_rule_sets()
Get the active rule sets in active_rule_df.
- Returns:
a list of active rule sets
- drop_rule(description)
Drop a rule from the active rule set based on its description.
- Parameters:
description (str) – The description of the rule to be dropped.
- Return type:
None
- get_rule_file_location()
Get the location of the rules file.
- Returns:
Path for rules file
- process_mol(mol, detailed=False)
Match a molecule against the active rule set.
- Parameters:
mol – input RDKit molecule
detailed (bool) – if True, returns additional information about all failed rules. If False (default), returns only the first failed rule or "ok".
- Returns:
- If detailed is False:
a tuple (rule_set_name, description), or (rule_set_name, description, smarts) if output_smarts is True. If no rule fails, the tuple is ("ok", "ok") or ("ok", "ok", "ok") respectively.
- If detailed is True:
a flattened tuple:
If self.output_smarts is False: (rule_set_name, description, num_failed, failed_rules)
If self.output_smarts is True: (rule_set_name, description, smarts, num_failed, failed_rules)
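Because the tuple length depends on both output_smarts and detailed, callers may find it easier to map the result onto named fields. The sketch below does that using only the four shapes listed above. `unpack_reos_result` is a hypothetical helper, not part of utils.cleaner, and the failing tuple is a made-up placeholder rather than real REOS output.

```python
from typing import Any, Dict, Tuple


def unpack_reos_result(
    result: Tuple[Any, ...], output_smarts: bool, detailed: bool
) -> Dict[str, Any]:
    """Map a process_mol-style tuple onto named fields (hypothetical helper)."""
    keys = ["rule_set_name", "description"]
    if output_smarts:
        keys.append("smarts")
    if detailed:
        keys += ["num_failed", "failed_rules"]
    if len(result) != len(keys):
        raise ValueError(f"expected {len(keys)} fields, got {len(result)}")
    return dict(zip(keys, result))


# A passing molecule with output_smarts=False and detailed=False.
ok_fields = unpack_reos_result(("ok", "ok"), output_smarts=False, detailed=False)

# Placeholder values for a failing molecule with detailed=True.
# The shape of failed_rules is an assumption; this page does not specify it.
fail_fields = unpack_reos_result(
    ("example_rule_set", "example rule", 2, ["example rule", "another rule"]),
    output_smarts=False,
    detailed=True,
)
print(ok_fields["rule_set_name"], fail_fields["num_failed"])
```

Raising on a length mismatch catches the case where the flags passed to the helper do not match the flags that were set on the REOS object.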
- process_smiles(smiles, detailed=False)
Convert SMILES to an RDKit molecule and call process_mol.
- Parameters:
smiles – input SMILES string
detailed (bool) – if True, returns the additional detailed information from process_mol.
- Returns:
the result from process_mol, or None if the SMILES cannot be parsed.
- pandas_smiles(smiles_list, detailed=False)
Process a list of SMILES strings and return a DataFrame with the results.
- Parameters:
smiles_list (List[str]) – list of SMILES strings
detailed (bool) – if True, the DataFrame includes two extra columns: 'num_failed' (the number of violated rules) and 'failed_rules' (the list of rules that failed).
- Returns:
pandas DataFrame with the results.
- Return type:
pandas.DataFrame
- class utils.cleaner.SMILESCleaner(smiles)
Bases: object
- Parameters:
smiles (List[str])
- static mol_to_molblock(mol)
- Parameters:
mol (rdkit.Chem.rdchem.Mol)
- Return type:
rdkit.Chem.rdchem.Mol
- static molblock_to_mol(molblock)
- Parameters:
molblock (rdkit.Chem.rdchem.Mol)
- Return type:
rdkit.Chem.rdchem.Mol
- static standardize(molblock)
- Parameters:
molblock (rdkit.Chem.rdchem.Mol)
- Return type:
rdkit.Chem.rdchem.Mol
- static standardize_mol(mol)
Adapted from https://www.blopig.com/blog/2022/05/molecular-standardization/
Standardize the RDKit molecule, select its parent molecule, uncharge it, then enumerate all the tautomers. If verbose is true, an explanation of the steps and structures of the molecule as it is standardized will be output.
- Parameters:
mol (rdkit.Chem.rdchem.Mol)
- Return type:
rdkit.Chem.rdchem.Mol
- static structure_check(molblock)
The checker assesses the quality of a structure. It highlights specific features or issues in the structure that may need to be revised. Together with a description of each issue, the checker returns a penalty score from 0 to 9 that reflects the seriousness of the issue: the higher the score, the more critical the issue.
- Parameters:
molblock (rdkit.Chem.rdchem.Mol)
- Return type:
int
Split helpers (utils.splitting)
- utils.splitting.random_split_indices(n_items, frac_train, frac_valid, frac_test, seed=123)
- Parameters:
n_items (int)
frac_train (float)
frac_valid (float)
frac_test (float)
seed (int | None)
- Return type:
Tuple[List[int], List[int], List[int]]
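The signature suggests a seeded shuffle of range(n_items) cut into three consecutive slices. The sketch below is one plausible reading that uses only the standard library. `random_split_sketch` is a hypothetical stand-in, and the rounding and validation rules are assumptions rather than the behavior of utils.splitting.

```python
import random
from typing import List, Optional, Tuple


def random_split_sketch(
    n_items: int,
    frac_train: float,
    frac_valid: float,
    frac_test: float,
    seed: Optional[int] = 123,
) -> Tuple[List[int], List[int], List[int]]:
    """Shuffle indices with a fixed seed and slice them into three parts."""
    # Assumption: the three fractions are expected to sum to 1.
    if abs(frac_train + frac_valid + frac_test - 1.0) > 1e-8:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_items))
    # A private Random instance keeps the global random state untouched.
    random.Random(seed).shuffle(indices)
    # Assumption: sizes are truncated, and the remainder goes to test.
    n_train = int(frac_train * n_items)
    n_valid = int(frac_valid * n_items)
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]
    return train, valid, test


train_idx, valid_idx, test_idx = random_split_sketch(8, 0.5, 0.25, 0.25, seed=123)
print(len(train_idx), len(valid_idx), len(test_idx))
```

In this sketch, the same seed always yields the same partition, while seed=None draws a fresh shuffle on every call.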
- utils.splitting.scaffold_split_indices(smiles_list, frac_train, frac_valid, frac_test, seed=123)
DeepChem-style scaffold split: group molecules by Bemis-Murcko scaffold, then sort the groups by size.
- Parameters:
smiles_list (Iterable[str])
frac_train (float)
frac_valid (float)
frac_test (float)
seed (int | None)
- Return type:
Tuple[List[int], List[int], List[int]]
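The grouping and size-sorting step can be sketched without RDKit by starting from precomputed scaffold strings. The assignment rule below follows the greedy scheme used by DeepChem's ScaffoldSplitter (largest groups first, filling train, then valid, then test). Whether utils.splitting matches it in every detail is an assumption. `scaffold_split_sketch` is a hypothetical name, and in practice the scaffold strings would come from something like RDKit's `MurckoScaffold.MurckoScaffoldSmiles`.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def scaffold_split_sketch(
    scaffolds: Sequence[str],
    frac_train: float,
    frac_valid: float,
    frac_test: float,
) -> Tuple[List[int], List[int], List[int]]:
    """Assign whole scaffold groups to train, valid, or test, largest first."""
    n_items = len(scaffolds)
    # Group molecule indices by their scaffold string.
    groups: Dict[str, List[int]] = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    # Largest groups first; ties are broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))
    train_cutoff = frac_train * n_items
    valid_cutoff = (frac_train + frac_valid) * n_items
    train: List[int] = []
    valid: List[int] = []
    test: List[int] = []
    # frac_test is implied by the other two: leftover groups land in test.
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cutoff:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Placeholder scaffold labels standing in for Bemis-Murcko scaffold SMILES.
labels = ["A", "B", "A", "C", "A", "B", "A", "D"]
train_idx, valid_idx, test_idx = scaffold_split_sketch(labels, 0.5, 0.25, 0.25)
print(train_idx, valid_idx, test_idx)
```

Because groups are never divided, no scaffold appears in more than one split, and the realized split sizes can deviate from the requested fractions when groups are large.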