Rank-Fragility Analysis

The utils.rank_fragility package evaluates whether molecular leaderboard rankings remain stable when the audited composition of a test panel changes.

Configuration (utils.rank_fragility.config)

Configuration dataclasses for rank-fragility analysis.

class utils.rank_fragility.config.AuditConfig(id_col='molecule_id', smiles_col='smiles', label_col='y', split_col='split', task='classification', near_leak_thresholds=(0.85, 0.9), primary_near_leak_threshold=0.85, regression_conflict_threshold=1.0, regression_conflict_threshold_sensitivity=None, random_seed=13)

Bases: object

Column names and thresholds used to annotate audited molecules.

Parameters:
  • id_col (str)

  • smiles_col (str)

  • label_col (str)

  • split_col (str)

  • task (Literal['classification', 'regression'])

  • near_leak_thresholds (tuple[float, ...])

  • primary_near_leak_threshold (float)

  • regression_conflict_threshold (float)

  • regression_conflict_threshold_sensitivity (float | None)

  • random_seed (int)

id_col: str = 'molecule_id'
smiles_col: str = 'smiles'
label_col: str = 'y'
split_col: str = 'split'
task: Literal['classification', 'regression'] = 'classification'
near_leak_thresholds: tuple[float, ...] = (0.85, 0.9)
primary_near_leak_threshold: float = 0.85
regression_conflict_threshold: float = 1.0
regression_conflict_threshold_sensitivity: float | None = None
random_seed: int = 13

class utils.rank_fragility.config.PanelConfig(id_col='molecule_id', label_col='y', task='classification', panel_size='auto', n_panels=1000, target_rates=(0.0, 0.05, 0.1, 0.25, 'observed', 0.5, 0.75), random_seed=13, output_dir=PosixPath('runs/rank_fragility'))

Bases: object

Sampling controls for generated counterfactual evaluation panels.

Parameters:
  • id_col (str)

  • label_col (str)

  • task (Literal['classification', 'regression'])

  • panel_size (int | str)

  • n_panels (int)

  • target_rates (tuple[float | str, ...])

  • random_seed (int)

  • output_dir (Path | str)
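The target_rates default mixes numeric rates with the string 'observed'. The sketch below shows one way such a tuple could be resolved to numbers. It assumes that 'observed' stands for the rate measured on the original test set; the source does not state this, and resolve_target_rates and observed_rate are hypothetical names, not part of the package.

```python
from __future__ import annotations

from typing import Sequence, Union

Rate = Union[float, str]


def resolve_target_rates(target_rates: Sequence[Rate], observed_rate: float) -> list[float]:
    """Replace the 'observed' placeholder with a number and validate every rate.

    Assumption: 'observed' means the rate seen in the original test set.
    """
    resolved = []
    for rate in target_rates:
        if rate == "observed":
            value = float(observed_rate)
        elif isinstance(rate, str):
            raise ValueError(f"unsupported target rate: {rate!r}")
        else:
            value = float(rate)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"target rate out of [0, 1]: {value}")
        resolved.append(value)
    return resolved


# The documented default tuple, with a made-up observed rate of 0.32.
rates = resolve_target_rates(
    (0.0, 0.05, 0.1, 0.25, "observed", 0.5, 0.75), observed_rate=0.32
)
```

Keeping the placeholder in its documented position preserves the ordering of the rate grid, so the observed composition can be compared directly against its neighbours.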

id_col: str = 'molecule_id'
label_col: str = 'y'
task: Literal['classification', 'regression'] = 'classification'
panel_size: int | str = 'auto'
n_panels: int = 1000
target_rates: tuple[float | str, ...] = (0.0, 0.05, 0.1, 0.25, 'observed', 0.5, 0.75)
random_seed: int = 13
output_dir: Path | str = PosixPath('runs/rank_fragility')

class utils.rank_fragility.config.MetricConfig(task='classification', metric='auroc', baseline_model='ecfp_rf', sota_model='auto')

Bases: object

Metric and model-selection settings for leaderboard comparisons.

Parameters:
  • task (Literal['classification', 'regression'])

  • metric (str)

  • baseline_model (str)

  • sota_model (str)

task: Literal['classification', 'regression'] = 'classification'
metric: str = 'auroc'
baseline_model: str = 'ecfp_rf'
sota_model: str = 'auto'

class utils.rank_fragility.config.RunConfig(data, pred_dir, id_col='molecule_id', smiles_col='smiles', label_col='y', split_col='split', task='classification', metric='auroc', near_leak_thresholds=(0.85, 0.9), primary_near_leak_threshold=0.85, regression_conflict_threshold=1.0, regression_conflict_threshold_sensitivity=None, random_seed=13, panel_size='auto', n_panels=1000, target_rates=<factory>, baseline_model='ecfp_rf', sota_model='auto', output_dir=PosixPath('runs/rank_fragility'))

Bases: object

Complete input, audit, panel, and output settings for one analysis run.

Parameters:
  • data (Path)

  • pred_dir (Path)

  • id_col (str)

  • smiles_col (str)

  • label_col (str)

  • split_col (str)

  • task (Literal['classification', 'regression'])

  • metric (str)

  • near_leak_thresholds (tuple[float, ...])

  • primary_near_leak_threshold (float)

  • regression_conflict_threshold (float)

  • regression_conflict_threshold_sensitivity (float | None)

  • random_seed (int)

  • panel_size (int | str)

  • n_panels (int)

  • target_rates (tuple[float | str, ...])

  • baseline_model (str)

  • sota_model (str)

  • output_dir (Path)

data: Path
pred_dir: Path
id_col: str = 'molecule_id'
smiles_col: str = 'smiles'
label_col: str = 'y'
split_col: str = 'split'
task: Literal['classification', 'regression'] = 'classification'
metric: str = 'auroc'
near_leak_thresholds: tuple[float, ...] = (0.85, 0.9)
primary_near_leak_threshold: float = 0.85
regression_conflict_threshold: float = 1.0
regression_conflict_threshold_sensitivity: float | None = None
random_seed: int = 13
panel_size: int | str = 'auto'
n_panels: int = 1000
target_rates: tuple[float | str, ...]
baseline_model: str = 'ecfp_rf'
sota_model: str = 'auto'
output_dir: Path = PosixPath('runs/rank_fragility')

audit_config()

Return the audit-specific subset of this run configuration.

Return type:

AuditConfig

panel_config()

Return the panel-sampling subset of this run configuration.

Return type:

PanelConfig

metric_config()

Return the metric/model subset of this run configuration.

Return type:

MetricConfig
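The three accessors above each return a narrower configuration built from the run configuration. A minimal sketch of that projection pattern follows, using local stand-in dataclasses (MetricConfigSketch, RunConfigSketch) that copy only a few of the documented fields. The way the real RunConfig builds its subsets is not shown in this reference, so the field-name matching below is an assumption.

```python
from dataclasses import dataclass, fields
from pathlib import Path


@dataclass
class MetricConfigSketch:
    """Stand-in mirroring the documented MetricConfig fields."""
    task: str = "classification"
    metric: str = "auroc"
    baseline_model: str = "ecfp_rf"
    sota_model: str = "auto"


@dataclass
class RunConfigSketch:
    """Stand-in carrying a subset of the documented RunConfig fields."""
    data: Path
    pred_dir: Path
    task: str = "classification"
    metric: str = "auroc"
    baseline_model: str = "ecfp_rf"
    sota_model: str = "auto"

    def metric_config(self) -> MetricConfigSketch:
        # Copy every field that the narrower config declares (assumed approach).
        names = [f.name for f in fields(MetricConfigSketch)]
        return MetricConfigSketch(**{name: getattr(self, name) for name in names})


# File names and the "rmse" metric name are illustrative only.
run = RunConfigSketch(
    data=Path("data.csv"), pred_dir=Path("preds"), task="regression", metric="rmse"
)
metric_cfg = run.metric_config()
```

Projecting by field name keeps the sub-configs in step with the run configuration when a shared default changes.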

Audit and predictions

Molecular audit annotations for rank-fragility analysis.

utils.rank_fragility.audit.audit_dataset(df, config)

Annotate dataset rows with chemistry and train-test audit flags.

Return type:

pandas.DataFrame

utils.rank_fragility.audit.summarize_audit(audited_df)

Return a long-form audit summary table.

Parameters:

audited_df (pandas.DataFrame)

Return type:

pandas.DataFrame

Prediction loading and audit-merge helpers.

utils.rank_fragility.predictions.load_prediction_directory(pred_dir)

Load one prediction CSV per model into long format.

Parameters:

pred_dir (str)

Return type:

pandas.DataFrame
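To illustrate what "one prediction CSV per model into long format" can look like, here is a standard-library sketch. It assumes the model name is the file stem and that each file has molecule_id, y, and y_pred columns. The y_pred column, the 'gnn' model name, and the helper load_prediction_rows are hypothetical, and the sketch returns a list of dicts rather than a pandas.DataFrame.

```python
import csv
import tempfile
from pathlib import Path


def load_prediction_rows(pred_dir):
    """Stack every *.csv in pred_dir into one long list, tagging rows by model."""
    rows = []
    for csv_path in sorted(Path(pred_dir).glob("*.csv")):
        with csv_path.open(newline="") as handle:
            for record in csv.DictReader(handle):
                # Assumption: the file stem is the model name.
                rows.append({"model": csv_path.stem, **record})
    return rows


# Build a throwaway directory with two tiny prediction files.
with tempfile.TemporaryDirectory() as tmp:
    for model, score in [("ecfp_rf", "0.7"), ("gnn", "0.9")]:
        (Path(tmp) / f"{model}.csv").write_text(
            f"molecule_id,y,y_pred\nm1,1,{score}\n"
        )
    long_rows = load_prediction_rows(tmp)
```

In long format each row is one (model, molecule) pair, which makes per-model grouping straightforward in later steps.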

utils.rank_fragility.predictions.merge_predictions_with_audit(pred_df, audited_df, config)

Validate prediction labels and attach audit annotations for test rows.

Parameters:
  • pred_df (pandas.DataFrame)

  • audited_df (pandas.DataFrame)

Return type:

pandas.DataFrame

Panels, metrics, and leaderboards

Counterfactual evaluation-panel sampling utilities.

utils.rank_fragility.panels.generate_counterfactual_panels(audited_test_df, config)

Generate a long-form counterfactual panel manifest.

Parameters:
  • audited_test_df (pandas.DataFrame)

  • config (PanelConfig)

Return type:

pandas.DataFrame
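The sampling procedure itself is not described in this reference. As a rough illustration of composition-controlled panels, the sketch below draws a fixed-size panel whose share of audit-flagged molecules matches a target rate, then flattens several panels into a long-form manifest. Sampling without replacement, the manifest column names, and the helper sample_panel are all assumptions.

```python
import random


def sample_panel(flagged_ids, clean_ids, panel_size, target_rate, rng):
    """Draw one panel with round(panel_size * target_rate) flagged molecules."""
    n_flagged = round(panel_size * target_rate)
    n_clean = panel_size - n_flagged
    if n_flagged > len(flagged_ids) or n_clean > len(clean_ids):
        raise ValueError("not enough molecules to reach the target rate")
    # Assumption: molecules are drawn without replacement within a panel.
    return rng.sample(flagged_ids, n_flagged) + rng.sample(clean_ids, n_clean)


# Toy pools of molecule ids; real ids would come from the audited test set.
flagged = [f"f{i}" for i in range(20)]
clean = [f"c{i}" for i in range(80)]
rng = random.Random(13)  # seeded for reproducibility, like random_seed=13

# Long-form manifest: one row per (panel, molecule).
manifest = [
    {"panel_id": panel_id, "target_rate": rate, "molecule_id": mol}
    for panel_id, rate in enumerate([0.0, 0.25, 0.5])
    for mol in sample_panel(flagged, clean, panel_size=20, target_rate=rate, rng=rng)
]
```

A single seeded generator shared across panels makes the whole manifest reproducible from one integer.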

Metric helpers for rank-fragility leaderboard evaluation.

utils.rank_fragility.metrics.higher_is_better(metric)

Return whether larger values indicate better performance.

Parameters:

metric (str)

Return type:

bool
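A direction lookup of this kind can be sketched as a small table. Only 'auroc' appears in this reference; the 'rmse' and 'mae' entries, the case-insensitive matching, and the function name higher_is_better_sketch are assumptions made for illustration.

```python
# Metric name -> True when larger values are better.
# Only "auroc" is confirmed by the reference; the others are assumed.
HIGHER_IS_BETTER = {"auroc": True, "rmse": False, "mae": False}


def higher_is_better_sketch(metric: str) -> bool:
    """Return the optimisation direction, failing loudly on unknown metrics."""
    try:
        return HIGHER_IS_BETTER[metric.lower()]
    except KeyError:
        raise ValueError(f"unknown metric: {metric!r}") from None
```

Raising on an unknown name avoids silently ranking an error metric in the wrong direction.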

utils.rank_fragility.metrics.compute_metric(y_true, y_pred, task, metric)

Compute one supported classification or regression metric.

Parameters:
  • task (str)

  • metric (str)

Return type:

float
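The set of supported metrics is not listed here beyond the 'auroc' default. The following standard-library sketch shows how a (task, metric) dispatch could work for two metrics: AUROC computed as the fraction of positive/negative pairs ranked correctly (ties count one half), and RMSE. The 'rmse' name and compute_metric_sketch are assumptions, not the package's implementation.

```python
import math


def auroc(y_true, y_score):
    """Pairwise AUROC: P(score of a positive > score of a negative)."""
    positives = [s for y, s in zip(y_true, y_score) if y == 1]
    negatives = [s for y, s in zip(y_true, y_score) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ordering
    return wins / (len(positives) * len(negatives))


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def compute_metric_sketch(y_true, y_pred, task, metric):
    """Dispatch on (task, metric) so mismatched combinations fail loudly."""
    supported = {("classification", "auroc"): auroc, ("regression", "rmse"): rmse}
    try:
        fn = supported[(task, metric)]
    except KeyError:
        raise ValueError(f"unsupported metric {metric!r} for task {task!r}") from None
    return float(fn(y_true, y_pred))


auc = compute_metric_sketch([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8], "classification", "auroc")
err = compute_metric_sketch([1.0, 2.0, 3.0], [1.0, 2.0, 5.0], "regression", "rmse")
```

The pairwise AUROC is quadratic in panel size; it is used here only because it makes the definition explicit.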

utils.rank_fragility.metrics.per_sample_loss(y_true, y_pred, task, loss)

Return per-example loss values for attribution summaries.

Parameters:
  • task (str)

  • loss (str)

Return type:

numpy.ndarray
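Per-example losses are what make attribution by stratum possible, since they can be averaged over any subset of molecules. The sketch below shows two common choices. The loss names 'squared_error' and 'log_loss' and the function per_sample_loss_sketch are assumptions, and it returns a plain list instead of a numpy.ndarray to stay dependency-free.

```python
import math


def per_sample_loss_sketch(y_true, y_pred, task, loss):
    """Return one loss value per example (list of floats)."""
    if task == "regression" and loss == "squared_error":
        return [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    if task == "classification" and loss == "log_loss":
        eps = 1e-12  # clip probabilities so log() never sees 0 or 1
        losses = []
        for t, p in zip(y_true, y_pred):
            p = min(max(p, eps), 1.0 - eps)
            losses.append(-(t * math.log(p) + (1 - t) * math.log(1.0 - p)))
        return losses
    raise ValueError(f"unsupported loss {loss!r} for task {task!r}")


sq = per_sample_loss_sketch([1.0, 2.0], [1.5, 4.0], "regression", "squared_error")
ll = per_sample_loss_sketch([1, 0], [0.5, 0.5], "classification", "log_loss")
```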

Leaderboard scoring and ranking helpers.

utils.rank_fragility.leaderboard.evaluate_models(pred_audit_df, subset_ids, task, metric)

Evaluate every model on a molecule-id subset.

Parameters:
  • pred_audit_df (pandas.DataFrame)

  • task (str)

  • metric (str)

Return type:

pandas.DataFrame

utils.rank_fragility.leaderboard.rank_models(scores_df, metric)

Assign average ranks with rank 1 as best.

Parameters:
  • scores_df (pandas.DataFrame)

  • metric (str)

Return type:

pandas.DataFrame
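Average ranking means tied models share the mean of the positions they occupy. A small standard-library sketch of that rule, with rank 1 as best, is given below; the function name average_ranks and the dict-based input are illustrative stand-ins for the DataFrame interface.

```python
def average_ranks(scores, higher_is_better=True):
    """Map model -> rank, where ties receive the mean of their positions."""
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=higher_is_better)
    ranks = {}
    i = 0
    while i < len(ordered):
        # Extend j to cover the whole run of models tied with position i.
        j = i
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        shared = ((i + 1) + (j + 1)) / 2  # mean of 1-based positions i+1..j+1
        for model, _ in ordered[i : j + 1]:
            ranks[model] = shared
        i = j + 1
    return ranks


# Toy scores: "b" and "c" tie for positions 2 and 3.
toy_scores = {"a": 0.9, "b": 0.8, "c": 0.8, "d": 0.7}
ranks_high = average_ranks(toy_scores, higher_is_better=True)
ranks_low = average_ranks(toy_scores, higher_is_better=False)
```

With pandas, the same tie rule corresponds to Series.rank(method='average'), with ascending=False when larger scores are better.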

utils.rank_fragility.leaderboard.original_leaderboard(pred_audit_df, task, metric)

Evaluate and rank models on the full original test set.

Parameters:
  • pred_audit_df (pandas.DataFrame)

  • task (str)

  • metric (str)

Return type:

pandas.DataFrame

Counterfactual outputs

Counterfactual panel evaluation and aggregate stability summaries.

utils.rank_fragility.counterfactual.run_counterfactual_evaluation(pred_audit_df, panel_manifest, task, metric, baseline_model, sota_model)

Evaluate all models on each counterfactual panel and aggregate stability summaries.

Parameters:
  • pred_audit_df (pandas.DataFrame)

  • panel_manifest (pandas.DataFrame)

  • task (str)

  • metric (str)

  • baseline_model (str)

  • sota_model (str)

Return type:

dict[str, pandas.DataFrame]
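The keys of the returned dictionary are not listed in this reference. One aggregate such an evaluation might produce is the probability, at each composition rate, that a model holds the best rank across panels. The sketch below computes that from per-panel ranks; the input layout, the 'gnn' model name, and top_rank_probability are hypothetical.

```python
from collections import defaultdict


def top_rank_probability(panel_ranks):
    """panel_ranks: iterable of (target_rate, {model: rank}), one entry per panel.

    Returns {target_rate: {model: fraction of panels where the model is best}}.
    Models tied for the best rank each count as best in that panel.
    """
    wins = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(int)
    for rate, ranks in panel_ranks:
        totals[rate] += 1
        best = min(ranks.values())
        for model, rank in ranks.items():
            wins[rate][model] += int(rank == best)
    return {
        rate: {model: n / totals[rate] for model, n in wins[rate].items()}
        for rate in totals
    }


# Toy per-panel ranks: two panels at each of two composition rates.
toy_panels = [
    (0.0, {"gnn": 1.0, "ecfp_rf": 2.0}),
    (0.0, {"gnn": 2.0, "ecfp_rf": 1.0}),
    (0.5, {"gnn": 1.0, "ecfp_rf": 2.0}),
    (0.5, {"gnn": 1.0, "ecfp_rf": 2.0}),
]
probs = top_rank_probability(toy_panels)
```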

Summary helpers for composition-driven leaderboard fragility.

utils.rank_fragility.fragility.compute_fragility_summary(rank_probabilities, sota_margin_by_composition, sota_model)

Summarize composition rates where the SOTA conclusion becomes fragile.

Parameters:
  • rank_probabilities (pandas.DataFrame)

  • sota_margin_by_composition (pandas.DataFrame)

  • sota_model (str)

Return type:

pandas.DataFrame
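The criterion for "fragile" is not defined in this reference. As one plausible reading, the sketch below reports the smallest composition rate at which the SOTA model's probability of ranking first drops below a threshold. The 0.5 threshold, the input mapping, and first_fragile_rate are assumptions, and the numbers are toy values.

```python
def first_fragile_rate(top_probability_by_rate, threshold=0.5):
    """Return the smallest rate whose top-rank probability is below threshold.

    Returns None when the conclusion holds at every tested rate.
    """
    for rate in sorted(top_probability_by_rate):
        if top_probability_by_rate[rate] < threshold:
            return rate
    return None


# Toy probabilities that the SOTA model ranks first at each composition rate.
toy_probability = {0.0: 0.95, 0.25: 0.7, 0.5: 0.4, 0.75: 0.2}
fragile_at = first_fragile_rate(toy_probability)
stable = first_fragile_rate({0.0: 0.9, 0.5: 0.8})
```

A margin-based variant would apply the same scan to the mean SOTA-minus-baseline margin and look for the first rate where it is no longer positive.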

Advantage decomposition helpers for audited prediction tables.

utils.rank_fragility.attribution.compute_advantage_decomposition(pred_audit_df, sota_model, baseline_model, task, loss)

Compute per-example SOTA advantage and aggregate by chemistry audit strata.

Parameters:
  • pred_audit_df (pandas.DataFrame)

  • sota_model (str)

  • baseline_model (str)

  • task (str)

  • loss (str)

Return type:

tuple[pandas.DataFrame, pandas.DataFrame]
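A per-example advantage can be sketched as the baseline loss minus the SOTA loss, so that positive values favour the SOTA model, followed by a mean per audit stratum. That sign convention, the stratum labels 'near_leak' and 'clean', and advantage_by_stratum are assumptions; the two returned tables mirror the documented two-element return value, using lists and dicts in place of DataFrames.

```python
from collections import defaultdict


def advantage_by_stratum(rows):
    """rows: dicts with 'stratum', 'baseline_loss', and 'sota_loss'.

    Returns (per_example, summary): per-example rows with an added 'advantage'
    column, and the mean advantage plus example count for each stratum.
    """
    # Assumed convention: advantage > 0 means the SOTA model has lower loss.
    per_example = [
        {**row, "advantage": row["baseline_loss"] - row["sota_loss"]} for row in rows
    ]
    sums = defaultdict(float)
    counts = defaultdict(int)
    for row in per_example:
        sums[row["stratum"]] += row["advantage"]
        counts[row["stratum"]] += 1
    summary = {
        stratum: {"mean_advantage": sums[stratum] / counts[stratum], "n": counts[stratum]}
        for stratum in sums
    }
    return per_example, summary


# Toy losses; values are chosen to be exactly representable in binary floats.
toy_rows = [
    {"stratum": "near_leak", "baseline_loss": 0.5, "sota_loss": 0.25},
    {"stratum": "near_leak", "baseline_loss": 1.0, "sota_loss": 0.5},
    {"stratum": "clean", "baseline_loss": 0.5, "sota_loss": 0.5},
    {"stratum": "clean", "baseline_loss": 0.25, "sota_loss": 0.5},
]
per_example, summary = advantage_by_stratum(toy_rows)
```

In this toy data the advantage is concentrated in the 'near_leak' stratum and slightly negative on 'clean' molecules, which is the kind of pattern a decomposition by stratum is meant to expose.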

Command-line driver

Command-line driver for rank-fragility analysis.

utils.rank_fragility.run.build_arg_parser()

Build the command-line parser for single-run and batch analysis.

Return type:

ArgumentParser

utils.rank_fragility.run.run_analysis(config)

Run one rank-fragility analysis and write its CSV outputs.

Parameters:

config (RunConfig)

Return type:

dict[str, pandas.DataFrame]

utils.rank_fragility.run.config_from_args(args)

Convert parsed command-line arguments into a run configuration.

Parameters:

args (Namespace)

Return type:

RunConfig

utils.rank_fragility.run.main(argv=None)

Execute rank-fragility analysis from command-line arguments.

Parameters:

argv (Sequence[str] | None)

Return type:

None