Rank-Fragility Analysis

The `utils.rank_fragility` package evaluates whether molecular leaderboard rankings remain stable when the audited composition of a test panel changes.

Configuration (`utils.rank_fragility.config`)

Configuration dataclasses for rank-fragility analysis.

class utils.rank_fragility.config.AuditConfig(id_col='molecule_id', smiles_col='smiles', label_col='y', split_col='split', task='classification', near_leak_thresholds=(0.85, 0.9), primary_near_leak_threshold=0.85, regression_conflict_threshold=1.0, regression_conflict_threshold_sensitivity=None, random_seed=13)

Bases: object

Column names and thresholds used to annotate audited molecules.

Parameters:

id_col (str)
smiles_col (str)
label_col (str)
split_col (str)
task (Literal['classification', 'regression'])
near_leak_thresholds (tuple[float, ...])
primary_near_leak_threshold (float)
regression_conflict_threshold (float)
regression_conflict_threshold_sensitivity (float | None)
random_seed (int)

id_col: str = 'molecule_id'

smiles_col: str = 'smiles'

label_col: str = 'y'

split_col: str = 'split'

task: Literal['classification', 'regression'] = 'classification'

near_leak_thresholds: tuple[float, ...] = (0.85, 0.9)

primary_near_leak_threshold: float = 0.85

regression_conflict_threshold: float = 1.0

regression_conflict_threshold_sensitivity: float | None = None

random_seed: int = 13

class utils.rank_fragility.config.PanelConfig(id_col='molecule_id', label_col='y', task='classification', panel_size='auto', n_panels=1000, target_rates=(0.0, 0.05, 0.1, 0.25, 'observed', 0.5, 0.75), random_seed=13, output_dir=PosixPath('runs/rank_fragility'))

Bases: object

Sampling controls for generated counterfactual evaluation panels.

Parameters:

id_col (str)
label_col (str)
task (Literal['classification', 'regression'])
panel_size (int | str)
n_panels (int)
target_rates (tuple[float | str, ...])
random_seed (int)
output_dir (Path | str)

id_col: str = 'molecule_id'

label_col: str = 'y'

task: Literal['classification', 'regression'] = 'classification'

panel_size: int | str = 'auto'

n_panels: int = 1000

target_rates: tuple[float | str, ...] = (0.0, 0.05, 0.1, 0.25, 'observed', 0.5, 0.75)

random_seed: int = 13

output_dir: Path | str = PosixPath('runs/rank_fragility')
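The default `target_rates` tuple mixes numeric rates with the string `'observed'`. The sketch below shows one way such a tuple could be turned into plain floats. It assumes, without confirmation from the package, that `'observed'` stands for the rate measured on the original test set; `resolve_target_rates` and `observed_rate` are hypothetical names used only for illustration.

```python
def resolve_target_rates(target_rates, observed_rate):
    """Replace the 'observed' sentinel with a measured rate (assumed meaning)."""
    resolved = []
    for rate in target_rates:
        # Assumption: 'observed' refers to the rate found in the original test set.
        value = observed_rate if rate == "observed" else float(rate)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"target rate {value!r} is outside [0, 1]")
        resolved.append(value)
    return resolved


# Hypothetical observed rate of 0.3, used only for this example.
rates = resolve_target_rates(
    (0.0, 0.05, 0.1, 0.25, "observed", 0.5, 0.75), observed_rate=0.3
)
```

Keeping the sentinel in the configuration and resolving it late means the same configuration can be reused across datasets whose observed rates differ.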

class utils.rank_fragility.config.MetricConfig(task='classification', metric='auroc', baseline_model='ecfp_rf', sota_model='auto')

Bases: object

Metric and model-selection settings for leaderboard comparisons.

Parameters:

task (Literal['classification', 'regression'])
metric (str)
baseline_model (str)
sota_model (str)

task: Literal['classification', 'regression'] = 'classification'

metric: str = 'auroc'

baseline_model: str = 'ecfp_rf'

sota_model: str = 'auto'

class utils.rank_fragility.config.RunConfig(data, pred_dir, id_col='molecule_id', smiles_col='smiles', label_col='y', split_col='split', task='classification', metric='auroc', near_leak_thresholds=(0.85, 0.9), primary_near_leak_threshold=0.85, regression_conflict_threshold=1.0, regression_conflict_threshold_sensitivity=None, random_seed=13, panel_size='auto', n_panels=1000, target_rates=<factory>, baseline_model='ecfp_rf', sota_model='auto', output_dir=PosixPath('runs/rank_fragility'))

Bases: object

Complete input, audit, panel, and output settings for one analysis run.

Parameters:

data (Path)
pred_dir (Path)
id_col (str)
smiles_col (str)
label_col (str)
split_col (str)
task (Literal['classification', 'regression'])
metric (str)
near_leak_thresholds (tuple[float, ...])
primary_near_leak_threshold (float)
regression_conflict_threshold (float)
regression_conflict_threshold_sensitivity (float | None)
random_seed (int)
panel_size (int | str)
n_panels (int)
target_rates (tuple[float | str, ...])
baseline_model (str)
sota_model (str)
output_dir (Path)

data: Path

pred_dir: Path

id_col: str = 'molecule_id'

smiles_col: str = 'smiles'

label_col: str = 'y'

split_col: str = 'split'

task: Literal['classification', 'regression'] = 'classification'

metric: str = 'auroc'

near_leak_thresholds: tuple[float, ...] = (0.85, 0.9)

primary_near_leak_threshold: float = 0.85

regression_conflict_threshold: float = 1.0

regression_conflict_threshold_sensitivity: float | None = None

random_seed: int = 13

panel_size: int | str = 'auto'

n_panels: int = 1000

target_rates: tuple[float | str, ...]

baseline_model: str = 'ecfp_rf'

sota_model: str = 'auto'

output_dir: Path = PosixPath('runs/rank_fragility')

audit_config()

Return the audit-specific subset of this run configuration.

Return type:

AuditConfig

panel_config()

Return the panel-sampling subset of this run configuration.

Return type:

PanelConfig

metric_config()

Return the metric/model subset of this run configuration.

Return type:

MetricConfig
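The three accessor methods each return a narrower configuration built from the run-level fields. The following is a minimal sketch of that projection pattern using stand-in dataclasses with a reduced set of fields. `AuditConfigSketch` and `RunConfigSketch` are hypothetical names, and the field-copying approach is an assumption about how such a method could work, not the package's actual code.

```python
from dataclasses import dataclass, fields
from pathlib import Path


@dataclass
class AuditConfigSketch:
    # Reduced stand-in for AuditConfig (only a few of the documented fields).
    id_col: str = "molecule_id"
    smiles_col: str = "smiles"
    label_col: str = "y"
    split_col: str = "split"
    task: str = "classification"
    random_seed: int = 13


@dataclass
class RunConfigSketch:
    # Reduced stand-in for RunConfig.
    data: Path
    pred_dir: Path
    id_col: str = "molecule_id"
    smiles_col: str = "smiles"
    label_col: str = "y"
    split_col: str = "split"
    task: str = "classification"
    metric: str = "auroc"
    random_seed: int = 13

    def audit_config(self) -> AuditConfigSketch:
        # Copy exactly the fields that the audit config declares.
        names = [f.name for f in fields(AuditConfigSketch)]
        return AuditConfigSketch(**{name: getattr(self, name) for name in names})


run_cfg = RunConfigSketch(
    data=Path("data.csv"), pred_dir=Path("preds"), task="regression"
)
audit_cfg = run_cfg.audit_config()
```

Here `audit_cfg.task` carries over the run-level value `'regression'`, while run-only fields such as `metric` are left out of the audit view.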

Audit and predictions

Molecular audit annotations for rank-fragility analysis.

utils.rank_fragility.audit.audit_dataset(df, config)

Annotate dataset rows with chemistry and train-test audit flags.

Parameters:

df (pandas.DataFrame)
config (AuditConfig)

Return type:

pandas.DataFrame

utils.rank_fragility.audit.summarize_audit(audited_df)

Return a long-form audit summary table.

Parameters:

audited_df (pandas.DataFrame)

Return type:

pandas.DataFrame

Prediction loading and audit-merge helpers.

utils.rank_fragility.predictions.load_prediction_directory(pred_dir)

Load one prediction CSV per model into long format.

Parameters:

pred_dir (str)

Return type:

pandas.DataFrame

utils.rank_fragility.predictions.merge_predictions_with_audit(pred_df, audited_df, config)

Validate prediction labels and attach audit annotations for test rows.

Parameters:

pred_df (pandas.DataFrame)
audited_df (pandas.DataFrame)
config

Return type:

pandas.DataFrame

Panels, metrics, and leaderboards

Counterfactual evaluation-panel sampling utilities.

utils.rank_fragility.panels.generate_counterfactual_panels(audited_test_df, config)

Generate a long-form counterfactual panel manifest.

Parameters:

audited_test_df (pandas.DataFrame)
config (PanelConfig)

Return type:

pandas.DataFrame
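To make the idea of a composition-controlled panel concrete, the sketch below draws one panel in which a requested fraction of molecules carries an audit flag. It is an illustration under stated assumptions: `sample_panel`, the flagged/clean split, and the rounding rule are hypothetical and are not taken from `generate_counterfactual_panels`.

```python
import random


def sample_panel(flagged_ids, clean_ids, panel_size, target_rate, rng):
    """Draw one panel whose flagged fraction is close to target_rate."""
    # Assumption: the flagged count is the rounded product of size and rate.
    n_flagged = round(panel_size * target_rate)
    n_clean = panel_size - n_flagged
    if n_flagged > len(flagged_ids) or n_clean > len(clean_ids):
        raise ValueError("not enough molecules to reach the requested composition")
    # Sample without replacement from each stratum, then combine.
    return rng.sample(flagged_ids, n_flagged) + rng.sample(clean_ids, n_clean)


rng = random.Random(13)  # fixed seed so the draw is reproducible
flagged = [f"f{i}" for i in range(20)]  # hypothetical flagged molecule ids
clean = [f"c{i}" for i in range(80)]  # hypothetical unflagged molecule ids
panel = sample_panel(flagged, clean, panel_size=40, target_rate=0.25, rng=rng)
```

Repeating this draw `n_panels` times for each target rate would give a manifest of panels that differ only in their audited composition.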

Metric helpers for rank-fragility leaderboard evaluation.

utils.rank_fragility.metrics.higher_is_better(metric)

Return whether larger values indicate better performance.

Parameters:

metric (str)

Return type:

bool
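A metric-direction helper of this kind is often a simple lookup. The sketch below is a hypothetical stand-in: apart from `'auroc'`, which appears as a default elsewhere on this page, the metric names are assumptions, and the set of metrics the package actually supports is not documented here.

```python
# Assumed metric names; only 'auroc' is confirmed by the defaults on this page.
HIGHER_IS_BETTER = {"auroc", "auprc", "accuracy", "r2"}
LOWER_IS_BETTER = {"rmse", "mae"}


def higher_is_better_sketch(metric: str) -> bool:
    """Return True when larger metric values mean better performance."""
    name = metric.lower()
    if name in HIGHER_IS_BETTER:
        return True
    if name in LOWER_IS_BETTER:
        return False
    # Failing loudly avoids silently ranking models in the wrong direction.
    raise ValueError(f"unknown metric: {metric!r}")
```

Raising on unknown names is a design choice for the sketch: a wrong direction would invert the whole leaderboard without any visible error.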

utils.rank_fragility.metrics.compute_metric(y_true, y_pred, task, metric)

Compute one supported classification or regression metric.

Parameters:

y_true
y_pred
task (str)
metric (str)

Return type:

float

utils.rank_fragility.metrics.per_sample_loss(y_true, y_pred, task, loss)

Return per-example loss values for attribution summaries.

Parameters:

y_true
y_pred
task (str)
loss (str)

Return type:

numpy.ndarray
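As an illustration of per-example losses, the sketch below uses binary log loss for classification and squared error for regression. These two loss choices are assumptions, the `loss` argument of the documented function is omitted, and the sketch returns plain lists rather than a `numpy.ndarray` so that it needs only the standard library.

```python
import math


def per_sample_loss_sketch(y_true, y_pred, task):
    """Return one loss value per example (assumed loss per task)."""
    if task == "classification":
        eps = 1e-12  # clip probabilities to keep log() finite
        losses = []
        for y, p in zip(y_true, y_pred):
            p = min(max(p, eps), 1.0 - eps)
            # Binary log loss for a label in {0, 1} and a predicted probability.
            losses.append(-(y * math.log(p) + (1 - y) * math.log(1.0 - p)))
        return losses
    if task == "regression":
        # Squared error per example.
        return [(y - p) ** 2 for y, p in zip(y_true, y_pred)]
    raise ValueError(f"unknown task: {task!r}")


reg_losses = per_sample_loss_sketch([1.0, 2.0], [1.5, 2.0], "regression")
clf_losses = per_sample_loss_sketch([1, 0], [0.5, 0.5], "classification")
```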

Leaderboard scoring and ranking helpers.

utils.rank_fragility.leaderboard.evaluate_models(pred_audit_df, subset_ids, task, metric)

Evaluate every model on a molecule-id subset.

Parameters:

pred_audit_df (pandas.DataFrame)
subset_ids
task (str)
metric (str)

Return type:

pandas.DataFrame

utils.rank_fragility.leaderboard.rank_models(scores_df, metric)

Assign average ranks with rank 1 as best.

Parameters:

scores_df (pandas.DataFrame)
metric (str)

Return type:

pandas.DataFrame
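Reading "average ranks" as the usual tie-handling convention (tied models share the mean of the positions they occupy), the ranking step can be sketched as follows. This interpretation is an assumption, and `average_ranks` is a hypothetical helper that works on a plain dictionary rather than on the DataFrame the documented function takes.

```python
def average_ranks(scores, higher_is_better=True):
    """Map each model to its rank, with rank 1 as best and ties averaged."""
    # Sort so that the best score comes first.
    ordered = sorted(scores.items(), key=lambda kv: kv[1], reverse=higher_is_better)
    ranks = {}
    i = 0
    while i < len(ordered):
        # Extend j to cover every model tied with the one at position i.
        j = i
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        # Positions are 1-based; tied models share the mean position.
        shared = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = shared
        i = j + 1
    return ranks


# Hypothetical model names and scores.
auroc_ranks = average_ranks({"a": 0.9, "b": 0.8, "c": 0.8, "d": 0.7})
rmse_ranks = average_ranks({"x": 1.0, "y": 2.0}, higher_is_better=False)
```

In the first call, models `b` and `c` tie for positions 2 and 3 and therefore both receive rank 2.5.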

utils.rank_fragility.leaderboard.original_leaderboard(pred_audit_df, task, metric)

Evaluate and rank models on the full original test set.

Parameters:

pred_audit_df (pandas.DataFrame)
task (str)
metric (str)

Return type:

pandas.DataFrame

Counterfactual outputs

Counterfactual panel evaluation and aggregate stability summaries.

utils.rank_fragility.counterfactual.run_counterfactual_evaluation(pred_audit_df, panel_manifest, task, metric, baseline_model, sota_model)

Evaluate all models on each counterfactual panel and aggregate stability summaries.

Parameters:

pred_audit_df (pandas.DataFrame)
panel_manifest (pandas.DataFrame)
task (str)
metric (str)
baseline_model (str)
sota_model (str)

Return type:

dict[str, pandas.DataFrame]

Summary helpers for composition-driven leaderboard fragility.

utils.rank_fragility.fragility.compute_fragility_summary(rank_probabilities, sota_margin_by_composition, sota_model)

Summarize composition rates where the SOTA conclusion becomes fragile.

Parameters:

rank_probabilities (pandas.DataFrame)
sota_margin_by_composition (pandas.DataFrame)
sota_model (str)

Return type:

pandas.DataFrame

Advantage decomposition helpers for audited prediction tables.

utils.rank_fragility.attribution.compute_advantage_decomposition(pred_audit_df, sota_model, baseline_model, task, loss)

Compute per-example SOTA advantage and aggregate by chemistry audit strata.

Parameters:

pred_audit_df (pandas.DataFrame)
sota_model (str)
baseline_model (str)
task (str)
loss (str)

Return type:

tuple[pandas.DataFrame, pandas.DataFrame]
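One plausible reading of "per-example SOTA advantage" is the baseline loss minus the SOTA loss on the same example, so that positive values favour the SOTA model. The sketch below aggregates that quantity by stratum. The sign convention, the use of a mean, the stratum labels, and the helper name `advantage_by_stratum` are all assumptions made for illustration.

```python
from collections import defaultdict


def advantage_by_stratum(rows):
    """Average (baseline_loss - sota_loss) within each audit stratum."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for stratum, baseline_loss, sota_loss in rows:
        # Assumed convention: positive advantage means the SOTA model did better.
        sums[stratum] += baseline_loss - sota_loss
        counts[stratum] += 1
    return {stratum: sums[stratum] / counts[stratum] for stratum in sums}


# Hypothetical per-example losses: (stratum, baseline_loss, sota_loss).
rows = [
    ("near_leak", 0.50, 0.10),
    ("near_leak", 0.30, 0.10),
    ("clean", 0.20, 0.25),
]
summary = advantage_by_stratum(rows)
```

In this made-up data the SOTA model's advantage is concentrated in the `near_leak` stratum (mean of about 0.3) and turns slightly negative on `clean` examples, which is the kind of pattern a decomposition like this is meant to expose.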

Command-line driver

Command-line driver for rank-fragility analysis.

utils.rank_fragility.run.build_arg_parser()

Build the command-line parser for single-run and batch analysis.

Return type:

ArgumentParser

utils.rank_fragility.run.run_analysis(config)

Run one rank-fragility analysis and write its CSV outputs.

Parameters:

config (RunConfig)

Return type:

dict[str, pandas.DataFrame]

utils.rank_fragility.run.config_from_args(args)

Convert parsed command-line arguments into a run configuration.

Parameters:

args (Namespace)

Return type:

RunConfig

utils.rank_fragility.run.main(argv=None)

Execute rank-fragility analysis from command-line arguments.

Parameters:

argv (Sequence[str] | None)

Return type:

None
