Scripts Overview

Most users should start with the pare CLI. The scripts/ directory is for batch workflows, review-set preparation, and post-processing after benchmark runs.

When To Use Scripts

Use the scripts in this folder when you need to:

  • run repeatable experiment wrappers beyond a single CLI invocation
  • batch-generate scenarios or distribute them to reviewers
  • aggregate results and produce analysis artifacts
  • inspect dataset or app-usage coverage across many runs

If you only want to run the benchmark once, start here instead:

uv run pare benchmark sweep --split full --observe-model gpt-5 --execute-model gpt-5

Core Scripts

  • scripts/run_scenarios.py: run one or all registered scenarios and write trace summaries (see the example after this list).
  • scripts/run_scenario_generator_batch.py: batch wrapper for the scenario generator.
  • scripts/create_review_csvs.py: distribute generated scenarios into reviewer buckets.
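
A minimal sketch of invoking the core scripts, assuming they are run through uv like the CLI command above and expose an argparse-style --help (both are assumptions; check each script for its real options):

uv run python scripts/run_scenarios.py --help   # inspect the runner's actual flags first
uv run python scripts/run_scenarios.py          # run all registered scenarios (no-argument default is assumed)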

Analysis Scripts

  • scripts/analyze_metrics.py: aggregate traces into evaluation metrics (a typical analysis pass is sketched after this list).
  • scripts/analyze_app_usage.py: inspect app coverage across scenarios.
  • scripts/create_stratified_sample.py: build app-balanced benchmark splits.
  • scripts/plots/plot_ablation_robustness.py: generate robustness plots from combined benchmark results.
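
A typical post-run analysis pass might look like the sketch below; running the scripts without arguments is an assumption, so check each script's --help for the real interface:

uv run python scripts/analyze_metrics.py     # aggregate trace summaries into evaluation metrics
uv run python scripts/analyze_app_usage.py   # then inspect app coverage across the same scenarios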

Experiment Wrappers

  • scripts/experiments/run_models_sweep.sh
  • scripts/experiments/model_sweep_tfp.sh
  • scripts/experiments/model_sweep_env_noise.sh
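
These are shell wrappers, so launching one looks like the line below; whether a wrapper expects environment variables or extra arguments is not documented here, so treat this as a sketch:

bash scripts/experiments/run_models_sweep.sh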

Recommendation

For new workflows, prefer the pare CLI where possible (example invocations follow the list):

  • pare benchmark sweep
  • pare scenarios list
  • pare scenarios generate
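
Invoked through uv, as with the sweep command above, these look like the following; pare scenarios generate may require additional options not shown here:

uv run pare scenarios list
uv run pare scenarios generate
uv run pare benchmark sweep --split full --observe-model gpt-5 --execute-model gpt-5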