# Scripts Overview
Most users should start with the `pare` CLI. The `scripts/` directory is for batch workflows, review-set preparation, and post-processing after benchmark runs.
## When To Use Scripts
Use the scripts in this folder when you need to:
- run repeatable experiment wrappers beyond a single CLI invocation
- batch-generate scenarios or distribute them to reviewers
- aggregate results and produce analysis artifacts
- inspect dataset or app-usage coverage across many runs
If you only want to run the benchmark once, start here instead:
```bash
uv run pare benchmark sweep --split full --observe-model gpt-5 --execute-model gpt-5
```
## Core Scripts
- `scripts/run_scenarios.py`: run one or all registered scenarios and write trace summaries.
- `scripts/run_scenario_generator_batch.py`: batch wrapper for the scenario generator.
- `scripts/create_review_csvs.py`: distribute generated scenarios into reviewer buckets.
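For example, running the scenario runner might look like the sketch below. This assumes the scripts are executed from the repo root via `uv run` and expose argparse-style `--help`; no flags are documented in this overview, so check `--help` before passing any.

```bash
# Inspect the runner's actual options first (assumes argparse-style --help),
# then run it with its defaults to write trace summaries.
uv run python scripts/run_scenarios.py --help
uv run python scripts/run_scenarios.py
```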
## Analysis Scripts
- `scripts/analyze_metrics.py`: aggregate traces into evaluation metrics.
- `scripts/analyze_app_usage.py`: inspect app coverage across scenarios.
- `scripts/create_stratified_sample.py`: build app-balanced benchmark splits.
- `scripts/plots/plot_ablation_robustness.py`: generate robustness plots from combined benchmark results.
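A typical analysis pass chains the two inspection scripts after a benchmark run. The sketch below is illustrative only: the `traces/` positional argument is an assumption, since this overview does not document the scripts' interfaces.

```bash
# Aggregate traces into evaluation metrics, then check app coverage.
# The traces/ argument is hypothetical; verify with each script's --help.
uv run python scripts/analyze_metrics.py traces/
uv run python scripts/analyze_app_usage.py traces/
```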
## Experiment Wrappers
- `scripts/experiments/run_models_sweep.sh`
- `scripts/experiments/model_sweep_tfp.sh`
- `scripts/experiments/model_sweep_env_noise.sh`
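The wrappers are plain shell scripts, so they can be invoked directly from the repo root. Whether they read positional arguments or environment variables is not specified here, so start with a bare invocation:

```bash
# Launch the model sweep wrapper; any arguments it accepts are undocumented
# in this overview, so read the script before customizing a run.
bash scripts/experiments/run_models_sweep.sh
```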
## Recommendation
For new workflows, prefer the `pare` CLI where possible:

```bash
pare benchmark sweep
pare scenarios list
pare scenarios generate
```
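Put together, one plausible end-to-end flow is shown below. The ordering (list, generate, then sweep) is illustrative, not prescribed; the sweep flags match the single-run example above.

```bash
# List existing scenarios, generate new ones, then run the full sweep.
uv run pare scenarios list
uv run pare scenarios generate
uv run pare benchmark sweep --split full --observe-model gpt-5 --execute-model gpt-5
```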