Choose a scenario and batch API, check models in the grid below, click Estimate Cost, then run the batch.
Requires a saved Settings → Clients & rubric weights assignment for this scenario. Leave this set to none for global (untagged) runs.
All priced OpenRouter models are listed (sort by cost to compare). Only models in the 5 lowest Est. SIM cost tiers can be selected; every model tied at those prices is included, so at least five are selectable and often more. Vendor labels come from pricing metadata and do not imply a direct vendor API key unless Settings shows that provider configured.
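The tier rule above can be sketched as follows. This is only an illustration, not the server's actual code: the function name `selectable_models` and the `{model_id: estimated_cost}` mapping are hypothetical, and it assumes a "tier" means one distinct estimated price.

```python
from typing import Dict, List


def selectable_models(est_cost: Dict[str, float], tiers: int = 5) -> List[str]:
    """Return model ids whose estimated cost is among the `tiers` lowest distinct prices.

    Hypothetical helper: `est_cost` maps a model id to its estimated SIM cost.
    """
    # Each distinct price counts as one tier; take the cheapest `tiers` of them.
    lowest_prices = set(sorted(set(est_cost.values()))[:tiers])
    # Keep every model tied at an allowed price, ordered by cost and then by id.
    return sorted(
        (model for model, cost in est_cost.items() if cost in lowest_prices),
        key=lambda model: (est_cost[model], model),
    )


costs = {"a": 0.01, "b": 0.01, "c": 0.02, "d": 0.03, "e": 0.04, "f": 0.05, "g": 0.06}
print(selectable_models(costs))
```

Because ties share a tier, models "a" and "b" both count toward the first tier, which is why the selectable list can hold more than five entries.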
Batch API — All checked models are sent through OpenRouter.
Callable via — Shows which vendor APIs on this server list each model id (informational only for RuntimeAI; batches always use OpenRouter).
Turns, replications, token averages — Fixed for RuntimeAI (8 turns, 1 replication, 1500 in / 350 out per call). They feed the estimated batch cost line in the panel above.
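As a rough illustration of how the fixed values above could feed a cost estimate, here is a minimal sketch. It assumes one model call per turn and prices quoted in USD per million tokens; both are assumptions, as are the function names, and the panel's real formula may differ.

```python
from typing import Dict, Tuple

# Fixed RuntimeAI values quoted in the help text above.
TURNS = 8
REPLICATIONS = 1
TOKENS_IN_PER_CALL = 1500
TOKENS_OUT_PER_CALL = 350


def estimate_model_cost(usd_per_m_in: float, usd_per_m_out: float,
                        calls_per_turn: int = 1) -> float:
    """Estimated USD cost for one model (assumes `calls_per_turn` calls per turn)."""
    calls = TURNS * REPLICATIONS * calls_per_turn
    tokens_in = calls * TOKENS_IN_PER_CALL
    tokens_out = calls * TOKENS_OUT_PER_CALL
    return (tokens_in * usd_per_m_in + tokens_out * usd_per_m_out) / 1_000_000


def estimate_batch_cost(prices: Dict[str, Tuple[float, float]]) -> float:
    """Sum the per-model estimates; `prices` maps model id to (in, out) USD per million tokens."""
    return sum(estimate_model_cost(p_in, p_out) for p_in, p_out in prices.values())


# Hypothetical pricing: $1.00 per million input tokens, $2.00 per million output tokens.
print(estimate_model_cost(1.0, 2.0))
```

With these assumptions a single model uses 12,000 input and 2,800 output tokens per run, so the estimate scales linearly with the number of checked models.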
Preflight — For smaller selections, the server sends a short test call per model before starting runs. If your account has no API credits, every model can fail preflight and the batch will not start; add credits or switch the batch API / keys.
Pick a scenario, then Load aggregate. Scenario results (summary table, chart, and per-model details) always appear in the section below, not inside the folded panel. Rubric scores are shown on a 0–10 scale (2× the underlying 0–5 scoreable-slice aggregate). Expand Methodology only for the scoring rules and data-source bullets.
Scenario results
All scenarios lists every model-test batch. Pick a single scenario to narrow the list. Use Client to show only batches tagged for that client, or all. When a specific client is selected and the scenario filter matches the assignment, headline scores in batch cards use that client’s weights; per-dimension cells stay unweighted. The Scenario control under Model Costs / Batch Run stays the concrete run target (used when you click Run agent batch).
Use the checkbox on each batch to select it, then Delete selected batches. Click the batch title row (not the checkbox) to expand. Dig in sections hold the comparison table and per-model rows. Expand all opens every panel.
Each card below is either an admin “Run agent batch” (shared batch id) or a single run from the main sim (AI agent mode, no batch id). Both use the same automated rubric (total_25 → mean /10). Scoreable means the run ended with enough protagonist ↔ counterpart turns to compare scores fairly.
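The total_25 → /10 conversion mentioned above can be sketched like this. It assumes total_25 is the sum of five dimensions each scored 0–5, and that the displayed figure is the mean over scoreable runs; the function names are hypothetical.

```python
from statistics import mean
from typing import Iterable


def score_out_of_10(total_25: float) -> float:
    """Convert a 0-25 rubric total to the 0-10 view (2x the 0-5 per-dimension mean)."""
    if not 0 <= total_25 <= 25:
        raise ValueError("total_25 must be between 0 and 25")
    mean_0_to_5 = total_25 / 5  # assumes five equally sized dimensions
    return 2 * mean_0_to_5


def mean_score_out_of_10(scoreable_totals: Iterable[float]) -> float:
    """Average the 0-10 scores over scoreable runs only (assumed aggregation)."""
    return mean(score_out_of_10(total) for total in scoreable_totals)


print(score_out_of_10(20))             # 20/25 on the rubric
print(mean_score_out_of_10([20, 15]))  # two scoreable runs
```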
Maintained from observed agent check-ride runs (admin batch + main sim AI agent) and the model universe in model-costs.json.
Add a client, then assign a scenario and relative weights for the five automated rubric dimensions (defaults are equal). Under Model Costs / Batch Run, you can tag Run agent batch with a client so Model Test Results can filter runs and apply weighted headline scores for that client. Each scenario needs its own saved row (same client + scenario id as the batch). Below each client, saved rows show blend % by metric; Edit loads that row into the form.
Each slider sets a 0–10 emphasis, and only the ratios between sliders matter. The line below shows your blend as percentages that sum to 100, which is what gets saved. Absolute totals do not affect scoring.
Grounds claims in verifiable facts, not guesses or hand-waving.
Clarifies problem, constraints, and context before committing.
Empathy, tone, pushback, and fit with the counterpart’s needs.
Clear structure and actionable next steps.
Notices and fixes weak or mistaken own moves as the thread evolves.
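To illustrate how relative slider values could become a saved blend for the five dimensions above, here is a minimal sketch. The dimension keys, the equal-weight fallback for all-zero sliders, and the weighted headline formula are all assumptions made for illustration, not the application's confirmed behavior.

```python
from typing import Dict

# Hypothetical keys standing in for the five rubric dimensions described above.
DIMENSIONS = ["dim_1", "dim_2", "dim_3", "dim_4", "dim_5"]


def blend_percentages(sliders: Dict[str, float]) -> Dict[str, float]:
    """Normalize 0-10 slider values to percentages that sum to 100."""
    total = sum(sliders.values())
    if total <= 0:
        # Assumption: all-zero sliders fall back to an equal blend.
        return {name: 100 / len(sliders) for name in sliders}
    return {name: 100 * value / total for name, value in sliders.items()}


def weighted_headline(scores_0_to_5: Dict[str, float], blend: Dict[str, float]) -> float:
    """Assumed headline: blend-weighted mean of 0-5 dimension scores, shown on 0-10."""
    weighted_mean = sum(blend[name] * scores_0_to_5[name] for name in blend) / 100
    return 2 * weighted_mean


low = blend_percentages({name: 2 for name in DIMENSIONS})
high = blend_percentages({name: 10 for name in DIMENSIONS})
print(low == high)  # same ratios give the same blend
```

Setting every slider to 2 or every slider to 10 yields the same 20% per dimension, which is the sense in which absolute totals do not affect scoring.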
Only the row for the client and scenario chosen above is written. To add another scenario for the same client, change the scenario dropdown and save again. Use Edit in the table to reopen a row without touching others.
Keys are read from .env (or *_FILE variables) next to server.py.
Restart uvicorn after editing.
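A common pattern for the key lookup described above is sketched below. It assumes the .env file has already been loaded into the process environment, and that a direct variable takes precedence over its *_FILE counterpart; the helper name `read_secret` and the variable name `EXAMPLE_API_KEY` are hypothetical, and server.py may resolve keys differently.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def read_secret(name: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the secret from `NAME`, else from the file named by `NAME_FILE`, else None."""
    value = env.get(name)
    if value:
        return value.strip()
    file_path = env.get(f"{name}_FILE")
    if file_path:
        # The *_FILE variable holds a path; the file content is the secret.
        return Path(file_path).read_text(encoding="utf-8").strip()
    return None


# Hypothetical variable name; prints None when neither variable is set.
print(read_secret("EXAMPLE_API_KEY", env={}))
```

Since values like these are typically read once at startup, a restart is what makes an edited key take effect.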
Per-provider snapshot from your server keys (keys never appear in the browser). OpenAI shows identity and billing hints when the key can reach OpenAI’s billing endpoints; Anthropic and OpenRouter balances are not exposed on standard API routes.
Provider cards refresh when you open this tab or use Model Costs / Batch Run → Refresh pricing.
Snapshot loads when you click Individual Results → Refresh (same as the primary admin refresh).
Ends every human run still In progress or Pending verification (sets status to Completed). Does not stop agent batch runs (status Agent running).
Anonymous or named feedback from sim and admin pages for this product.