Vantage SimOps·Admin
Choose a scenario and batch API, pick models in the Model cost estimates grid below, then start the batch.
A client tag requires a saved Settings → Clients & rubric weights assignment for this scenario. Leave the client as none for global (untagged) runs.
Pick models, then choose which API runs the batch. The table’s Vendor column comes from
pricing metadata (e.g. gpt-… rows say OpenAI) — it does not mean your server has an
OpenAI key unless Settings shows OpenAI configured.
When Callable via lists multiple backends, Est. SIM cost, In $/1M, and
Out $/1M each show one figure per backend as value / value, in that same order; ?? means there is no USD rate for that path in
model-costs.json yet (only the route that matches Vendor has numbers today).
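As an illustration of the value / value convention, here is a minimal sketch of how one grid cell could be rendered. The function name and the two-decimal formatting are assumptions for illustration, not taken from the admin code.

```python
from typing import Optional, Sequence


def format_route_values(values: Sequence[Optional[float]], digits: int = 2) -> str:
    """Render one figure per backend, in the same order as Callable via.

    None stands for "no USD rate for this path" and is shown as "??".
    (Hypothetical helper; the real grid may format numbers differently.)
    """
    parts = ["??" if v is None else f"{v:.{digits}f}" for v in values]
    return " / ".join(parts)


# A model callable via two backends where only the second has a rate.
print(format_route_values([None, 3.0]))
```

A model with a single backend simply shows one figure with no separator.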
Batch API — All checked models are sent to that endpoint by default
(OpenRouter and Anthropic; OpenAI appears here only when the server has OPENAI_API_KEY).
Pick Anthropic only if every model id exists on Anthropic’s API.
Callable via — Shows which vendor APIs on this server list each model id. The batch panel blocks “Run agent batch” if a selected model’s Callable via column does not include the backend the batch will actually call (given “Batch API” and “Route Claude via Anthropic”).
Route Claude via Anthropic — Optional. When checked, Claude-family ids use
ANTHROPIC_API_KEY; everything else still uses the API you picked above. Leave off if you
only use OpenRouter or your Anthropic key is wrong—otherwise preflight fails for Claude rows.
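The interaction between Batch API, Route Claude via Anthropic, and Callable via can be sketched roughly as follows. This is a sketch under stated assumptions: the function names, the lowercase backend labels, the model ids, and the substring test for "Claude-family" are all hypothetical, and the server may decide these differently.

```python
from typing import Dict, Iterable, List, Set


def resolve_backend(model_id: str, batch_api: str, route_claude_via_anthropic: bool) -> str:
    """Return the backend a model would actually be sent to."""
    # Assumption: a Claude-family id is any id containing "claude".
    if route_claude_via_anthropic and "claude" in model_id.lower():
        return "anthropic"
    return batch_api


def blocked_models(
    selected: Iterable[str],
    callable_via: Dict[str, Set[str]],
    batch_api: str,
    route_claude_via_anthropic: bool,
) -> List[str]:
    """List selected models whose Callable via set lacks the resolved backend."""
    blocked = []
    for model_id in selected:
        backend = resolve_backend(model_id, batch_api, route_claude_via_anthropic)
        if backend not in callable_via.get(model_id, set()):
            blocked.append(model_id)
    return blocked


# Hypothetical model ids and Callable via data.
callable_via = {
    "claude-example": {"openrouter", "anthropic"},
    "gpt-example": {"openrouter"},
}
print(blocked_models(callable_via, callable_via, "anthropic", False))
```

In this sketch, a non-empty result corresponds to the case where Run agent batch would be blocked.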
Replications — How many full simulations to run per selected model (same
scenario). Use 1 for a single run, or raise it to compare variance across runs.
Preflight — For smaller selections, the server sends a short test call per model before starting runs. If your account has no API credits, every model can fail preflight and the batch will not start; add credits, switch the batch API / keys, or enable Skip preflight to enqueue runs anyway (each run may still error if the account cannot call the model).
Token averages — These feed the estimated batch cost line in the panel above (same formula as this grid).
The defaults, 1500 in and 350 out, come from
model-costs.json → assumptions (a calibration for multi-turn sims, not
auto-measured from transcripts). Edit them for what-if analysis; Est. SIM cost recomputes
using the rates in the same file the sim uses for cost labels.
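The exact cost formula is not spelled out here. A plausible sketch, assuming the token averages are per-simulation totals and cost is linear in the two per-million rates, looks like this. The function names and the example rates are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def est_sim_cost(
    in_rate_per_m: Optional[float],
    out_rate_per_m: Optional[float],
    avg_in_tokens: float = 1500,
    avg_out_tokens: float = 350,
) -> Optional[float]:
    """Estimated USD cost of one simulation, or None when a rate is missing (??)."""
    if in_rate_per_m is None or out_rate_per_m is None:
        return None
    return (avg_in_tokens / 1_000_000) * in_rate_per_m + (
        avg_out_tokens / 1_000_000
    ) * out_rate_per_m


def est_batch_cost(
    rates: Iterable[Tuple[Optional[float], Optional[float]]],
    replications: int = 1,
) -> Tuple[float, int]:
    """Sum per-sim estimates over models, times replications.

    Returns (total_usd, unpriced_model_count); unpriced models are skipped.
    """
    total, unpriced = 0.0, 0
    for in_rate, out_rate in rates:
        cost = est_sim_cost(in_rate, out_rate)
        if cost is None:
            unpriced += 1
        else:
            total += cost * replications
    return total, unpriced


# Example with made-up rates of $3 in and $15 out per 1M tokens.
print(est_sim_cost(3.0, 15.0))
```

Changing the token averages in this sketch rescales every estimate without touching the rates, which is the what-if behavior described above.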
Pick a scenario, then Load aggregate. Scenario results (summary table, chart, and per-model details) always appear in the section below—not inside the folded panel. Rubric is shown on a 0–10 view (2× the underlying 0–5 scoreable-slice aggregate). Fold open Methodology only for the scoring rules and data-source bullets.
Scenario results
All scenarios lists every model-test batch. Pick a single scenario to narrow the list. Use Client to show only batches tagged for that client, or all. When a specific client is selected and the scenario filter matches the assignment, headline scores in batch cards use that client’s weights; per-dimension cells stay unweighted. The Scenario control under Model Costs / Batch Run stays the concrete run target (used when you click Run agent batch).
Use the checkbox on each batch to select it, then Delete selected batches. Click the batch title row (not the checkbox) to expand. Dig in sections hold the comparison table and per-model rows. Expand all opens every panel.
Each card below is either an admin “Run agent batch” (shared batch id) or a
single run from the main sim (AI agent mode, no batch id). Both use the same automated rubric
(total_25 → mean /10). Scoreable means the run ended with enough
protagonist ↔ counterpart turns to compare scores fairly.
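The total_25 → mean /10 conversion can be sketched as below, assuming total_25 is the sum of five dimensions each scored 0–5 and that only scoreable runs enter the aggregate. The function names and the (total_25, scoreable) tuple shape are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def rubric_out_of_10(total_25: float) -> float:
    """Convert a 0-25 rubric total to the 0-10 view (2x the 0-5 mean)."""
    mean_0_5 = total_25 / 5  # assumption: five dimensions, each 0-5
    return 2 * mean_0_5


def batch_mean_out_of_10(runs: Iterable[Tuple[float, bool]]) -> Optional[float]:
    """Mean 0-10 score over scoreable runs; None when no run is scoreable."""
    scores = [rubric_out_of_10(total) for total, scoreable in runs if scoreable]
    if not scores:
        return None
    return sum(scores) / len(scores)


# Two scoreable runs and one run that ended too early to score fairly.
print(batch_mean_out_of_10([(20, True), (10, False), (15, True)]))
```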
Maintained from observed agent check-ride runs (admin batch + main sim AI agent) and the model universe in model-costs.json.
Add a client, then assign a scenario and relative weights for the five automated rubric dimensions (defaults are equal). Under Model Costs / Batch Run, you can tag Run agent batch with a client so Model Test Results can filter runs and apply weighted headline scores for that client. Each scenario needs its own saved row (same client + scenario id as the batch). Below each client, saved rows show blend % by metric; Edit loads that row into the form.
Each slider is 0–10 emphasis (relative only). The line below is your blend as % of 100 (what gets saved). Absolute totals do not affect scoring.
Grounds claims in verifiable facts, not guesses or hand-waving.
Clarifies problem, constraints, and context before committing.
Empathy, tone, pushback, and fit with the counterpart’s needs.
Clear structure and actionable next steps.
Notices and fixes its own weak or mistaken moves as the thread evolves.
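The slider-to-blend step for these five dimensions can be sketched as a simple normalization. This is a sketch under assumptions: the dimension keys are hypothetical labels for the five descriptions above, the equal-weight fallback for all-zero sliders is a guess, and the weighted headline is assumed to use the same 2× scaling as the 0–10 view.

```python
from typing import Dict

# Hypothetical keys for the five rubric dimensions.
DIMENSIONS = ["evidence", "discovery", "rapport", "structure", "self_correction"]


def blend_percentages(sliders: Dict[str, float]) -> Dict[str, float]:
    """Turn relative 0-10 slider values into percentages that sum to 100."""
    total = sum(sliders.values())
    if total <= 0:
        # Assumption: all-zero sliders fall back to equal weights.
        return {name: 100 / len(sliders) for name in sliders}
    return {name: 100 * value / total for name, value in sliders.items()}


def weighted_headline(dim_scores_0_5: Dict[str, float], blend_pct: Dict[str, float]) -> float:
    """Weighted 0-10 headline score from per-dimension 0-5 scores."""
    weighted_mean = sum(dim_scores_0_5[name] * pct / 100 for name, pct in blend_pct.items())
    return 2 * weighted_mean


# Only the ratios matter: doubling every slider yields the same blend.
low = blend_percentages({name: 3 for name in DIMENSIONS})
high = blend_percentages({name: 6 for name in DIMENSIONS})
print(low == high)
```

Because only the normalized percentages are used, absolute slider totals drop out of the score, which matches the note that they do not affect scoring.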
Only the row for the client and scenario chosen above is written. To add another scenario for the same client, change the scenario dropdown and save again. Use Edit in the table to reopen a row without touching others.
Keys are read from .env (or *_FILE variables) next to server.py.
Restart uvicorn after editing.
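A minimal sketch of how a key lookup with a *_FILE fallback might work. The helper name and the precedence (direct variable first, then the file path) are assumptions, and the sketch expects the .env values to be loaded into the process environment already; it does not parse .env itself.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def read_key(name: str, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the secret for `name`, or None if it is not configured.

    Checks NAME first, then reads the file named by NAME_FILE.
    (Hypothetical helper; server.py may resolve keys differently.)
    """
    value = env.get(name, "").strip()
    if value:
        return value
    file_path = env.get(f"{name}_FILE", "").strip()
    if file_path and Path(file_path).is_file():
        return Path(file_path).read_text(encoding="utf-8").strip() or None
    return None


# Example: report which providers look configured, without printing any key.
for provider_var in ("OPENROUTER_API_KEY", "ANTHROPIC_API_KEY", "OPENAI_API_KEY"):
    print(provider_var, "configured" if read_key(provider_var) else "missing")
```

Since values are read from the environment at call time in this sketch, a long-running server that caches them at startup would still need the restart mentioned above.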
Per-provider snapshot from your server keys (keys never appear in the browser). OpenAI shows identity and billing hints when the key can reach OpenAI’s billing endpoints; Anthropic and OpenRouter balances are not exposed on standard API routes.
Provider cards refresh when you open this tab or use Model Costs / Batch Run → Refresh pricing.
Snapshot loads when you click Individual Results → Refresh (same as the primary admin refresh).
Ends every human run still In progress or Pending verification (sets status to Completed). Does not stop agent batch runs (status Agent running).
Anonymous or named feedback from sim and admin pages for this product.