Vantage - SimOps

Runs

Filters: scenario, type, status, client, sort, bulk actions

Select all (this page)

Scorecard

Transcript

Verification

Details

Run agent batch

Choose scenario and batch API, pick models in the Model cost estimates grid below, then start the batch.

Scenario

Persona —

Client (optional)

Requires a saved Settings → Clients & rubric weights assignment for this scenario. Leave none for global (untagged) runs.

Turns · Avg in / call · Avg out / call · Replications · Batch API · Route Claude via Anthropic (native) · Skip preflight · Exclude blocked / unavailable

Model cost estimates

Pick models, then choose which API runs the batch. The table’s Vendor column comes from pricing metadata (e.g. gpt-… rows say OpenAI); it does not mean your server has an OpenAI key unless Settings shows OpenAI configured. When Callable via lists multiple backends, Est. SIM cost, In $/1M, and Out $/1M each show one value per backend, written as value / value in that same order. ?? means model-costs.json has no USD rate for that path yet (today only the route that matches Vendor has numbers).

Search · Vendor · Type · Sort

Pricing

Help: batch API, native Claude, replications, tokens

Batch API — All checked models are sent to that endpoint by default (OpenRouter and Anthropic; OpenAI appears here only when the server has OPENAI_API_KEY). Pick Anthropic only if every model id exists on Anthropic’s API.

Callable via — Shows which vendor APIs on this server list each model id. The batch panel blocks “Run agent batch” if a selected model’s Callable via column does not include the backend the batch will actually call (given “Batch API” and “Route Claude via Anthropic”).

Route Claude via Anthropic — Optional. When checked, Claude-family ids use ANTHROPIC_API_KEY; everything else still uses the API you picked above. Leave it unchecked if you only use OpenRouter or your Anthropic key is incorrect; otherwise preflight fails for Claude rows.
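The routing and blocking rules described above can be sketched as follows. This is a minimal illustration, not the server's actual code: the function names, the lowercase backend labels, the example model ids, and the substring test for "Claude-family" ids are all assumptions made for the example.

```python
def effective_backend(model_id: str, batch_api: str, route_claude_via_anthropic: bool) -> str:
    """Return the backend a model would be sent to under the rules above."""
    # Assumption: a "Claude-family" id is detected by the substring "claude".
    if route_claude_via_anthropic and "claude" in model_id.lower():
        return "anthropic"
    return batch_api


def blocked_models(selected, callable_via, batch_api, route_claude_via_anthropic):
    """List selected ids whose Callable via set lacks the backend actually called."""
    blocked = []
    for model_id in selected:
        backend = effective_backend(model_id, batch_api, route_claude_via_anthropic)
        if backend not in callable_via.get(model_id, set()):
            blocked.append(model_id)
    return blocked


# Hypothetical ids and Callable via data, for illustration only.
callable_via = {
    "example/claude-model": {"anthropic"},
    "example/other-model": {"openrouter"},
}
selected = list(callable_via)

print(blocked_models(selected, callable_via, "openrouter", False))  # ['example/claude-model']
print(blocked_models(selected, callable_via, "openrouter", True))   # []
```

In this sketch a non-empty result corresponds to the case where the panel would block "Run agent batch".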

Replications — How many full simulations to run per selected model (same scenario). Use 1 for a single run, or raise it to compare variance across runs.

Preflight — For smaller selections, the server sends a short test call per model before starting runs. If your account has no API credits, every model can fail preflight and the batch will not start; add credits, switch the batch API / keys, or enable Skip preflight to enqueue runs anyway (each run may still error if the account cannot call the model).

Token averages — These feed the estimated batch cost line in the panel above (same formula as this grid). Defaults 1500 in and 350 out come from model-costs.json → assumptions (calibration for multi-turn sims; not auto-measured from transcripts). Edit for what-if analysis; Est. SIM cost recomputes and matches the same file the sim uses for cost labels.
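As a rough illustration of how the token averages could feed the estimate, the sketch below assumes one model call per turn and a simple per-token price. The page does not spell out the formula, so the function name, the call-count assumption, and the example rates are hypothetical; only the defaults of 1500 in and 350 out come from the text above.

```python
def estimate_sim_cost(in_usd_per_1m, out_usd_per_1m, turns, replications=1,
                      avg_in=1500, avg_out=350):
    """Estimate batch cost in USD for one model, or None when a rate is missing."""
    # A missing rate corresponds to the "??" cells in the grid.
    if in_usd_per_1m is None or out_usd_per_1m is None:
        return None
    # Assumption: each turn is exactly one API call.
    calls = turns * replications
    per_call = (avg_in * in_usd_per_1m + avg_out * out_usd_per_1m) / 1_000_000
    return calls * per_call


# Hypothetical rates: $3 per 1M input tokens, $15 per 1M output tokens.
cost = estimate_sim_cost(3.0, 15.0, turns=10, replications=2)
print(f"{cost:.3f}")  # 0.195
```

Changing avg_in or avg_out here plays the same role as editing the token averages for what-if analysis.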

Evaluate models by scenario

Scenario · Client

Pick a scenario, then click Load aggregate. Scenario results (summary table, chart, and per-model details) always appear in the section below, not inside the folded panel. The rubric is shown on a 0–10 scale (2× the underlying 0–5 scoreable-slice aggregate). Expand Methodology only for the scoring rules and data-source bullets.

Methodology Scoring rules & data sources · click to expand

Scenario results

Batch runs

Scenario · Client

All scenarios lists every model-test batch. Pick a single scenario to narrow the list. Use Client to show only batches tagged for that client, or all. When a specific client is selected and the scenario filter matches the assignment, headline scores in batch cards use that client’s weights; per-dimension cells stay unweighted. The Scenario control under Model Costs / Batch Run stays the concrete run target (used when you click Run agent batch).

Use the checkbox on each batch to select it, then Delete selected batches. Click the batch title row (not the checkbox) to expand. Dig in sections hold the comparison table and per-model rows. Expand all opens every panel.

How to read these results

Each card below is either an admin “Run agent batch” (shared batch id) or a single run from the main sim (AI agent mode, no batch id). Both use the same automated rubric (total_25 → mean /10). Scoreable means the run ended with enough protagonist ↔ counterpart turns to compare scores fairly.
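The total_25 → mean /10 mapping can be illustrated with a short sketch. It assumes total_25 is the sum of five rubric dimensions scored 0–5 each, which is inferred from the name and from the 0–5 aggregate mentioned elsewhere on this page; the function name and dictionary keys are hypothetical.

```python
def rubric_to_ten(dimension_scores):
    """Map five 0-5 dimension scores to the 0-10 headline view."""
    if len(dimension_scores) != 5:
        raise ValueError("expected exactly five rubric dimensions")
    total_25 = sum(dimension_scores.values())  # ranges from 0 to 25
    mean_5 = total_25 / 5                      # 0-5 aggregate
    return 2 * mean_5                          # 0-10 view


scores = {"evidence": 4, "intake": 3, "humanity": 5, "clarity": 4, "self_correction": 4}
print(rubric_to_ten(scores))  # 8.0
```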

Model registry

Show registry columns (Working / Non-functional / Untested)

Maintained from observed agent check-ride runs (admin batch + main sim AI agent) and the model universe in model-costs.json.

Working (scoreable discourse)

Non-functional (errors / empty / incompatible)

Not yet tested

Clients & rubric weights

Add a client, then assign a scenario and relative weights for the five automated rubric dimensions (defaults are equal). Under Model Costs / Batch Run, you can tag Run agent batch with a client so Model Test Results can filter runs and apply weighted headline scores for that client. Each scenario needs its own saved row (same client + scenario id as the batch). Below each client, saved rows show blend % by metric; Edit loads that row into the form.

New client name

Edit or add weights (one saved row per client + scenario; other rows stay unchanged)

Client · Scenario

Each slider is 0–10 emphasis (relative only). The line below is your blend as % of 100 (what gets saved). Absolute totals do not affect scoring.

Evidence

Grounds claims in verifiable facts, not guesses or hand-waving.

Intake

Clarifies problem, constraints, and context before committing.

Humanity

Empathy, tone, pushback, and fit with the counterpart’s needs.

Clarity

Clear structure and actionable next steps.

Self-correction

Notices and fixes weak or mistaken own moves as the thread evolves.

Only the row for the client and scenario chosen above is written. To add another scenario for the same client, change the scenario dropdown and save again. Use Edit in the table to reopen a row without touching others.
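The slider-to-blend step described above can be sketched as a normalization. This is an illustrative guess at the arithmetic, not the saved-row format: the key names, the equal-weight fallback for all-zero sliders, and the weighted headline formula are assumptions.

```python
DIMENSIONS = ("evidence", "intake", "humanity", "clarity", "self_correction")


def blend_percentages(sliders):
    """Turn 0-10 slider emphasis into a blend that sums to 100 percent."""
    total = sum(sliders[d] for d in DIMENSIONS)
    if total == 0:
        # Assumption: all-zero sliders fall back to equal weights.
        return {d: 100 / len(DIMENSIONS) for d in DIMENSIONS}
    return {d: 100 * sliders[d] / total for d in DIMENSIONS}


def weighted_headline(scores, blend):
    """Weighted 0-10 headline from 0-5 dimension scores (assumed formula)."""
    weighted_mean_5 = sum(scores[d] * blend[d] / 100 for d in DIMENSIONS)
    return 2 * weighted_mean_5


# Only relative emphasis matters: doubling every slider gives the same blend.
low = blend_percentages({"evidence": 4, "intake": 2, "humanity": 2, "clarity": 2, "self_correction": 2})
high = blend_percentages({"evidence": 8, "intake": 4, "humanity": 4, "clarity": 4, "self_correction": 4})
print(round(low["evidence"], 1), round(high["evidence"], 1))  # 33.3 33.3
```

With equal sliders every dimension gets 20%, which in this sketch reproduces the unweighted headline score.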

LLM API access

Keys are read from .env (or *_FILE variables) next to server.py. Restart uvicorn after editing.
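One common way to support both a plain variable and a *_FILE variant is sketched below. The helper name and the precedence (direct variable first, then the file) are assumptions, not a description of what server.py does; the sketch also assumes the .env file has already been loaded into the process environment.

```python
import os
from pathlib import Path


def read_secret(name, env=None):
    """Return the value of NAME, or the contents of the file named by NAME_FILE."""
    env = os.environ if env is None else env
    value = env.get(name)
    if value:
        return value.strip()
    # Assumption: NAME_FILE holds a path to a file containing only the key.
    file_path = env.get(f"{name}_FILE")
    if file_path:
        return Path(file_path).read_text(encoding="utf-8").strip()
    return None


# Returns None when neither ANTHROPIC_API_KEY nor ANTHROPIC_API_KEY_FILE is set.
anthropic_key = read_secret("ANTHROPIC_API_KEY")
```

Because a sketch like this reads the environment when it is called, a process that reads its keys once at startup only sees edits after a restart, which fits the restart note above.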

Per-provider snapshot from your server keys (keys never appear in the browser). OpenAI shows identity and billing hints when the key can reach OpenAI’s billing endpoints; Anthropic and OpenRouter balances are not exposed on standard API routes.

Provider cards refresh when you open this tab or use Model Costs / Batch Run → Refresh pricing.

Storage debug

Snapshot loads when you click Individual Results → Refresh (same as the primary admin refresh).

Human sessions

Ends every human run still In progress or Pending verification (sets status to Completed). Does not stop agent batch runs (status Agent running).

User feedback

Anonymous or named feedback from sim and admin pages for this product.