Vantage - RuntimeAI

Runs

Filters: scenario, type, status, client, sort, bulk actions

Select all (this page)

Scorecard

Transcript

Details

Run agent batch

Choose a scenario and a batch API, check models in the grid below, click Estimate Cost, then run the batch.

Type Scenario

Persona —

Client (optional)

Requires a saved Settings → Clients & rubric weights assignment for this scenario. Leave this set to none for global (untagged) runs.

Turns · Avg in / call · Avg out / call · Replications · Batch API · Exclude blocked / unavailable

Model cost estimates

All priced OpenRouter models are listed (sort by cost to compare). Only models in the 5 lowest Est. SIM cost tiers can be selected. Every model tied at those prices is included, so at least five models are selectable and often more. Check the models you want, click Estimate Cost, then run the batch. Vendor labels come from pricing metadata and do not imply a direct vendor API key unless Settings shows that provider configured.
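The tier rule above can be sketched as follows. This is a minimal illustration, not the server's actual code: the function name `selectable_models` and the shape of the input (a dict of model id to estimated SIM cost) are assumptions, and a "tier" is taken to mean one distinct estimated price.

```python
def selectable_models(est_costs, tiers=5):
    """Return model ids in the `tiers` lowest distinct cost levels.

    est_costs: hypothetical dict mapping model id -> estimated SIM cost.
    Models tied at a qualifying price are all included.
    """
    # Distinct prices, cheapest first; keep only the lowest `tiers` of them.
    lowest_prices = sorted(set(est_costs.values()))[:tiers]
    if not lowest_prices:
        return []
    cutoff = lowest_prices[-1]
    # Everything at or below the most expensive qualifying tier is selectable.
    eligible = [m for m, cost in est_costs.items() if cost <= cutoff]
    return sorted(eligible, key=lambda m: (est_costs[m], m))


costs = {"a": 0.01, "b": 0.01, "c": 0.02, "d": 0.03,
         "e": 0.04, "f": 0.05, "g": 0.05, "h": 0.06}
print(selectable_models(costs))
```

In this sample, seven models qualify because two pairs are tied on price, while the model in the sixth tier is left out.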

Search Vendor Type Sort

Pricing

Help: batch API, replications, tokens

Batch API — All checked models are sent through OpenRouter.

Callable via — Shows which vendor APIs on this server list each model id (informational only for RuntimeAI; batches always use OpenRouter).

Turns, replications, token averages — Fixed for RuntimeAI (8 turns, 1 replication, 1500 in / 350 out per call). They feed the estimated batch cost line in the panel above.

Preflight — For smaller selections, the server sends a short test call per model before starting runs. If your account has no API credits, every model can fail preflight and the batch will not start; add credits or switch the batch API / keys.
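As a rough illustration of how the fixed RuntimeAI values (8 turns, 1 replication, 1500 in / 350 out per call) could feed a per-model estimate, here is a minimal sketch. It assumes one model call per turn and prices quoted in USD per million tokens; both are assumptions, and `estimate_model_cost` is a hypothetical name rather than a function on the server.

```python
# Fixed values taken from the help text above.
TURNS = 8
REPLICATIONS = 1
AVG_IN_TOKENS = 1500
AVG_OUT_TOKENS = 350


def estimate_model_cost(in_price_per_mtok, out_price_per_mtok,
                        turns=TURNS, replications=REPLICATIONS,
                        avg_in=AVG_IN_TOKENS, avg_out=AVG_OUT_TOKENS):
    """Estimate the cost of one model's runs in a batch.

    Assumes one call per turn and prices in USD per million tokens.
    """
    calls = turns * replications
    input_cost = calls * avg_in * in_price_per_mtok / 1_000_000
    output_cost = calls * avg_out * out_price_per_mtok / 1_000_000
    return input_cost + output_cost


# A hypothetical model priced at $1.00 in / $2.00 out per million tokens.
print(round(estimate_model_cost(1.0, 2.0), 4))
```

Under these assumptions the batch estimate would be the sum of this figure over every checked model.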

Evaluate models by scenario

Type Scenario Client

Pick a scenario, then click Load aggregate. Scenario results (summary table, chart, and per-model details) always appear in the section below, not inside the folded panel. Rubric is shown on a 0–10 view (2× the underlying 0–5 scoreable-slice aggregate). Expand Methodology only if you need the scoring rules and data-source bullets.

Methodology Scoring rules & data sources · click to expand

Scenario results

Batch runs

Type Scenario Client

All scenarios lists every model-test batch. Pick a single scenario to narrow the list. Use Client to show only batches tagged for that client, or all. When a specific client is selected and the scenario filter matches the assignment, headline scores in batch cards use that client’s weights; per-dimension cells stay unweighted. The Scenario control under Model Costs / Batch Run stays the concrete run target (used when you click Run agent batch).

Use the checkbox on each batch to select it, then click Delete selected batches. Click the batch title row (not the checkbox) to expand it. The “Dig in” sections hold the comparison table and per-model rows. Expand all opens every panel.

How to read these results

Each card below is either an admin “Run agent batch” (shared batch id) or a single run from the main sim (AI agent mode, no batch id). Both use the same automated rubric (total_25 → mean /10). Scoreable means the run ended with enough protagonist ↔ counterpart turns to compare scores fairly.
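The score conversion can be sketched like this. It assumes that `total_25` is the sum of five rubric dimensions scored 0–5 each, so the mean is on a 0–5 scale and the 0–10 view is twice that. The `scoreable` flag and the function names are hypothetical.

```python
def rubric_out_of_10(total_25):
    """Convert a 0-25 rubric total to the 0-10 view (2 x the 0-5 mean)."""
    if not 0 <= total_25 <= 25:
        raise ValueError("total_25 must be between 0 and 25")
    mean_0_to_5 = total_25 / 5  # five dimensions, assumed equally weighted
    return 2 * mean_0_to_5


def aggregate_out_of_10(runs):
    """Average the 0-10 score over scoreable runs only; None if there are none."""
    scores = [rubric_out_of_10(r["total_25"]) for r in runs if r.get("scoreable")]
    if not scores:
        return None
    return sum(scores) / len(scores)


runs = [
    {"total_25": 20, "scoreable": True},
    {"total_25": 10, "scoreable": True},
    {"total_25": 0, "scoreable": False},  # excluded from the aggregate
]
print(aggregate_out_of_10(runs))
```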

Model registry

Show registry columns (Working / Non-functional / Untested)

Maintained from observed agent check-ride runs (admin batch + main sim AI agent) and the model universe in model-costs.json.

Working (scoreable discourse)

Non-functional (errors / empty / incompatible)

Not yet tested

Scoreboard

Efficiency rankings — rubric quality ÷ latency ÷ cost, with replication-run statistics (N, rubric σ, latency p95, cost variance). For deep rubric breakdowns, charts, and batch history use Model Test Results instead.
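One way to read the efficiency formula and the replication statistics is sketched below. The field names (`rubric_10`, `latency_s`, `cost_usd`), the use of population statistics, and the nearest-rank p95 are all assumptions made for illustration; the page does not say how the server computes them.

```python
import math
import statistics


def p95_nearest_rank(values):
    """95th percentile by the nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def scoreboard_row(runs):
    """Summarise replication runs for one model (hypothetical record shape)."""
    if not runs:
        raise ValueError("at least one run is required")
    rubric = [r["rubric_10"] for r in runs]
    latency = [r["latency_s"] for r in runs]
    cost = [r["cost_usd"] for r in runs]
    mean_latency = statistics.fmean(latency)
    mean_cost = statistics.fmean(cost)
    efficiency = None
    if mean_latency > 0 and mean_cost > 0:
        # rubric quality / latency / cost, using per-model means
        efficiency = statistics.fmean(rubric) / mean_latency / mean_cost
    return {
        "n": len(runs),
        "rubric_sigma": statistics.pstdev(rubric),
        "latency_p95": p95_nearest_rank(latency),
        "cost_variance": statistics.pvariance(cost),
        "efficiency": efficiency,
    }


row = scoreboard_row([
    {"rubric_10": 8.0, "latency_s": 2.0, "cost_usd": 0.01},
    {"rubric_10": 6.0, "latency_s": 4.0, "cost_usd": 0.03},
])
print(row)
```

With a single replication (the fixed RuntimeAI setting), σ and variance come out as zero under this sketch, so the statistics only become informative when N is larger.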

Type Scenario Client

Please wait — this can take 10–20 seconds while we score every model against this scenario.

Clients & rubric weights

Add a client, then assign a scenario and relative weights for the five automated rubric dimensions (defaults are equal). Under Model Costs / Batch Run, you can tag Run agent batch with a client so Model Test Results can filter runs and apply weighted headline scores for that client. Each scenario needs its own saved row (same client + scenario id as the batch). Below each client, saved rows show blend % by metric; Edit loads that row into the form.

New client name

Edit or add weights (one saved row per client + scenario; other rows stay unchanged)

Client Type Scenario

Each slider sets a 0–10 emphasis (relative only). The line below the sliders shows your blend as percentages that sum to 100, which is what gets saved. Absolute slider totals do not affect scoring.

Evidence

Grounds claims in verifiable facts, not guesses or hand-waving.

Intake

Clarifies problem, constraints, and context before committing.

Humanity

Empathy, tone, pushback, and fit with the counterpart’s needs.

Clarity

Clear structure and actionable next steps.

Self-correction

Notices and fixes weak or mistaken own moves as the thread evolves.
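The slider-to-blend conversion for the five dimensions above can be sketched as a simple normalisation. The dimension keys, the fallback to equal weights when every slider is zero, and the `weighted_headline` helper (a weighted 0–5 mean doubled to the 0–10 view) are assumptions for illustration only.

```python
DIMENSIONS = ("evidence", "intake", "humanity", "clarity", "self_correction")


def blend_percentages(sliders):
    """Normalise 0-10 slider values into percentages that sum to 100."""
    for name in DIMENSIONS:
        if not 0 <= sliders[name] <= 10:
            raise ValueError(f"{name} slider must be between 0 and 10")
    total = sum(sliders[name] for name in DIMENSIONS)
    if total == 0:
        # Assumed fallback: all-zero sliders mean equal weights.
        return {name: 100 / len(DIMENSIONS) for name in DIMENSIONS}
    return {name: 100 * sliders[name] / total for name in DIMENSIONS}


def weighted_headline(dimension_scores, blend):
    """Weighted mean of 0-5 dimension scores, shown on the 0-10 view."""
    weighted_0_to_5 = sum(dimension_scores[n] * blend[n] / 100 for n in DIMENSIONS)
    return 2 * weighted_0_to_5


all_fives = {name: 5 for name in DIMENSIONS}
all_tens = {name: 10 for name in DIMENSIONS}
# Only the ratios matter, so both settings give the same blend.
print(blend_percentages(all_fives) == blend_percentages(all_tens))
```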

Only the row for the client and scenario chosen above is written. To add another scenario for the same client, change the scenario dropdown and save again. Use Edit in the table to reopen a row without touching others.
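The "one saved row per client + scenario" behaviour amounts to an upsert keyed by that pair. The sketch below uses an in-memory dict as a stand-in for whatever storage the server actually uses; `save_weights` and the sample ids are hypothetical.

```python
def save_weights(rows, client, scenario_id, blend):
    """Write only the row for (client, scenario_id); leave other rows untouched."""
    rows[(client, scenario_id)] = dict(blend)  # copy so later edits do not leak in
    return rows


rows = {}
save_weights(rows, "acme", "scenario-a", {"evidence": 40, "clarity": 60})
# A second scenario for the same client is a separate row.
save_weights(rows, "acme", "scenario-b", {"evidence": 50, "clarity": 50})
# Saving the same pair again replaces that row only.
save_weights(rows, "acme", "scenario-a", {"evidence": 30, "clarity": 70})
print(len(rows))
```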

LLM API access

Keys are read from .env (or *_FILE variables) next to server.py. Restart uvicorn after editing.
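A common way to support both a plain variable and a `*_FILE` variant is sketched below. It assumes the `.env` values are already loaded into the process environment, and that the `*_FILE` path takes precedence when both are set; the precedence, the `read_secret` name, and `EXAMPLE_API_KEY` are assumptions rather than the documented behaviour of server.py.

```python
import os
from pathlib import Path


def read_secret(name, environ=None):
    """Return a key from NAME_FILE (path to a file) or NAME, else None."""
    env = os.environ if environ is None else environ
    file_path = env.get(f"{name}_FILE")
    if file_path:
        # The file holds only the secret; strip the trailing newline.
        return Path(file_path).read_text(encoding="utf-8").strip() or None
    value = env.get(name)
    return value.strip() if value else None


# Hypothetical variable name, passed in explicitly instead of using os.environ.
print(read_secret("EXAMPLE_API_KEY", {"EXAMPLE_API_KEY": "sk-inline"}))
```

Because values are read at process start in this sketch, a change to the file or variable only takes effect after a restart, which matches the note about restarting uvicorn.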

Per-provider snapshot from your server keys (keys never appear in the browser). OpenAI shows identity and billing hints when the key can reach OpenAI’s billing endpoints; Anthropic and OpenRouter balances are not exposed on standard API routes.

Provider cards refresh when you open this tab or use Model Costs / Batch Run → Refresh pricing.

Storage debug

Snapshot loads when you click Individual Results → Refresh (same as the primary admin refresh).

Human sessions

Ends every human run still In progress or Pending verification (sets status to Completed). Does not stop agent batch runs (status Agent running).
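The status change described above can be sketched as a filter over run records. The record shape (a dict with a `status` key) and the function name are assumptions; the status strings are the ones shown on this page.

```python
OPEN_HUMAN_STATUSES = {"In progress", "Pending verification"}


def end_human_sessions(runs):
    """Mark open human runs as Completed and return how many were changed."""
    ended = 0
    for run in runs:
        if run["status"] in OPEN_HUMAN_STATUSES:
            run["status"] = "Completed"
            ended += 1
    return ended


runs = [
    {"id": 1, "status": "In progress"},
    {"id": 2, "status": "Pending verification"},
    {"id": 3, "status": "Agent running"},  # agent batch runs are left alone
    {"id": 4, "status": "Completed"},
]
print(end_human_sessions(runs))
```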

User feedback

Anonymous or named feedback from sim and admin pages for this product.