Vantage RuntimeAI

1. Product Outline: What is RuntimeAI?

Vantage RuntimeAI is a continuous behavior evaluation platform for production-grade AI agents. It is built specifically for teams whose evaluation pipelines have outgrown internal DIY scripts or static spreadsheets, and who need deep visibility into how multi-turn agents execute complex workflows before going live.

Core Capabilities

Realistic Scenario Libraries: Instead of testing single golden-output passes, the engine runs agents through dynamic, multi-turn conversational templates mapping directly to real failure vectors.
Deterministic Heuristic Rubrics: To avoid the high latency, variable costs, and unpredictability of using an LLM to judge another LLM, scorecards rely strictly on auditable keyword and structural heuristics.
Side-by-Side Model Diffing: Operators can link parallel agent simulations by a comparison ID to isolate how a prompt change or model upgrade affects conversational logic.
Non-Engineering Readability: RuntimeAI captures intermediate agent steps, reasoning paths, and tool calls, translating raw JSON logs into structured performance bands and scorecards that product leaders can digest without a translator.
Custom Scenario Authoring: From the sim toolbar, choose “+ Create new scenario…” and paste a brief plus optional schema, SQL, policy, or API context (see Scenario assist). Draft generation takes about 30–60 seconds, and the saved scenario can be run immediately from the scenario dropdown.
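
To make the idea of a deterministic keyword heuristic concrete, here is a minimal sketch. The function name, the phrase lists, and the scoring rule are illustrative assumptions, not RuntimeAI's actual rubric logic; the point is only that the same transcript always yields the same score and every point can be traced to a phrase.

```python
from typing import Iterable


def score_dimension(transcript: str,
                    required: Iterable[str],
                    forbidden: Iterable[str]) -> int:
    """Score one rubric dimension from 0 to 5 (hypothetical rule).

    Base score: share of required phrases found, scaled to 5 (floored).
    Penalty: one point per forbidden phrase found.
    """
    text = transcript.lower()
    required = list(required)
    hits = sum(1 for phrase in required if phrase.lower() in text)
    base = 5 * hits // len(required) if required else 5
    penalty = sum(1 for phrase in forbidden if phrase.lower() in text)
    return max(0, min(5, base - penalty))
```

Because the rule is plain substring matching, a reviewer can audit a score by searching the transcript for the listed phrases.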

Scope Boundaries

What it is: A specialized behavioral evaluation framework for production agents.
What it isn’t: A replacement for live production observability tools or distributed trace loggers such as LangSmith or Datadog, or an API model gateway.

2. Adoption surfaces

Pick the surface that matches how you want to evaluate — all six share the same scenario catalog and deterministic rubrics. See /runtimeai/docs for method landing pages and install snippets.

Preflight: Open question → scenario picks → cost forecast → funded check-ride on your context before production spend.
Editor / MCP: Run pip install runtimeai-ide to forecast, suggest, generate, and preflight from Cursor or the terminal; run local check-rides with vantage-core.
Simulator: Browser sandbox for live multi-turn agent check-rides, side-by-side model comparison, and in-session scorecards.
API: Self-serve HTTP API. Request a key, quote cost, submit parallel evaluations (up to five models), and poll for scorecards. Requires RUNTIMEAI_API_ENABLED=1 on the host.
vantage-core: Open-source Python SDK for local multi-turn simulations with exit-code CI gates in GitHub Actions; no cloud account required.
Console: Run library, batch model sweeps, aggregated test charts, and per-run scorecard and transcript review for your account.

3. Step-by-Step Testing Guide

This sandbox is fully decoupled and accessible via the production URL. Follow these nine workflows to stress-test the core engine, including the programmatic API path.

Run a Preflight on your open question
Action
- Open /runtimeai/preflight and describe the decision in plain language (what if / how could we / what would it cost to).
- Paste workflow notes, policy, or SQL context so the check-ride matches your stack.
- Click Generate scenario — review picks, forecast, and per-model scores before any production spend.
- From Cursor: runtimeai-ide preflight --question "…" --file path/to/context.sql (same flow as the web UI).
What to evaluate

Do the brief and scenario match your real decision? Are pass-rate and cost clear enough to choose a model before rollout?
Wire Editor / MCP in Cursor
Action
- Run pip install runtimeai-ide and follow the Editor / MCP guide.
- Add the MCP server to ~/.cursor/mcp.json, restart Cursor, and run runtimeai_doctor.
- Try runtimeai_suggest_scenario and runtimeai_forecast_cost on an open SQL file — no API key required.
What to evaluate

Can your team get scenario + cost answers from the editor without cloning the Vantage monorepo?
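
For the mcp.json step, a small helper like the one below can add a server entry without overwriting servers that are already configured. It relies on Cursor's top-level "mcpServers" key; the server name "runtimeai" and the command placeholder are assumptions, and the real launch command should be taken from the Editor / MCP guide.

```python
import json
from typing import List


def add_mcp_server(config: dict, name: str, command: str, args: List[str]) -> dict:
    """Return a copy of an mcp.json config with one server added or replaced."""
    servers = dict(config.get("mcpServers", {}))
    servers[name] = {"command": command, "args": list(args)}
    return {**config, "mcpServers": servers}


# Example with placeholder values (hypothetical, not the documented command).
existing = {"mcpServers": {"other-tool": {"command": "other", "args": []}}}
updated = add_mcp_server(existing, "runtimeai", "<command from the guide>", [])
print(json.dumps(updated, indent=2))
```

The helper returns a new dictionary rather than mutating its input, so the original file contents can be kept as a backup before writing.
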
Execute a Single Agent Check-Ride
Action
- Navigate directly to the live agent simulation terminal: /runtimeai/sim
- Use the dropdown toolbar to select one of the pre-built scenarios, such as Support Escalation, Billing Refund Dispute, Discovery Call, or Bug Triage Initial Screening for Product Backlog — or + Create new scenario… to author your own from a brief (about 30–60 seconds, then run instantly).
- Select an OpenRouter model from the sampled list available in the dropdown and set agent turns per eval run (1–24) — how many back-and-forth conversation steps the agent may take before the scenario closes or the limit is reached.
- Click “Start Sim” to watch the primary agent and the adversarial counterpart alternate roles in a live, automated Slack-style DM thread.
What to evaluate

Observe how the primary agent maintains long-turn coherence, handles boundaries, or triggers escalation paths as agent turns run down toward the limit.
Run a Side-by-Side Model Comparison
Action
- Inside the simulation terminal, toggle the “Compare 2 models” configuration checkbox.
- Select two different OpenRouter models from the sampled dropdown list to run in parallel against the exact same scenario framework.
- Trigger the simulation and review the dual-column, side-by-side live transcripts.
What to evaluate

Verify whether the layout makes prompt regression testing easier and lets you quickly spot behavioral discrepancies between model versions.
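
Once two runs share a comparison ID, the behavioral difference can be summarized as a per-dimension score delta. The sketch below assumes a hypothetical run record with "comparison_id", "model", and "scores" fields; the actual export format may differ.

```python
from collections import defaultdict
from typing import Dict, List


def dimension_deltas(runs: List[dict]) -> Dict[str, Dict[str, int]]:
    """For each comparison ID with exactly two runs, return the
    second run's score minus the first run's score per dimension."""
    by_id = defaultdict(list)
    for run in runs:
        by_id[run["comparison_id"]].append(run)

    deltas: Dict[str, Dict[str, int]] = {}
    for comparison_id, pair in by_id.items():
        if len(pair) != 2:
            continue  # unpaired or over-linked runs are skipped
        baseline, candidate = pair
        deltas[comparison_id] = {
            dim: candidate["scores"][dim] - baseline["scores"][dim]
            for dim in baseline["scores"]
        }
    return deltas
```

A negative delta on any dimension marks a regression introduced by the second model or prompt.
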
Grade the Heuristic Scorecard Output
Action
- Allow a simulation run to complete naturally, or manually click “End Sim” to halt the thread.
- Review the automated scorecard modal, which grades performance across 5 scenario-specific dimensions scored 0–5 each (scaled to a final /10 rating).
- Look closely at the performance bands (Strong, Solid, Developing, Weak) and the transcript phrase highlights flagging failure patterns.
- Test exporting the results by downloading the completed PDF scorecard artifact.
What to evaluate

Assess whether these deterministic, rule-based rubrics provide an objective, repeatable baseline that you trust more than a subjective LLM judge.
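
The scorecard arithmetic can be sketched as follows. Five dimensions scored 0–5 give a maximum of 25 points; the code assumes a plain linear, unweighted scaling to /10. The band cut-offs are hypothetical placeholders, since the guide names the bands but not their thresholds.

```python
from typing import Dict

# Hypothetical cut-offs on the /10 rating; the real thresholds are not documented here.
BANDS = [(8.0, "Strong"), (6.0, "Solid"), (4.0, "Developing"), (0.0, "Weak")]


def final_rating(scores: Dict[str, int]) -> float:
    """Scale five 0-5 dimension scores to a /10 rating (assumed linear)."""
    if len(scores) != 5:
        raise ValueError("expected exactly 5 dimensions")
    if any(not 0 <= s <= 5 for s in scores.values()):
        raise ValueError("each dimension must be scored 0-5")
    return sum(scores.values()) * 10 / 25


def band(rating: float) -> str:
    """Map a /10 rating to a performance band using the assumed cut-offs."""
    for floor, label in BANDS:
        if rating >= floor:
            return label
    raise ValueError("rating must be non-negative")
```

For example, dimension scores of 5, 4, 4, 3, and 4 total 20 of 25 points, which scales to 8.0 out of 10.
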
Run an Evaluation via the HTTP API
Action
- Open the interactive API docs and playground: /runtimeai/api/
- Request an API key by email (or use the dev key when running locally with DEV_BYPASS_AUTH).
- In the playground, run the flow: health check (no key) → quote → submit evaluation → poll status until complete.
- Try a multi-model request (up to five models on one scenario) and download JSON or PDF scorecards from the poll response.
- Sign in with your email (profile icon) and confirm the same runs appear under Runs (/runtimeai/admin).
What to evaluate

Is the quote → submit → poll flow clear enough for CI or scripting? Do scorecards and comparison output match what you saw in the Sim UI?
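
For scripting, the health → quote → submit → poll flow might look like the sketch below. The endpoint paths, payload fields, and the "state" and "id" response keys are all assumptions made for illustration; consult the interactive docs at /runtimeai/api/ for the real contract. The HTTP layer is injected as a callable so the flow can be exercised without a network.

```python
import time
from typing import Callable, List, Optional

# call(method, path, payload) -> parsed JSON response (hypothetical transport)
Transport = Callable[[str, str, Optional[dict]], dict]

MAX_MODELS = 5  # "up to five models" per the guide


def run_evaluation(call: Transport, scenario: str, models: List[str],
                   poll_interval: float = 2.0, max_polls: int = 150) -> dict:
    """Run health check, quote, submit, then poll until complete."""
    if not 1 <= len(models) <= MAX_MODELS:
        raise ValueError(f"submit between 1 and {MAX_MODELS} models")
    call("GET", "/health", None)                 # no key required
    body = {"scenario": scenario, "models": models}
    quote = call("POST", "/quote", body)         # cost estimate before spend
    job = call("POST", "/evaluations", body)     # submit the evaluation
    for _ in range(max_polls):
        status = call("GET", f"/evaluations/{job['id']}", None)
        if status["state"] == "complete":
            return {"quote": quote, "result": status}
        time.sleep(poll_interval)
    raise TimeoutError("evaluation did not complete in time")
```

Bounding the number of polls keeps a CI job from hanging indefinitely if a run stalls.
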
Gate regressions with vantage-core in CI
Action
- Clone the open-source SDK: github.com/vantage-ai-eng/vantage-core
- Follow the README to run a local multi-turn simulation against your agent endpoint using your own API keys and environment variables.
- Wire the exit-code gate into GitHub Actions (or your CI runner) so prompt or model changes fail the build when rubric scores drop below your threshold.
What to evaluate

Does the local SDK give engineers a fast inner loop before they promote changes to cloud batch evals via the HTTP API?
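
An exit-code gate reduces to a function that returns non-zero when any score falls below the threshold. The sketch below does not use the vantage-core API, which is documented in its README; the scenario scores passed in and the threshold value are assumed inputs.

```python
import sys
from typing import Dict


def gate(scores: Dict[str, float], threshold: float) -> int:
    """Return a process exit code: 0 if every scenario meets the
    threshold, 1 if any scenario falls below it."""
    failing = {name: s for name, s in scores.items() if s < threshold}
    for name, s in sorted(failing.items()):
        print(f"FAIL {name}: {s:.1f} < {threshold:.1f}", file=sys.stderr)
    return 1 if failing else 0
```

A CI step would end with something like sys.exit(gate(scores, threshold=7.0)); GitHub Actions marks the step as failed on any non-zero exit code.
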
Explore Console & Batch Operations
Action
- Navigate to the Console: /runtimeai/admin
- Open the “Individual Results” tab, expand a completed run row, and use the “Ask about this run” panel to run direct LLM Q&A prompts over the transcript text.
- Navigate to the “Model Costs / Batch Run” tab to inspect the baseline pricing grid and review the interface used to execute multi-model batch sweeps.
- Review the “Model Test Results” tab to evaluate how the engine aggregates mean rubric scores and renders comparative radar or bar charts across scenarios.
What to evaluate

Determine if the Console gives a product manager enough aggregated visibility to make concrete, cost-justified model migration decisions.
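
The aggregation behind the charts can be approximated as a mean rating per model and scenario. The record fields ("model", "scenario", "rating") are hypothetical stand-ins for whatever the Console stores.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List


def mean_rating_by_model(runs: List[dict]) -> Dict[str, Dict[str, float]]:
    """Group runs by model and scenario, then average the /10 ratings."""
    buckets = defaultdict(lambda: defaultdict(list))
    for run in runs:
        buckets[run["model"]][run["scenario"]].append(run["rating"])
    return {
        model: {scenario: mean(ratings) for scenario, ratings in scenarios.items()}
        for model, scenarios in buckets.items()
    }
```

The nested result maps directly onto a grouped bar chart (one group per scenario, one bar per model) or one radar trace per model.
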
Create a Scenario from Your Brief or Context
Action
- On /runtimeai/sim, open the scenario dropdown and select + Create new scenario… — or read the full Scenario assist guide.
- Paste a natural-language brief (10+ characters) describing roles, situation, and what to test.
- Optionally expand Optional context and paste or upload schema DDL, sample SQL, refund policy, or OpenAPI routes (40+ characters per block).
- Click Generate draft — this usually takes 30–60 seconds — then review suggested family, rubric dimensions, title, roles, and briefing.
- Click Save & use. The new scenario appears in the dropdown immediately; select it and run the sim without leaving the page.
What to evaluate

For task-execution context, does the draft embed concrete fixtures from your schema? For policy context, do rubric hints reflect your boundary language? Is the turnaround fast enough for ad-hoc scenarios during a product review?
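
The input minimums from the steps above (10+ characters for the brief, 40+ characters per context block) can be checked before submitting a draft request. This is a client-side sketch; whether the service trims whitespace before counting is an assumption.

```python
from typing import List, Optional

MIN_BRIEF_CHARS = 10    # stated minimum for the brief
MIN_CONTEXT_CHARS = 40  # stated minimum per optional context block


def validate_draft_request(brief: str,
                           context_blocks: Optional[List[str]] = None) -> List[str]:
    """Return a list of validation errors; an empty list means the input passes."""
    errors = []
    if len(brief.strip()) < MIN_BRIEF_CHARS:
        errors.append(f"brief must be at least {MIN_BRIEF_CHARS} characters")
    for i, block in enumerate(context_blocks or [], start=1):
        if len(block.strip()) < MIN_CONTEXT_CHARS:
            errors.append(
                f"context block {i} must be at least {MIN_CONTEXT_CHARS} characters"
            )
    return errors
```

Returning every error at once, rather than raising on the first, lets a script report all problems in a single pass.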

4. Initial Feedback Benchmarks

As you test the sandbox, we are specifically looking for your unvarnished product perspective on these three structural elements:

Rubric Usefulness: Do the scenario-specific rubric dimensions (like Diagnostic Intake or Qualification Judgment) capture the actual qualitative metrics you care about as a PM?
Sim vs. API workflow: Try both the Sim UI and the HTTP API — what is still missing for CI/CD, webhooks, or a CLI layer your team would adopt?
Custom Authoring: Try + Create new scenario… on the sim toolbar — brief plus optional schema or policy context (Scenario assist). Does the draft embed your fixtures and rubric hints accurately?

5. Share your feedback

Use the form below after testing the sandbox, or email hello@vantageai.cc directly.

