1. Product Outline: What is RuntimeAI?
Vantage RuntimeAI is a continuous behavior evaluation platform for production-grade AI agents. It is built specifically for teams whose evaluation pipelines have outgrown internal DIY scripts or static spreadsheets, and who need deep visibility into how multi-turn agents execute complex workflows before going live.
Core Capabilities
- Adversarial Scenario Libraries: Instead of checking a single golden-output pass, the engine runs agents through dynamic, multi-turn conversational templates that map directly to real failure modes.
- Deterministic Heuristic Rubrics: To avoid the high latency, variable costs, and unpredictability of using an LLM to judge another LLM, scorecards rely strictly on auditable keyword and structural heuristics.
- Side-by-Side Model Diffing: Operators can link parallel agent simulations by a comparison ID to isolate how a prompt change or model upgrade affects conversational logic.
- Non-Engineering Readability: RuntimeAI captures intermediate agent steps, reasoning paths, and tool calls, translating raw JSON logs into structured performance bands and scorecards that product leaders can digest without a translator.
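To make the "keyword and structural heuristics" idea concrete, here is a minimal Python sketch of a deterministic rubric check. The rule labels, regex patterns, point values, and the `score_dimension` helper are hypothetical illustrations, not RuntimeAI's actual rubric definitions.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class HeuristicRule:
    """One auditable check: a regex pattern and the points it awards."""
    label: str
    pattern: str
    points: int


# Hypothetical rules for an illustrative "escalation handling" dimension.
ESCALATION_RULES = [
    HeuristicRule("offers_handoff", r"\b(escalate|hand you over|connect you with)\b", 2),
    HeuristicRule("asks_clarifying_question", r"\?\s*$", 1),
    HeuristicRule("acknowledges_issue", r"\b(sorry|understand|apologi[sz]e)\b", 2),
]


def score_dimension(agent_turns: list[str], rules: list[HeuristicRule], cap: int = 5) -> int:
    """Award each rule's points once if any agent turn matches, capped at `cap`."""
    total = 0
    for rule in rules:
        regex = re.compile(rule.pattern, re.IGNORECASE | re.MULTILINE)
        if any(regex.search(turn) for turn in agent_turns):
            total += rule.points
    return min(total, cap)
```

Because every rule is a plain pattern match, the same transcript always yields the same score, and a reviewer can trace each point back to the rule that awarded it.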
Scope Boundaries
- What it is: A specialized behavioral evaluation framework for production agents.
- What it isn’t: It does not replace live production observability tools or distributed trace loggers like LangSmith or Datadog, nor does it serve as an API model gateway.
2. Step-by-Step Testing Guide
This preliminary sandbox environment is fully decoupled and accessible via the production URL. Follow these four distinct workflows to stress-test the core engine.
Execute a Single Agent Check-Ride
Action
- Navigate directly to the live agent simulation terminal: /runtimeai/sim
- Use the dropdown toolbar to select one of the pre-built scenarios, such as Support Escalation, Discovery Call, or Bug Triage Initial Screening for Product Backlog.
- Select an OpenRouter model from the sampled list available in the dropdown and set a turn budget (1–24 turns).
- Click “Start Sim” to watch the primary agent and the adversarial counterpart alternate turns in a live, automated Slack-style DM thread.
What to evaluate
Observe how the primary agent maintains coherence over long conversations, handles boundaries, and triggers escalation paths as the turn budget runs down.
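The check-ride loop can be pictured roughly as follows. This is a sketch under stated assumptions: the `run_sim` function and `Responder` type are hypothetical names, each message is counted as one turn, and the adversarial counterpart is assumed to open the thread. Only the 1–24 budget range comes from the guide.

```python
from typing import Callable

MIN_TURNS, MAX_TURNS = 1, 24  # budget range stated in the guide

# A responder receives the transcript so far and returns its next message.
Responder = Callable[[list[tuple[str, str]]], str]


def run_sim(primary: Responder, adversary: Responder, turn_budget: int) -> list[tuple[str, str]]:
    """Alternate adversary and primary messages until the turn budget is spent."""
    if not MIN_TURNS <= turn_budget <= MAX_TURNS:
        raise ValueError(f"turn_budget must be between {MIN_TURNS} and {MAX_TURNS}")
    transcript: list[tuple[str, str]] = []
    for turn in range(turn_budget):
        # Assumption: the adversarial counterpart speaks first, then they alternate.
        if turn % 2 == 0:
            speaker, responder = "adversary", adversary
        else:
            speaker, responder = "primary", primary
        transcript.append((speaker, responder(transcript)))
    return transcript
```

In a real run the two responders would wrap model calls. Stub functions that return fixed strings are enough to exercise the loop.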
Run a Side-by-Side Model Comparison
Action
- Inside the simulation terminal, toggle the “Compare 2 models” configuration checkbox.
- Select two different OpenRouter models from the sampled dropdown list to run in parallel against the exact same scenario framework.
- Trigger the simulation and review the dual-column, side-by-side live transcripts.
What to evaluate
Check whether the layout makes prompt regression testing easier and lets you quickly spot behavioral discrepancies between the two models.
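For intuition about how parallel runs linked by a comparison ID might be paired and diffed, consider the sketch below. The `RunRecord` shape and both helper functions are hypothetical. Exact string comparison is a deliberate simplification, since two models will rarely produce identical text, so a real diff would compare behavior at a coarser level.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RunRecord:
    """Hypothetical record for one simulation run."""
    comparison_id: str
    model: str
    agent_turns: tuple[str, ...]


def pair_by_comparison_id(runs: list[RunRecord]) -> dict[str, tuple[RunRecord, RunRecord]]:
    """Group runs by comparison ID and keep only IDs with exactly two runs."""
    groups: dict[str, list[RunRecord]] = defaultdict(list)
    for run in runs:
        groups[run.comparison_id].append(run)
    return {cid: (g[0], g[1]) for cid, g in groups.items() if len(g) == 2}


def first_divergence(a: RunRecord, b: RunRecord) -> Optional[int]:
    """Return the index of the first agent turn where the runs differ, or None."""
    for i, (x, y) in enumerate(zip(a.agent_turns, b.agent_turns)):
        if x != y:
            return i
    if len(a.agent_turns) != len(b.agent_turns):
        # One run continued after the other stopped.
        return min(len(a.agent_turns), len(b.agent_turns))
    return None
```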
Grade the Heuristic Scorecard Output
Action
- Allow a simulation run to complete naturally, or manually click “End Sim” to halt the thread.
- Review the automated scorecard modal, which grades performance across 5 scenario-specific dimensions scored 0–5 each (scaled to a final /10 rating).
- Look closely at the performance bands (Strong, Solid, Developing, Weak) and the transcript phrase highlights flagging failure patterns.
- Test exporting the results by downloading the completed PDF scorecard artifact.
What to evaluate
Assess whether these deterministic, rule-based rubrics provide an objective, repeatable baseline that you trust more than a subjective LLM judge.
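The scorecard arithmetic can be sketched as follows, assuming a simple linear scaling: five dimensions at 0–5 each give a raw maximum of 25, which maps to 10. The band cutoffs and all dimension names except "Diagnostic Intake" are placeholders, since this guide does not state the real boundaries or any dimension weights.

```python
DIMENSION_COUNT = 5      # dimensions per scorecard, per the guide
MAX_PER_DIMENSION = 5    # each dimension is scored 0-5

# Placeholder cutoffs on the /10 scale; the real band boundaries are not documented here.
HYPOTHETICAL_BANDS = [(8.0, "Strong"), (6.0, "Solid"), (4.0, "Developing"), (0.0, "Weak")]


def final_rating(dimension_scores: dict[str, int]) -> float:
    """Scale the raw total (max 25) linearly to a rating out of 10."""
    if len(dimension_scores) != DIMENSION_COUNT:
        raise ValueError(f"expected {DIMENSION_COUNT} dimensions")
    for name, score in dimension_scores.items():
        if not 0 <= score <= MAX_PER_DIMENSION:
            raise ValueError(f"{name} must be between 0 and {MAX_PER_DIMENSION}")
    raw_total = sum(dimension_scores.values())
    return round(raw_total * 10 / (DIMENSION_COUNT * MAX_PER_DIMENSION), 1)


def band_for(rating: float) -> str:
    """Map a /10 rating to a band using the placeholder cutoffs above."""
    for cutoff, label in HYPOTHETICAL_BANDS:
        if rating >= cutoff:
            return label
    raise ValueError("rating must be non-negative")
```

For example, dimension scores of 5, 4, 3, 4, and 4 total 20 out of 25, which scales to 8.0 out of 10.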
Explore Administrative & Batch Operations
Action
- Navigate to the administrative dashboard console: /runtimeai/admin
- Open the “Individual Results” tab, expand a completed run row, and use the “Ask about this run” panel to run direct LLM Q&A prompts over the transcript text.
- Navigate to the “Model Costs / Batch Run” tab to inspect the baseline pricing grid and review the interface used to execute multi-model batch sweeps.
- Review the “Model Test Results” tab to evaluate how the engine aggregates mean rubric scores and renders comparative radar or bar charts across scenarios.
What to evaluate
Determine if this administrative shell gives a product manager enough aggregated visibility to make concrete, cost-justified model migration decisions.
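As a rough picture of the aggregation behind the “Model Test Results” tab, the sketch below averages ratings per model and scenario. The input tuple shape, the `mean_scores_by_model` name, and the model names in the usage note are assumptions for illustration. The platform's real data model is not described in this guide.

```python
from collections import defaultdict
from statistics import mean


def mean_scores_by_model(results: list[tuple[str, str, float]]) -> dict[str, dict[str, float]]:
    """Average (model, scenario, rating) rows into {model: {scenario: mean rating}}."""
    buckets: dict[str, dict[str, list[float]]] = defaultdict(lambda: defaultdict(list))
    for model, scenario, rating in results:
        buckets[model][scenario].append(rating)
    return {
        model: {scenario: round(mean(ratings), 2) for scenario, ratings in scenarios.items()}
        for model, scenarios in buckets.items()
    }
```

Feeding it two "Support Escalation" runs for a hypothetical `model-a` rated 8.0 and 6.0 gives that model a mean of 7.0 for the scenario, which is the kind of number a radar or bar chart would then plot.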
3. Initial Feedback Benchmarks
As you test the sandbox, we are specifically looking for your unvarnished product perspective on these three structural elements:
- Rubric Usefulness: Do the scenario-specific rubric dimensions (like Diagnostic Intake or Qualification Judgment) capture the actual qualitative metrics you care about as a PM?
- UI vs. Pipeline Workflow: Since this preliminary version is completely UI-driven, what are your team’s hard requirements for moving this engine into a programmatic CI/CD pipeline, whether through a webhook or a CLI?
- Custom Authoring: Try pasting a custom natural-language brief into the “+ Create new scenario…” generator. Does the resulting draft system prompt accurately scope the vertical boundaries your business depends on?
4. Share your feedback
Use the form below after testing the sandbox, or email simon@vantageai.cc directly.