Product Overview & Testing Guide

A standalone reference for AI, Product, and Engineering teams stress-testing the RuntimeAI sandbox (scenarios, rubrics, model comparison, and admin visibility) without platform login.

1. Product Outline: What is RuntimeAI?

Vantage RuntimeAI is a continuous behavior evaluation platform for production-grade AI agents. It is built specifically for teams whose evaluation pipelines have outgrown internal DIY scripts or static spreadsheets, and who need deep visibility into how multi-turn agents execute complex workflows before going live.

Core Capabilities

Scope Boundaries

2. Step-by-Step Testing Guide

This preliminary sandbox runs as a fully decoupled environment and is accessible via the production URL. Work through the four workflows below to stress-test the core engine.

  1. Execute a Single Agent Check-Ride

    Action

    • Navigate directly to the live agent simulation terminal: /runtimeai/sim
    • Use the dropdown toolbar to select one of the pre-built scenarios, such as Support Escalation, Discovery Call, or Bug Triage Initial Screening for Product Backlog.
    • Select an OpenRouter model from the sampled list available in the dropdown and set a turn budget (1–24 turns).
    • Click “Start Sim” to watch the primary agent and the adversarial counterpart alternate turns in a live, automated Slack-style DM thread.

    What to evaluate

    Observe whether the primary agent stays coherent over many turns, how it handles boundaries, and whether it triggers escalation paths as the turn budget runs down.
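
    The turn-budgeted alternation described above can be sketched roughly as follows. This is an illustration only, not the RuntimeAI engine: the function run_sim, the stub speakers, the choice that the adversarial counterpart opens the thread, and the convention that one turn equals one message are all assumptions. Only the 1–24 budget range comes from this guide.

```python
from typing import Callable, List, Tuple

MIN_TURNS, MAX_TURNS = 1, 24  # turn budget range stated in the guide

# A speaker is any function mapping the transcript so far to the next message.
Speaker = Callable[[List[Tuple[str, str]]], str]


def run_sim(primary: Speaker, adversary: Speaker, turn_budget: int) -> List[Tuple[str, str]]:
    """Alternate adversary and primary messages until the budget is spent."""
    if not MIN_TURNS <= turn_budget <= MAX_TURNS:
        raise ValueError(f"turn budget must be {MIN_TURNS}-{MAX_TURNS}, got {turn_budget}")
    transcript: List[Tuple[str, str]] = []
    for turn in range(turn_budget):
        # Assumption: the adversarial counterpart opens, then speakers alternate.
        name, speaker = ("adversary", adversary) if turn % 2 == 0 else ("primary", primary)
        transcript.append((name, speaker(transcript)))
    return transcript


# Hypothetical stub speakers standing in for real model calls.
def stub_adversary(history: List[Tuple[str, str]]) -> str:
    return f"customer message {len(history) // 2 + 1}"


def stub_primary(history: List[Tuple[str, str]]) -> str:
    return f"agent reply to: {history[-1][1]}"


demo = run_sim(stub_primary, stub_adversary, turn_budget=4)
```

    With a budget of 4, demo holds four messages alternating between the two speakers. Swapping the stubs for real model calls would leave the loop unchanged.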

  2. Run a Side-by-Side Model Comparison

    Action

    • Inside the simulation terminal, toggle the “Compare 2 models” configuration checkbox.
    • Select two different OpenRouter models from the sampled dropdown list to run in parallel against the exact same scenario framework.
    • Trigger the simulation and review the dual-column, side-by-side live transcripts.

    What to evaluate

    Check whether the side-by-side layout makes prompt regression testing easier and lets you quickly spot behavioral discrepancies between the two models, or between versions of the same model.
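
    For readers thinking about how a parallel comparison might work outside the UI, here is a minimal sketch. It is not the sandbox's implementation: compare_models, first_divergence, the stub runner, and the model names "model-a" and "model-b" are hypothetical. It runs one scenario against two models concurrently and reports the first turn at which the transcripts differ.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def compare_models(run_scenario: Callable[[str, str], List[str]],
                   scenario: str, model_a: str, model_b: str) -> Dict[str, List[str]]:
    """Run the same scenario against two models concurrently."""
    if model_a == model_b:
        raise ValueError("pick two different models to compare")
    with ThreadPoolExecutor(max_workers=2) as pool:
        fut_a = pool.submit(run_scenario, scenario, model_a)
        fut_b = pool.submit(run_scenario, scenario, model_b)
        return {model_a: fut_a.result(), model_b: fut_b.result()}


def first_divergence(a: List[str], b: List[str]) -> int:
    """Index of the first turn where transcripts differ, or -1 if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return -1 if len(a) == len(b) else min(len(a), len(b))


# Hypothetical stub runner standing in for a real simulation call.
def stub_run(scenario: str, model: str) -> List[str]:
    replies = ["Hi, how can I help?", "Let me check that."]
    if model == "model-b":
        replies[1] = "Escalating to a human."
    return replies


result = compare_models(stub_run, "Support Escalation", "model-a", "model-b")
diverged_at = first_divergence(result["model-a"], result["model-b"])
```

    Exact string comparison is a deliberately naive choice here. Real model outputs rarely match verbatim, so a practical comparison would more likely diff rubric scores or flagged phrases than raw text.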

  3. Grade the Heuristic Scorecard Output

    Action

    • Allow a simulation run to complete naturally, or manually click “End Sim” to halt the thread.
    • Review the automated scorecard modal, which grades performance across five scenario-specific dimensions, each scored 0–5, with the total scaled to a final rating out of 10.
    • Look closely at the performance bands (Strong, Solid, Developing, Weak) and the transcript phrase highlights flagging failure patterns.
    • Test the export by downloading the completed scorecard as a PDF.

    What to evaluate

    Assess whether these deterministic, rule-based rubrics provide an objective, repeatable baseline that you trust more than a subjective LLM judge.
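
    The scorecard arithmetic (five dimensions, 0–5 each, scaled to a rating out of 10) can be worked through in a few lines. This sketch assumes the scaling is linear over the 25-point raw maximum, which the guide does not state explicitly. The function name final_rating and the placeholder dimension names other than Diagnostic Intake are hypothetical. Band cut-offs for Strong, Solid, Developing, and Weak are not given in this guide, so they are left out.

```python
from typing import Dict

DIMENSIONS_PER_SCENARIO = 5  # from the guide
MAX_PER_DIMENSION = 5        # each dimension is scored 0-5


def final_rating(scores: Dict[str, int]) -> float:
    """Scale five 0-5 dimension scores to a rating out of 10 (assumed linear)."""
    if len(scores) != DIMENSIONS_PER_SCENARIO:
        raise ValueError(f"expected {DIMENSIONS_PER_SCENARIO} dimensions, got {len(scores)}")
    for name, score in scores.items():
        if not 0 <= score <= MAX_PER_DIMENSION:
            raise ValueError(f"{name}: score {score} is outside 0-{MAX_PER_DIMENSION}")
    raw_max = DIMENSIONS_PER_SCENARIO * MAX_PER_DIMENSION  # 25
    return round(sum(scores.values()) * 10 / raw_max, 1)


# Illustrative scores only; dimension names 2-5 are placeholders.
example = {
    "Diagnostic Intake": 4,
    "Dimension 2": 3,
    "Dimension 3": 5,
    "Dimension 4": 2,
    "Dimension 5": 4,
}
rating = final_rating(example)  # raw total 18 of 25 -> 7.2
```

    Because the mapping is a fixed formula, the same transcript scores always yield the same rating, which is the repeatability property the rubric approach relies on.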

  4. Explore Administrative & Batch Operations

    Action

    • Navigate to the administrative dashboard console: /runtimeai/admin
    • Open the “Individual Results” tab, expand a completed run row, and use the “Ask about this run” panel to run direct LLM Q&A prompts over the transcript text.
    • Navigate to the “Model Costs / Batch Run” tab to inspect the baseline pricing grid and review the interface used to execute multi-model batch sweeps.
    • Review the “Model Test Results” tab to evaluate how the engine aggregates mean rubric scores and renders comparative radar or bar charts across scenarios.

    What to evaluate

    Determine if this administrative shell gives a product manager enough aggregated visibility to make concrete, cost-justified model migration decisions.
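
    To make the aggregation step concrete, here is a small sketch of averaging run ratings per model and scenario, the kind of table a bar or radar chart could be drawn from. It is an assumption about the shape of the data, not the engine's actual logic: the Run tuple, the mean_scores function, the model names, and the sample ratings are invented for illustration and are not real results.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple

Run = Tuple[str, str, float]  # (model, scenario, final rating out of 10)


def mean_scores(runs: Iterable[Run]) -> Dict[str, Dict[str, float]]:
    """Average ratings per (model, scenario) pair, rounded to 2 decimals."""
    grouped: Dict[Tuple[str, str], List[float]] = defaultdict(list)
    for model, scenario, rating in runs:
        grouped[(model, scenario)].append(rating)
    table: Dict[str, Dict[str, float]] = defaultdict(dict)
    for (model, scenario), ratings in grouped.items():
        table[model][scenario] = round(mean(ratings), 2)
    return {model: dict(by_scenario) for model, by_scenario in table.items()}


# Invented sample data for illustration only.
sample_runs = [
    ("model-a", "Support Escalation", 7.2),
    ("model-a", "Support Escalation", 6.8),
    ("model-a", "Discovery Call", 8.0),
    ("model-b", "Support Escalation", 5.6),
]
summary = mean_scores(sample_runs)
```

    Here summary maps each model to its per-scenario mean, so two models can be compared on the same scenario by reading across one key.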

3. Initial Feedback Benchmarks

As you test the sandbox, we are specifically looking for your unvarnished product perspective on these three structural elements:

  1. Rubric Usefulness: Do the scenario-specific rubric dimensions (like Diagnostic Intake or Qualification Judgment) capture the actual qualitative metrics you care about as a PM?
  2. UI vs. Pipeline Workflow: This preliminary version is entirely UI-driven. What are your team’s hard requirements for running the engine programmatically, for example from a CI/CD pipeline, a webhook, or a CLI?
  3. Custom Authoring: Try pasting a custom natural-language brief into the “+ Create new scenario…” generator. Does the resulting draft system prompt accurately scope the vertical boundaries your business depends on?

4. Share your feedback

Use the form below after testing the sandbox, or email simon@vantageai.cc directly.
