Inference modes

Valar optimizes for throughput and cost on long-running agent work, not the latency of a single call. You pick an execution mode by how soon you need each result: synchronously in seconds, asynchronously in minutes, or in bulk over hours.

The three modes

Realtime

A normal synchronous request that returns the result in the same call. You send the request without setting background and read the output from the response. This is the lowest-latency path: it finishes in seconds and runs on the Now completion window. Realtime works across the Responses API (/v1/responses), Chat Completions (/v1/chat/completions), and Messages (/v1/messages). Use it for interactive chat, prototyping, and human-in-the-loop steps.

import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.valarhq.ai/v1",
    api_key=os.environ["VALAR_API_KEY"],
)

response = client.responses.create(
    model="moonshotai/Kimi-K2.6",
    input="Summarize the latest support ticket in one sentence.",
)

print(response.output_text)

Async

Set background=True on the Responses API. The create call returns a response id immediately; you then poll the retrieve endpoint or receive a webhook when the work finishes. Async runs typically finish in minutes, with higher throughput and lower cost than realtime. Set the pace with metadata.completion_window: async jobs usually run on the Standard window (API value standard), the lower-cost default that targets roughly five minutes per turn. Use async for agent loops, background jobs, and large fan-out. See Sending requests at scale and Completion windows.

# Reuses the client configured in the Realtime example.
started = client.responses.create(
    model="moonshotai/Kimi-K2.6",
    input="Classify this ticket and draft a reply.",
    background=True,  # returns a response id right away
    metadata={"completion_window": "standard"},
)

print("Queued:", started.id)
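
The polling half of that flow can be sketched as follows. This is a minimal sketch under stated assumptions, not Valar's documented client code: it assumes the retrieve endpoint is reachable through the OpenAI SDK's client.responses.retrieve, that the returned object exposes a status field, and that the terminal status names match OpenAI's (completed, failed, cancelled, incomplete). The helper name wait_for_response, the poll interval, and the timeout are hypothetical.

```python
import time
from typing import Any, Callable

# Assumed terminal statuses, mirroring the OpenAI Responses API; confirm
# them against Valar's API reference before relying on this set.
TERMINAL_STATUSES = frozenset({"completed", "failed", "cancelled", "incomplete"})


def wait_for_response(
    retrieve: Callable[[str], Any],
    response_id: str,
    interval_s: float = 5.0,
    timeout_s: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Poll retrieve(response_id) until the response reaches a terminal status.

    `retrieve` is any callable that returns an object with a `.status`
    attribute, for example `client.responses.retrieve`.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        response = retrieve(response_id)
        if response.status in TERMINAL_STATUSES:
            return response
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"{response_id} still {response.status!r} after {timeout_s}s"
            )
        # Wait before the next poll so the API is not hammered.
        sleep(interval_s)
```

With the client and started objects from the examples above, the call would look like final = wait_for_response(client.responses.retrieve, started.id). Check final.status before reading final.output_text, since a terminal status is not necessarily a successful one.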

Batch

Coming soon. Batch processing isn’t generally available yet.

Batch will let you submit many requests at once and retrieve the results when the set completes. It will offer the lowest cost and highest throughput, with the longest turnaround. It will run on the Standard window and suit large datasets, evals, and offline transforms. Until it ships, run large workloads asynchronously with background=True; see Sending requests at scale.
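
As an illustration of that interim approach, the sketch below queues one background response per input and records the returned ids. It is a hedged example, not a Valar-provided helper: submit_background is a hypothetical name, and the create argument is assumed to behave like the OpenAI SDK's client.responses.create as used in the Async example. It submits sequentially and does not handle rate limits or retries.

```python
from typing import Any, Callable, Dict, Mapping


def submit_background(
    create: Callable[..., Any],
    model: str,
    inputs: Mapping[str, str],
    completion_window: str = "standard",
) -> Dict[str, str]:
    """Queue one background response per input.

    Returns a mapping from your own key (e.g. a ticket id) to the response
    id returned by the API, so results can be retrieved later.
    """
    queued: Dict[str, str] = {}
    for key, text in inputs.items():
        started = create(
            model=model,
            input=text,
            background=True,  # returns a response id right away
            metadata={"completion_window": completion_window},
        )
        queued[key] = started.id
    return queued
```

A call might look like submit_background(client.responses.create, "moonshotai/Kimi-K2.6", {"ticket-1": "Classify this ticket."}). Keeping your own keys alongside the response ids makes it easier to match results back to source records once each job finishes.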

Compare the modes

Mode	How you call it	Typical latency	Cost	Best for
Realtime	Synchronous request, no `background`	Seconds	Highest	Interactive chat, prototyping, human-in-the-loop
Async	Responses API with `background=True`	Minutes	Lower	Agent loops, background jobs, large fan-out
Batch (coming soon)	Batches API, retrieve on completion	Up to hours	Lowest	Large datasets, evals, offline transforms

How modes relate to completion windows

A completion window sets how much wall-clock time per turn your workload can tolerate, and you pay less the more time you give it. There are exactly two windows:

Now: immediate and on-demand, at a higher rate. API value: asap.
Standard: the default, targeting roughly 5 minutes per turn, at a lower rate. API value: standard.

Realtime uses the Now window. Async uses the Standard window for lower cost (and Batch will, once available). See Completion windows and Pricing.
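
The pairing of modes and windows can be written down as a small lookup. This is a sketch only: request_kwargs and MODES are hypothetical names, and passing completion_window "asap" explicitly on a realtime request is an assumption (the Realtime example above sends no metadata at all). Batch is left out because it is not available yet.

```python
from typing import Any, Dict, Tuple

# Mode -> (send background=True?, completion_window API value), following
# the text above. Setting "asap" explicitly for realtime is an assumption.
MODES: Dict[str, Tuple[bool, str]] = {
    "realtime": (False, "asap"),
    "async": (True, "standard"),
}


def request_kwargs(mode: str, model: str, input_text: str) -> Dict[str, Any]:
    """Build keyword arguments for a Responses API create call in a given mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(MODES)}")
    background, window = MODES[mode]
    kwargs: Dict[str, Any] = {
        "model": model,
        "input": input_text,
        "metadata": {"completion_window": window},
    }
    if background:
        # Only async requests carry the background flag.
        kwargs["background"] = True
    return kwargs
```

The result can be unpacked into a create call, for example client.responses.create(**request_kwargs("async", "moonshotai/Kimi-K2.6", "Classify this ticket.")).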

Next steps

Quickstart

Send your first request with an OpenAI-compatible client.

Sending requests at scale

Fan out async and batch work across many requests.

Completion windows

Trade turn latency for cost per token.
