Valar optimizes for throughput and cost per token, not the latency of one turn. In agentic workloads the number that matters is how long a full trajectory takes end to end, not the round-trip of a single call. A completion window tells Valar how much wall-clock time per turn your workload can tolerate, and you pay less the more time you can give it. For how windows fit alongside realtime, async, and batch execution, see Inference modes.

Set the window

Pass metadata.completion_window on the request:
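# Assumes `client` is an already-configured API client.
response = client.responses.create(
    model="zai-org/GLM-5.1-FP8",
    input="Explain the key ideas behind transformers.",
    background=True,
    metadata={
        # Per-turn completion window: "asap" or "standard"
        "completion_window": "standard",
    },
)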
Accepted values are "asap" (the Now tier) and "standard".
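Because only two values are accepted, a small local check can catch typos before a request is sent. The following is a minimal sketch, not part of any Valar SDK: `build_metadata` and `ACCEPTED_WINDOWS` are hypothetical names, and the tuple simply hard-codes the two values listed above.

```python
# Hypothetical helper; not part of any Valar SDK.
ACCEPTED_WINDOWS = ("asap", "standard")  # the two values listed above


def build_metadata(window=None):
    """Return a metadata dict for the request, validating the window locally."""
    if window is None:
        # Omit the key entirely so the service applies its own default.
        return {}
    if window not in ACCEPTED_WINDOWS:
        raise ValueError(
            f"unknown completion_window {window!r}; "
            f"expected one of {ACCEPTED_WINDOWS}"
        )
    return {"completion_window": window}
```

The result can be passed straight through as `metadata=build_metadata("standard")` in the request shown above.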

The two tiers

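| Window              | Avg. turn time | Price  | Best for                                              |
|---------------------|----------------|--------|-------------------------------------------------------|
| Now (asap)          | Immediate      | Higher | Realtime calls, interactive UIs, human-in-the-loop    |
| Standard (standard) | ~5 min         | Lower  | Cost-optimized agents, async loops, batch work        |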
Actual turn times vary with the specific workload and the model you choose.
Each model-and-window price pairing is listed on the Pricing page.
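Since the end-to-end trajectory time is what matters for agents, it can help to turn the per-turn average into a rough trajectory estimate. The sketch below is back-of-envelope arithmetic under two loud assumptions: turns run strictly one after another, and each turn takes the approximate Standard average of 5 minutes from the table. `estimate_trajectory_minutes` is a hypothetical helper, and real turn times will vary.

```python
def estimate_trajectory_minutes(turns, avg_turn_minutes=5.0):
    """Rough wall-clock estimate for a trajectory of strictly sequential turns.

    avg_turn_minutes defaults to the approximate Standard-window average;
    this is an assumption for illustration, not a guarantee.
    """
    if turns < 0:
        raise ValueError("turns must be non-negative")
    return turns * avg_turn_minutes


# A 12-turn agent loop on the Standard window, under the assumptions above.
print(estimate_trajectory_minutes(12))  # 60.0
```

If that estimate exceeds what the workload can tolerate, the Now tier is the alternative, at the higher price.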

How each tier behaves

Now runs immediately on the fastest available hardware in a latency-optimized setup, at the higher on-demand rate. Use it for realtime, interactive requests where a person or another system is waiting on the result.
Standard is the default wherever a model supports it. It runs on Valar’s maximum-efficiency serving stack and targets roughly a five-minute average turn time across a balanced workload. Most of Valar’s published prices reference this window, and it’s the right default for async agent loops and batch jobs.
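The guidance above reduces to one question: is a person or another system blocked on the result? A minimal sketch of that decision follows; `choose_window` is a hypothetical helper, and the single boolean is a deliberate simplification of whatever signal your application actually has.

```python
def choose_window(caller_is_waiting):
    """Pick a completion window from whether something is blocked on the result.

    Hypothetical helper: "asap" (the Now tier) for interactive calls,
    "standard" for async agent loops and batch jobs.
    """
    return "asap" if caller_is_waiting else "standard"


# An interactive chat turn versus a background agent step.
interactive = choose_window(caller_is_waiting=True)
background = choose_window(caller_is_waiting=False)
```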

Default behavior

Leave completion_window off and the request defaults to standard when the model supports it; otherwise it falls back to the Now tier.
Explicitly choosing a window the model does not support fails with 400 invalid_request_error. The error message lists the model’s supported windows, and the Pricing page keeps a current support matrix.
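The defaulting rules can be modeled locally to predict which window a request will end up on. This is a sketch of the documented behavior, not Valar's implementation: `resolve_window` and `UnsupportedWindowError` are hypothetical names, and the `supported` set is something you would fill in yourself from the Pricing page's support matrix.

```python
class UnsupportedWindowError(ValueError):
    """Local stand-in for the service's 400 invalid_request_error."""


def resolve_window(requested, supported):
    """Mirror the documented defaulting rules for one model.

    requested: "asap", "standard", or None when completion_window is omitted.
    supported: set of windows the model supports (supplied by the caller).
    """
    if requested is None:
        # No window given: standard when supported, otherwise the Now tier.
        return "standard" if "standard" in supported else "asap"
    if requested not in supported:
        # An explicit but unsupported choice is rejected, listing what is supported.
        raise UnsupportedWindowError(
            f"{requested!r} is not supported; supported windows: {sorted(supported)}"
        )
    return requested
```

For a model that only supports the Now tier, `resolve_window(None, {"asap"})` returns `"asap"`, while `resolve_window("standard", {"asap"})` raises, matching the distinction between omitting the window and choosing an unsupported one.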