Most inference APIs were built for chat: one prompt, one response, optimized for the latency of a single call. But a growing share of AI work is asynchronous. Agents, batch jobs, labeling, data extraction, document processing, evaluation, and background workflows often run across hundreds or thousands of model calls. For these workloads, the goal is not the fastest first token; it is finishing the full job reliably, efficiently, and at the lowest possible cost.

How requests work

Valar speaks the OpenAI Responses API at /v1/responses, reachable under the base URL https://api.valarhq.ai/v1. Any OpenAI-compatible client works once you repoint its base URL and key. Because agent work is rarely interactive, the API leans on a background mode. Send background: true and the request returns a response id straight away instead of blocking; you then poll that id until the job reports completed. Pair that with completion windows to trade latency for price, and you can keep large batches in flight without holding open a connection per request.
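The submit-then-poll flow can be sketched in a few lines of Python. This is a minimal illustration, not Valar's SDK: the helper names are hypothetical, the retrieve callable stands in for whatever client call fetches a response by id, and the set of terminal statuses is assumed to mirror the OpenAI Responses API (only completed is confirmed above).

```python
import time
from typing import Callable

# Assumption: terminal statuses mirror the OpenAI Responses API.
# Only "completed" is named in this page; check the API reference for the rest.
TERMINAL_STATUSES = {"completed", "failed", "cancelled", "incomplete"}


def build_background_payload(model: str, prompt: str) -> dict:
    """Request body for POST /v1/responses with background mode enabled."""
    return {"model": model, "input": prompt, "background": True}


def wait_for_response(
    retrieve: Callable[[str], dict],
    response_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call retrieve(response_id) repeatedly until the status is terminal."""
    deadline = time.monotonic() + timeout_s
    while True:
        response = retrieve(response_id)
        if response.get("status") in TERMINAL_STATUSES:
            return response
        if time.monotonic() >= deadline:
            raise TimeoutError(f"response {response_id} did not finish in time")
        sleep(interval_s)
```

Passing retrieve as a callable keeps the loop independent of any particular client library; with an OpenAI-compatible client it would typically wrap that client's retrieve-by-id call. For large batches, a longer interval or a webhook avoids polling every job every few seconds.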
Authenticate with Authorization: Bearer <key>. Generate keys from the dashboard at app.valarhq.ai, and store them in the VALAR_API_KEY environment variable so the SDK and OpenAI clients pick them up automatically.
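As a rough sketch of what an authenticated call looks like on the wire, the snippet below builds (but does not send) a POST to /v1/responses using only the Python standard library. The function name is hypothetical, and the payload fields are assumed to follow the OpenAI Responses API shape; the base URL, header format, and environment variable come from the text above.

```python
import json
import os
import urllib.request
from typing import Optional

BASE_URL = "https://api.valarhq.ai/v1"


def build_responses_request(
    payload: dict, api_key: Optional[str] = None
) -> urllib.request.Request:
    """Build an authenticated POST to /v1/responses without sending it."""
    # Fall back to the VALAR_API_KEY environment variable if no key is passed.
    key = api_key or os.environ.get("VALAR_API_KEY")
    if not key:
        raise RuntimeError("Set VALAR_API_KEY or pass api_key explicitly")
    return urllib.request.Request(
        f"{BASE_URL}/responses",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

The returned request can be sent with urllib.request.urlopen(request). Reading the key from the environment keeps it out of source control.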

Choose your execution mode

How soon you need each result decides how you call Valar. The Inference modes page covers this in depth; in short:
| Mode | How you call it | Latency | Cost | Best for |
| --- | --- | --- | --- | --- |
| Realtime | Synchronous request on the Now window | Seconds | Highest | Interactive chat, prototyping |
| Async | background=True, then poll or wait on a webhook, on the Standard window | Minutes | Lower | Agent loops, background jobs |
| Batch | The Batches API on the Standard window | Up to hours | Lowest | Large datasets, evals, offline jobs |

Where to go next

Quickstart

Make your first API request with Valar