Introduction

Most inference APIs were built for chat: one prompt, one response, optimized for the latency of a single call. But a growing share of AI work is asynchronous. Agents, batch jobs, labeling, data extraction, document processing, evaluation, and background workflows often span hundreds or thousands of model calls. For these workloads, the goal is not the fastest first token but finishing the full job reliably, efficiently, and at the lowest possible cost.

How requests work

Valar speaks the OpenAI Responses API at /v1/responses, reachable under the base URL https://api.valarhq.ai/v1. Any OpenAI-compatible client works once you repoint its base URL and key. Because agent work is rarely interactive, the API leans on a background mode. Send background: true and the request returns a response id straight away instead of blocking; you then poll that id until the job reports a completed status. Pair that with completion windows to trade latency for price, and you can keep large batches in flight without holding a connection open per request.
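
The submit-then-poll flow described above can be sketched with the Python standard library alone. This is a minimal sketch under stated assumptions, not a definitive client: it assumes a response is retrieved with GET /responses/{id} and that the status field takes the values used by the OpenAI Responses API (queued, in_progress, completed, failed, cancelled, incomplete). The helper names (http_json, submit_background, wait_for) and the model name in the usage note are hypothetical.

```python
import json
import time
import urllib.request

BASE_URL = "https://api.valarhq.ai/v1"
# Assumed terminal statuses, mirroring the OpenAI Responses API.
TERMINAL = {"completed", "failed", "cancelled", "incomplete"}


def http_json(method, path, api_key, body=None):
    """Send one JSON request to the API and decode the JSON reply."""
    data = json.dumps(body).encode() if body is not None else None
    req = urllib.request.Request(
        BASE_URL + path,
        data=data,
        method=method,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def submit_background(send, model, prompt):
    """POST /responses with background set to true and return the response id."""
    reply = send("POST", "/responses",
                 {"model": model, "input": prompt, "background": True})
    return reply["id"]


def wait_for(send, response_id, interval=2.0, timeout=600.0, sleep=time.sleep):
    """Poll the response until it reaches a terminal status or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        reply = send("GET", f"/responses/{response_id}", None)
        if reply.get("status") in TERMINAL:
            return reply
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{response_id} did not finish within {timeout}s")
        sleep(interval)
```

To run it against the live API, bind the transport to a key, for example `send = lambda m, p, b: http_json(m, p, os.environ["VALAR_API_KEY"], b)` (after `import os`), then call `wait_for(send, submit_background(send, "example-model", "Summarize this document."))`. Passing `send` as a parameter keeps the polling loop independent of the HTTP layer, so it can be exercised without network access.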

Authenticate with Authorization: Bearer <key>. Generate keys from the dashboard at app.valarhq.ai, and store them in the VALAR_API_KEY environment variable so the SDK and OpenAI clients pick them up automatically.
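
As an illustration of the header format, the sketch below builds an authenticated request with the standard library. The helper name build_request is hypothetical; the key is read from VALAR_API_KEY unless one is passed explicitly, which avoids depending on which environment variable a given client reads by default.

```python
import os
import urllib.request

BASE_URL = "https://api.valarhq.ai/v1"


def build_request(path, api_key=None):
    """Return a urllib Request carrying the Bearer token for the given API path."""
    key = api_key or os.environ.get("VALAR_API_KEY")
    if not key:
        raise RuntimeError("Set VALAR_API_KEY or pass api_key explicitly")
    return urllib.request.Request(
        BASE_URL + path,
        headers={
            "Authorization": f"Bearer {key}",
            "Content-Type": "application/json",
        },
    )
```

Calling `build_request("/responses")` yields a request object that can be given a JSON body and sent with `urllib.request.urlopen`. The same base URL and key are what an OpenAI-compatible client would be configured with.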

Choose your execution mode

How soon you need each result decides how you call Valar. The Inference modes page covers this in depth; in short:

Mode	How you call it	Latency	Cost	Best for
Realtime	Synchronous request on the Now window	Seconds	Highest	Interactive chat, prototyping
Async	`background=True`, then poll or wait on a webhook, on the Standard window	Minutes	Lower	Agent loops, background jobs
Batch	The Batches API on the Standard window	Up to hours	Lowest	Large datasets, evals, offline jobs
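
The table above can be read as a simple decision rule on how long a caller can wait for a result. The sketch below encodes that rule; the function name pick_mode and both cutoffs (30 seconds and one hour) are illustrative assumptions chosen to match the "Seconds", "Minutes", and "Up to hours" latency bands, not documented limits.

```python
def pick_mode(max_wait_seconds, realtime_cutoff=30, async_cutoff=3600):
    """Map a latency budget to an execution mode.

    The cutoffs are illustrative assumptions, not values from the Valar docs.
    """
    if max_wait_seconds <= realtime_cutoff:
        return "realtime"  # synchronous request on the Now window
    if max_wait_seconds <= async_cutoff:
        return "async"     # background request, then poll or use a webhook
    return "batch"         # Batches API on the Standard window


for budget in (5, 600, 6 * 3600):
    print(budget, pick_mode(budget))
```

In practice the right cutoffs depend on the workload and on price, so treat the defaults as placeholders to tune.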

Where to go next

Quickstart

Make your first API request with Valar
