The three modes
Realtime
A normal synchronous request that returns the result immediately. You send the call withoutbackground and read the output from the response. This is the lowest-latency path, finishing in seconds, and it runs on the Now completion window.
Realtime works across the Responses API (/v1/responses), Chat Completions (/v1/chat/completions), and Messages (/v1/messages). Use it for interactive chat, prototyping, and human-in-the-loop steps.
Async
Setbackground=True on the Responses API. The create call returns a response id immediately, then you poll the retrieve endpoint or receive a webhook when the work finishes. Async runs typically finish in minutes with higher throughput and lower cost, especially on the Standard window.
Set the pace with metadata.completion_window: async jobs usually run on the Standard window (standard), the lower-cost default that targets roughly five minutes per turn. Use async for agent loops, background jobs, and large fan-out. See Sending requests at scale and Completion windows.
Batch
Coming soon. Batch processing isn’t generally available yet.
background=True — see Sending requests at scale.
Compare the modes
| Mode | How you call it | Typical latency | Cost | Best for |
|---|---|---|---|---|
| Realtime | Synchronous request, no background | Seconds | Highest | Interactive chat, prototyping, human-in-the-loop |
| Async | Responses API with background=True | Minutes | Lower | Agent loops, background jobs, large fan-out |
| Batch (coming soon) | Batches API, retrieve on completion | Up to hours | Lowest | Large datasets, evals, offline transforms |
How modes relate to completion windows
A completion window sets how much wall-clock time per turn your workload can tolerate, and you pay less the more time you give it. There are exactly two windows:- Now — immediate, on-demand, at a higher rate. API value
asap. - Standard — the default, targeting roughly 5 minutes, at a lower rate. API value
standard.
Next steps
Quickstart
Send your first request with an OpenAI-compatible client.
Sending requests at scale
Fan out async and batch work across many requests.
Completion windows
Trade turn latency for cost per token.