Today, we are adding two new service tiers to the Gemini API: Flex and Priority. These new options give you granular control over cost and reliability through a single, unified interface.
As AI evolves from simple chat into complex, autonomous agents, developers typically have to manage two distinct types of logic:
- Background tasks: High-volume workflows like data enrichment or “thinking” processes that don’t need instant responses.
- Interactive tasks: User-facing features like chatbots and copilots where high reliability is needed.
Until now, supporting both meant splitting your architecture between standard synchronous serving and the asynchronous Batch API. Flex and Priority help to bridge this gap. You can now route background jobs to Flex and interactive jobs to Priority, both using standard synchronous endpoints. This eliminates the complexity of async job management while giving you the economic and performance benefits of specialized tiers.
Flex Inference: scale innovation for 50% less
Flex Inference is our new cost-optimized tier, designed for latency-tolerant workloads without the overhead of batch processing.
- 50% price savings: Pay half the price of the Standard API by downgrading the criticality of your requests, accepting lower reliability and added latency in exchange.
- Synchronous simplicity: Unlike the Batch API, Flex is a synchronous interface. You use the same familiar endpoints without managing input/output files or polling for job completion.
- Ideal use cases: Background CRM updates, large-scale research simulations, and agentic workflows where the model “browses” or “thinks” in the background.
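Because Flex trades reliability for price, latency-tolerant callers should expect the occasional shed request. The sketch below shows one generic way to absorb that: a retry wrapper with exponential backoff and jitter. It is not part of the Gemini SDK — `with_backoff`, `TransientError`, and the assumption that a Flex rejection surfaces as a catchable transient error (e.g. an HTTP 429) are all illustrative.

```python
import random
import time

class TransientError(Exception):
    """Stand-in for a capacity-style rejection (e.g. HTTP 429) on a Flex call."""

def with_backoff(call_api, max_retries: int = 5, sleep=time.sleep):
    """Retry a flaky call with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return call_api()
        except TransientError:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            # Wait 1s, 2s, 4s, ... plus up to 1s of jitter before retrying.
            sleep(2 ** attempt + random.random())
```

A background CRM-enrichment job can afford those waits; an interactive copilot cannot, which is exactly why it belongs on Priority instead.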
Get started by configuring the service_tier parameter in your request:
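The original snippet is not preserved here, so the sketch below illustrates the idea rather than the exact SDK surface: it builds a generateContent-style request body and assumes service_tier is accepted as a top-level field with the values "flex" and "priority" (the helper name and request shape are assumptions).

```python
# Illustrative sketch only: `build_request` and the exact placement of
# `service_tier` in the body are assumptions, not the official SDK API.

def build_request(prompt: str, tier: str = "standard") -> dict:
    """Build a generateContent-style request body with a service tier."""
    if tier not in {"standard", "flex", "priority"}:
        raise ValueError(f"unknown service tier: {tier}")
    return {
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
        "service_tier": tier,  # assumed top-level field
    }

# Same endpoint, different tiers: background work -> Flex,
# user-facing work -> Priority.
background = build_request("Enrich this CRM record...", tier="flex")
interactive = build_request("Help me draft a reply.", tier="priority")
```

The point of the single parameter is that routing stays a one-line decision: no separate batch job files, uploads, or polling for either path.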