Generative AI on Vertex AI quotas and system limits

This document describes the different ways to use generative AI services on Vertex AI and the quota systems that apply to each.

Consumption and quota models

You can use generative AI services in two ways: pay-as-you-go (PayGo) or by reserving dedicated capacity with Provisioned Throughput. Depending on the model you use, your PayGo usage is subject to either a standard, per-project quota or a dynamic shared quota system.

The following list compares these options.

Pay-as-you-go (PayGo) with Standard Quota
  • Description: Per-project, per-region request limits for older models.
  • Pros: Predictable limits for individual projects.
  • Use case: Applications using non-Gemini or earlier Gemini models.

Pay-as-you-go (PayGo) with Dynamic Shared Quota (DSQ)
  • Description: Capacity is dynamically shared among all customers for a specific model and region. There are no per-project quotas to manage.
  • Pros: No need to manage quotas or request increases; flexible and scales automatically with available capacity.
  • Use case: Development, testing, and applications that can tolerate variability in performance. Recommended for most PayGo scenarios with newer models.

Provisioned Throughput
  • Description: Reserved, dedicated compute capacity for a specific model, paid for in advance.
  • Pros: Dedicated throughput, predictable high performance, and stable latency.
  • Use case: Production workloads requiring high availability and predictable service levels.
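
Because DSQ capacity is shared across all customers for a given model and region, individual PayGo requests can be rejected with a 429 RESOURCE_EXHAUSTED error during periods of high demand. The following is a minimal retry sketch, assuming the Vertex AI SDK for Python; the project ID, region, and model name are placeholders.

```python
# Minimal sketch: retry PayGo requests that hit 429 RESOURCE_EXHAUSTED under
# Dynamic Shared Quota. Assumes the Vertex AI SDK for Python
# (google-cloud-aiplatform); project, region, and model name are placeholders.
import random
import time

import vertexai
from google.api_core import exceptions
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel("gemini-2.0-flash")


def generate_with_backoff(prompt: str, max_attempts: int = 5) -> str:
    """Calls the model and retries 429 errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return model.generate_content(prompt).text
        except exceptions.ResourceExhausted:
            # DSQ is shared, so transient 429s can occur when demand is high.
            if attempt == max_attempts - 1:
                raise
            time.sleep(2 ** attempt + random.random())


print(generate_with_backoff("Summarize Dynamic Shared Quota in one sentence."))
```

For workloads that can't tolerate this variability, Provisioned Throughput is the intended alternative.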

Quota system by model

The following models support Dynamic shared quota (DSQ):

The following legacy Gemini models support DSQ:

  • Gemini 1.5 Pro
  • Gemini 1.5 Flash

Non-Gemini and earlier Gemini models use the standard quota system. For more information, see Vertex AI quotas and limits.

Feature-specific quotas and limits

This section describes quotas and limits for specific generative AI features.

Tuned model quotas

Tuned model inference shares the same quota as the base model. There is no separate quota for tuned model inference.

Each tuning job uses Vertex AI Pipelines. For more information, see Vertex AI Pipelines quotas and limits.

Text embedding limits

Each request can include up to 250 input texts (generating one embedding per input text) and up to 20,000 tokens. Only the first 2,048 tokens in each input text are used to compute the embeddings. For gemini-embedding-001, the quota is listed under the name gemini-embedding.
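
To embed more than 250 texts, split the input into multiple requests that each stay under the per-request limits. The following is a minimal batching sketch, assuming the Vertex AI SDK for Python; the model name and project values are placeholders, and very long inputs might also need smaller batches to stay under the 20,000-token per-request limit.

```python
# Minimal sketch: batch texts into requests of at most 250 inputs each.
# Assumes the Vertex AI SDK for Python; model name, project ID, and region
# are placeholders.
import vertexai
from vertexai.language_models import TextEmbeddingModel

vertexai.init(project="your-project-id", location="us-central1")
model = TextEmbeddingModel.from_pretrained("text-embedding-005")

MAX_TEXTS_PER_REQUEST = 250  # per-request input text limit from this page


def embed_all(texts: list[str]) -> list[list[float]]:
    """Embeds an arbitrarily long list of texts in batches of at most 250."""
    vectors = []
    for start in range(0, len(texts), MAX_TEXTS_PER_REQUEST):
        batch = texts[start : start + MAX_TEXTS_PER_REQUEST]
        # The service uses only the first 2,048 tokens of each input text.
        embeddings = model.get_embeddings(batch)
        vectors.extend(e.values for e in embeddings)
    return vectors
```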

Embed content input tokens per minute per base model

Unlike previous embedding models, which were primarily limited by requests-per-minute (RPM) quotas, the quota for the Gemini Embedding model limits the number of tokens that can be sent per minute per project.

Quota Value
Embed content input tokens per minute 5,000,000
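
One way to stay under a tokens-per-minute quota is to track, on the client side, how many input tokens were sent in the trailing 60-second window and pause before exceeding the limit. The following is a rough sketch; the quota value mirrors the table above, and how you count tokens per request (for example, with a token-counting call) is left as an assumption.

```python
# Rough client-side throttle for a tokens-per-minute quota.
# The quota value mirrors the table above; how tokens are counted per
# request is up to the caller.
import time
from collections import deque

TOKENS_PER_MINUTE_QUOTA = 5_000_000


class TokenRateLimiter:
    """Tracks tokens sent in the last 60 seconds and sleeps when needed."""

    def __init__(self, limit: int = TOKENS_PER_MINUTE_QUOTA):
        self.limit = limit
        self.window = deque()  # (timestamp, token_count) pairs

    def wait_for_budget(self, tokens: int) -> None:
        if tokens > self.limit:
            raise ValueError("A single request cannot exceed the quota.")
        while True:
            now = time.monotonic()
            # Drop entries older than 60 seconds.
            while self.window and now - self.window[0][0] > 60:
                self.window.popleft()
            used = sum(count for _, count in self.window)
            if used + tokens <= self.limit:
                self.window.append((now, tokens))
                return
            # Sleep until the oldest entry ages out of the window.
            time.sleep(60 - (now - self.window[0][0]) + 0.1)
```

Before each embedding request, call wait_for_budget() with that request's estimated input token count.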

Vertex AI Agent Engine limits

The following limits apply to Vertex AI Agent Engine for a given project in each region:
Description Limit
Create, delete, or update Vertex AI Agent Engine per minute 10
Create, delete, or update Vertex AI Agent Engine sessions per minute 100
Query or StreamQuery Vertex AI Agent Engine per minute 60
Append event to Vertex AI Agent Engine sessions per minute 300
Maximum number of Vertex AI Agent Engine resources 100
Create, delete, or update Vertex AI Agent Engine memory resources per minute 100
Get, list, or retrieve from Vertex AI Agent Engine Memory Bank per minute 300
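
Because a project is limited to 100 Agent Engine resources per region, it can help to check how many already exist before creating another. The following is a minimal sketch that assumes the agent_engines module of the Vertex AI SDK for Python; if your SDK version exposes this under a different namespace, adjust the import.

```python
# Minimal sketch: count existing Agent Engine resources before creating a new
# one, since each project is limited to 100 per region. Assumes the
# `agent_engines` module of the Vertex AI SDK for Python; project and region
# values are placeholders.
import vertexai
from vertexai import agent_engines

vertexai.init(project="your-project-id", location="us-central1")

AGENT_ENGINE_RESOURCE_LIMIT = 100  # maximum resources per project per region

existing = list(agent_engines.list())
if len(existing) >= AGENT_ENGINE_RESOURCE_LIMIT:
    raise RuntimeError(
        f"{len(existing)} Agent Engine resources already exist; "
        "delete unused ones before creating another."
    )
```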

Batch prediction

The quotas and limits for batch inference jobs are the same across all regions.

Concurrent batch inference job limits for Gemini models

There are no predefined quota limits on batch inference for Gemini models. Instead, the batch service provides access to a large, shared pool of resources that is dynamically allocated based on the model's real-time availability and on demand from all customers for that model. When more customers are active and saturate the model's capacity, your batch requests might be queued until capacity becomes available.
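
Because batch requests can be queued while shared capacity frees up, submit jobs asynchronously and poll for completion rather than expecting immediate results. The following is a minimal sketch assuming the batch prediction interface of the Vertex AI SDK for Python; the model name, BigQuery input table, and output prefix are placeholders.

```python
# Minimal sketch: submit a Gemini batch inference job and poll until it
# leaves the queue and finishes. Assumes the Vertex AI SDK for Python's
# batch prediction interface; model name, input table, and output prefix
# are placeholders.
import time

import vertexai
from vertexai.batch_prediction import BatchPredictionJob

vertexai.init(project="your-project-id", location="us-central1")

job = BatchPredictionJob.submit(
    source_model="gemini-2.0-flash-001",
    input_dataset="bq://your-project-id.your_dataset.prompts",
    output_uri_prefix="bq://your-project-id.your_dataset",
)

# Queued jobs stay pending until shared capacity is available, so poll
# instead of blocking on an immediate result.
while not job.has_ended:
    time.sleep(60)
    job.refresh()

print("Job state:", job.state)
```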

Concurrent batch inference job quotas for non-Gemini models

The following table lists the quotas for the number of concurrent batch inference jobs, which don't apply to Gemini models:
Quota Value
aiplatform.googleapis.com/textembedding_gecko_concurrent_batch_prediction_jobs 4
If the number of tasks submitted exceeds the allocated quota, the tasks are placed in a queue and processed when the quota capacity becomes available.

View and edit the quotas in the Google Cloud console

To view and edit the quotas in the Google Cloud console, do the following:
  1. In the Google Cloud console, go to the Quotas and System Limits page.

  2. To adjust the quota, copy and paste the property aiplatform.googleapis.com/textembedding_gecko_concurrent_batch_prediction_jobs into the Filter field, and then press Enter.
  3. Click the three dots at the end of the row, and select Edit quota.
  4. Enter a new quota value in the pane, and click Submit request.

Vertex AI RAG Engine

The following quotas apply to each service that performs retrieval-augmented generation (RAG) using Vertex AI RAG Engine. Quotas are measured in requests per minute (RPM).
Service Quota Metric
RAG Engine data management APIs 60 RPM VertexRagDataService requests per minute per region
RetrievalContexts API 1,500 RPM VertexRagService retrieve requests per minute per region
base_model: textembedding-gecko 1,500 RPM Online prediction requests per base model per minute per region per base_model

An additional filter that you can specify is base_model: textembedding-gecko.

The following limits apply:
Service Limit Metric
Concurrent ImportRagFiles requests 3 RPM VertexRagService concurrent import requests per region
Maximum number of files per ImportRagFiles request 10,000 VertexRagService import rag files requests per region
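
Imports larger than 10,000 files must be split across multiple ImportRagFiles requests, and no more than three imports can run at once per region. The following is a minimal sketch that splits the file list into chunks and imports them sequentially; it assumes the rag module in vertexai.preview, and the corpus name and Cloud Storage paths are placeholders.

```python
# Minimal sketch: split a large import into chunks of at most 10,000 files
# and run them sequentially to stay under the 3-concurrent-imports limit.
# Assumes the `rag` module in `vertexai.preview`; the corpus name and the
# Cloud Storage paths are placeholders.
import vertexai
from vertexai.preview import rag

vertexai.init(project="your-project-id", location="us-central1")

MAX_FILES_PER_IMPORT = 10_000  # per-request limit from the table above

corpus_name = "projects/your-project-id/locations/us-central1/ragCorpora/123"
paths = [f"gs://your-bucket/docs/file-{i}.pdf" for i in range(25_000)]

for start in range(0, len(paths), MAX_FILES_PER_IMPORT):
    chunk = paths[start : start + MAX_FILES_PER_IMPORT]
    # Running imports one at a time keeps this well under the limit of
    # 3 concurrent ImportRagFiles requests per region.
    rag.import_files(corpus_name, chunk)
```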

For more rate limits and quotas, see Generative AI on Vertex AI rate limits.

Gen AI evaluation service

The Gen AI evaluation service uses gemini-2.0-flash as a default judge model for model-based metrics. A single evaluation request for a model-based metric might result in multiple underlying requests to the Gen AI evaluation service. Each model's quota is calculated on a per-project basis, which means that any requests directed to gemini-2.0-flash for model inference and model-based evaluation contribute to the quota. Quotas for the Gen AI evaluation service and the underlying judge model are shown in the following table:
Request quota Default quota
Gen AI evaluation service requests per minute 1,000 requests per project per region
Online prediction requests per minute for base_model: gemini-2.0-flash See Quotas by region and model.

If you receive an error related to quotas while using the Gen AI evaluation service, you might need to file a quota increase request. See View and manage quotas for more information.

Limit Value
Gen AI evaluation service request timeout 60 seconds

When you use the Gen AI evaluation service for the first time in a new project, you might experience an initial setup delay of up to two minutes. If your first request fails, wait a few minutes and then retry. Subsequent evaluation requests typically complete within 60 seconds.
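
If the first evaluation request in a new project fails because of this setup delay, a simple wrapper that waits and retries is usually enough. The following sketch wraps a hypothetical run_evaluation callable; substitute your actual evaluation call.

```python
# Minimal sketch: retry the first Gen AI evaluation request in a new project,
# which can fail while initial setup (up to about two minutes) completes.
# run_evaluation is a hypothetical stand-in for your actual evaluation call.
import time


def evaluate_with_retry(run_evaluation, attempts: int = 3, wait_seconds: int = 120):
    """Runs the evaluation, waiting a couple of minutes between failed attempts."""
    for attempt in range(attempts):
        try:
            return run_evaluation()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(wait_seconds)
```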

The maximum input and output tokens for model-based metrics depend on the model used as the judge model. See Google models for a list of models.

What's next