This guide shows how to send requests to Anthropic's Claude models by using Vertex AI.
To use the Anthropic Claude models with Vertex AI, perform the following steps. Make sure that you have the required permissions to enable and use partner models. For more information, see Grant the required permissions.

1. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.
2. Verify that billing is enabled for your Google Cloud project.
3. Enable the Vertex AI API.
Anthropic recommends that you enable 30-day logging of your prompt and completion activity to record any model misuse. To enable logging, see Log requests and responses.

You can send requests to Claude models in several ways. The following table provides an overview of the available options to help you decide which is best for your use case.

You can use Anthropic's SDK or curl commands to send requests to the Vertex AI endpoint by using the following model names:
You can make API requests to the Anthropic Claude models by using the Anthropic Claude SDK.

The following code sample uses the Anthropic Vertex SDK to perform a streaming call to a Claude model. To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
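The streaming call described above can be sketched as follows. This is a minimal sketch, not the official sample: it assumes the `anthropic[vertex]` Python package and gcloud application-default credentials, and the model name, region, and project ID are placeholder examples.

```python
from typing import Any


def build_messages(prompt: str) -> list[dict[str, Any]]:
    """The first message must use the 'user' role."""
    return [{"role": "user", "content": prompt}]


def stream_claude(prompt: str, project_id: str, region: str = "us-east5") -> str:
    """Stream a Claude response chunk by chunk and return the accumulated text.

    Requires `pip install -U "anthropic[vertex]"` and gcloud
    application-default credentials.
    """
    from anthropic import AnthropicVertex  # imported lazily; not a stdlib module

    client = AnthropicVertex(project_id=project_id, region=region)
    chunks: list[str] = []
    with client.messages.stream(
        model="claude-3-7-sonnet@20250219",  # example model version with @ suffix
        max_tokens=1024,
        messages=build_messages(prompt),
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)  # tokens arrive incrementally
            chunks.append(text)
    return "".join(chunks)
```

Calling `stream_claude("Hello!", "your-project-id")` prints the response as it arrives and returns the full text.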
The following code sample uses the Anthropic Vertex SDK to perform a unary call to a Claude model. To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python. For more information, see the Python API reference documentation.
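A unary (non-streaming) call can be sketched in the same way; again this is a hedged sketch assuming the `anthropic[vertex]` package, with placeholder model, region, and project values.

```python
def ask_claude(prompt: str, project_id: str, region: str = "us-east5") -> str:
    """Make a single (unary) call to a Claude model and return the response text.

    Requires `pip install -U "anthropic[vertex]"` and gcloud
    application-default credentials.
    """
    from anthropic import AnthropicVertex  # imported lazily; not a stdlib module

    client = AnthropicVertex(project_id=project_id, region=region)
    message = client.messages.create(
        model="claude-3-7-sonnet@20250219",  # example model version with @ suffix
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    # The response content is a list of blocks; concatenate the text blocks.
    return "".join(block.text for block in message.content if block.type == "text")
```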
You can use a curl command to send a request to the Vertex AI endpoint for a specific Claude model. Anthropic Claude model versions must be used with a suffix that starts with an @ symbol (such as claude-3-7-sonnet@20250219 or claude-3-5-haiku@20241022) to guarantee consistent behavior.

The following section shows you how to create a curl command and includes a sample curl command. To test a text prompt by using the Vertex AI API, send a POST request to the publisher model endpoint.
Before using any of the request data, make the following replacements:

- MAX_TOKENS: Specify a lower value for shorter responses and a higher value for potentially longer responses.
- TEMPERATURE: Specify a lower value for less random responses and a higher value for more random responses.
- TOP_K: For each token selection step, the top-K tokens with the highest probabilities are sampled. Tokens are then further filtered based on top-P, with the final token selected using temperature sampling.
- TOP_P: Specify a lower value for less random responses and a higher value for more random responses.
HTTP method and URL:
Request JSON body:
To send your request, choose one of these options:
Save the request body in a file named request.json, and run the corresponding curl or PowerShell command. You should receive a JSON response similar to the following.

The Anthropic Claude models support tool use and function calling to enhance a model's capabilities. For more information, see the Tool use overview in the Anthropic documentation.

The following samples demonstrate how to use tools with an SDK or a curl command. The samples search for nearby restaurants in San Francisco that are open.
open. To learn how to install or update the Vertex AI SDK for Python, see Install the Vertex AI SDK for Python.
For more information, see the
Python API reference documentation.
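The tool-use sample described above can be sketched with the Anthropic Vertex SDK as follows. This is a minimal sketch, not the official sample: it assumes the `anthropic[vertex]` package, and the tool schema mirrors the text_search_places_api tool shown in the REST sample later in this guide.

```python
# Tool schema mirroring the text_search_places_api tool from the REST sample.
TOOLS = [
    {
        "name": "text_search_places_api",
        "description": "Returns information about a set of places based on a string",
        "input_schema": {
            "type": "object",
            "properties": {
                "textQuery": {
                    "type": "string",
                    "description": "The text string on which to search",
                },
                "openNow": {
                    "type": "boolean",
                    "description": "Whether a place is open at the time of the query.",
                },
            },
            "required": ["textQuery"],
        },
    }
]


def ask_for_restaurants(project_id: str, region: str = "us-east5"):
    """Send a tool-enabled request; requires gcloud application-default credentials."""
    from anthropic import AnthropicVertex  # pip install -U "anthropic[vertex]"

    client = AnthropicVertex(project_id=project_id, region=region)
    return client.messages.create(
        model="claude-3-7-sonnet@20250219",  # example model version
        max_tokens=1024,
        tools=TOOLS,
        messages=[{
            "role": "user",
            "content": "What are some affordable and good Italian restaurants "
                       "that are open now in San Francisco?",
        }],
    )
```

When the model decides to call the tool, the returned message contains a tool_use content block with the generated input arguments.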
Before using any of the request data, make the following replacements:

- MAX_TOKENS: Specify a lower value for shorter responses and a higher value for potentially longer responses.
HTTP method and URL:
Request JSON body:
To send your request, choose one of these options:
Save the request body in a file named request.json, and run the corresponding curl or PowerShell command. You should receive a JSON response similar to the following.
Before you begin
You must enable the Vertex AI API (aiplatform.googleapis.com) to use Vertex AI. If you already have a project with the Vertex AI API enabled, you can use that project instead of creating a new one.
Choose an interaction method
Method | Description | Use case |
---|---|---|
Anthropic SDK | Programmatic access using the official Python or TypeScript SDKs. | Integrating Claude models into your applications. |
curl command | Send requests directly to the REST API endpoint from your command line. | Quick testing, scripting, and environments without an SDK. |
Vertex AI Studio | A web-based UI in the Google Cloud console for interactive prompting. | Rapid prototyping, model comparison, and no-code exploration. |
claude-opus-4-1@20250805
claude-opus-4@20250514
claude-sonnet-4@20250514
claude-3-7-sonnet@20250219
claude-3-5-sonnet-v2@20241022
claude-3-5-haiku@20241022
claude-3-5-sonnet@20240620
claude-3-opus@20240229
claude-3-haiku@20240307
Anthropic Claude model versions must be used with a suffix that starts with an @ symbol (such as claude-3-7-sonnet@20250219 or claude-3-5-haiku@20241022) to guarantee consistent behavior.
Use the Anthropic SDK
Make a streaming call to a Claude model using the Anthropic Vertex SDK
Python
Make a unary call to a Claude model using the Anthropic Vertex SDK
Python
Use a curl command
REST
- ROLE: The role associated with a message. You can specify a user or an assistant. The first message must use the user role. Claude models operate with alternating user and assistant turns. If the final message uses the assistant role, then the response content continues immediately from the content in that message. You can use this to constrain part of the model's response.
- STREAM: Set to true to stream the response and false to return the response all at once.
- CONTENT: The content, such as text, of the user or assistant message.
- TOP_P: For example, if the top-P value is 0.5 and tokens A and B fall within that cumulative probability but token C does not, then the model will select either A or B as the next token by using temperature and excludes C as a candidate.
- TOP_K: A top-K of 1 means the next selected token is the most probable among all tokens in the model's vocabulary (also called greedy decoding), while a top-K of 3 means that the next token is selected from among the three most probable tokens by using temperature.
- TYPE: To enable extended thinking, set to enabled.
- BUDGET_TOKENS: Must be greater than or equal to 1024 but less than MAX_TOKENS.
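The top-K, top-P, and temperature parameters described above can be sketched as the following sampling pipeline. This is a minimal illustration of the described behavior (top-K filter, then top-P filter, then temperature-weighted selection), not Vertex AI's actual implementation.

```python
import math
import random


def sample_next_token(logits: dict[str, float], top_k: int, top_p: float,
                      temperature: float, rng: random.Random) -> str:
    """Illustrative top-K -> top-P -> temperature token selection."""
    # Convert logits to probabilities (softmax).
    m = max(logits.values())
    exp = {tok: math.exp(v - m) for tok, v in logits.items()}
    total = sum(exp.values())
    probs = {tok: v / total for tok, v in exp.items()}
    # 1. Keep only the top-K most probable tokens.
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:top_k]
    # 2. Filter by top-P: keep tokens until cumulative probability reaches top_p.
    kept, cum = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cum += p
        if cum >= top_p:
            break
    # 3. Re-weight the survivors with temperature and sample one token.
    weights = [p ** (1.0 / temperature) for _, p in kept]
    return rng.choices([tok for tok, _ in kept], weights=weights, k=1)[0]


rng = random.Random(0)
logits = {"A": 2.0, "B": 1.0, "C": 0.1}
token = sample_next_token(logits, top_k=2, top_p=1.0, temperature=0.7, rng=rng)
# A top-K of 1 is greedy decoding: only the most probable token can be chosen.
greedy = sample_next_token(logits, top_k=1, top_p=1.0, temperature=0.7, rng=rng)
```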
POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/anthropic/models/MODEL:streamRawPredict
{
"anthropic_version": "vertex-2023-10-16",
"messages": [
{
"role": "ROLE",
"content": "CONTENT"
}],
"max_tokens": MAX_TOKENS,
"stream": STREAM,
"thinking": {
"type": "TYPE",
"budget_tokens": BUDGET_TOKENS
}
}
curl

Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/anthropic/models/MODEL:streamRawPredict"

PowerShell
Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/anthropic/models/MODEL:streamRawPredict" | Select-Object -Expand Content

Example curl command
MODEL_ID="MODEL"
LOCATION="us-central1"
PROJECT_ID="PROJECT_ID"
curl \
-X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json" \
https://${LOCATION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${LOCATION}/publishers/anthropic/models/${MODEL_ID}:streamRawPredict -d \
'{
"anthropic_version": "vertex-2023-10-16",
"messages": [{
"role": "user",
"content": "Hello!"
}],
"max_tokens": 50,
"stream": true}'
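The curl example above can also be reproduced programmatically with only the Python standard library; this is a hedged sketch that shells out to gcloud for the access token (as the curl command does) and assumes placeholder project and model values.

```python
import json
import subprocess
import urllib.request


def call_claude_rest(project_id: str, location: str = "us-central1",
                     model: str = "claude-3-5-haiku@20241022") -> dict:
    """POST a non-streaming request to the rawPredict endpoint.

    Requires the gcloud CLI to be installed and authenticated.
    """
    token = subprocess.check_output(
        ["gcloud", "auth", "print-access-token"], text=True
    ).strip()
    url = (f"https://{location}-aiplatform.googleapis.com/v1/projects/{project_id}"
           f"/locations/{location}/publishers/anthropic/models/{model}:rawPredict")
    body = {
        "anthropic_version": "vertex-2023-10-16",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 50,
        "stream": False,
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json; charset=utf-8",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())
```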
Tool use (function calling)
Python
REST
The following sample uses regional endpoints. To use the global endpoint, see Specify
the global endpoint.
- ROLE: The role associated with a message. You can specify a user or an assistant. The first message must use the user role. Claude models operate with alternating user and assistant turns. If the final message uses the assistant role, then the response content continues immediately from the content in that message. You can use this to constrain part of the model's response.
- STREAM: Set to true to stream the response and false to return the response all at once.
- CONTENT: The content, such as text, of the user or assistant message.

POST https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/anthropic/models/MODEL:rawPredict
{
"anthropic_version": "vertex-2023-10-16",
"max_tokens": MAX_TOKENS,
"stream": STREAM,
"tools": [
{
"name": "text_search_places_api",
"description": "Returns information about a set of places based on a string",
"input_schema": {
"type": "object",
"properties": {
"textQuery": {
"type": "string",
"description": "The text string on which to search"
},
"priceLevels": {
"type": "array",
"description": "Price levels to query places, value can be one of [PRICE_LEVEL_INEXPENSIVE, PRICE_LEVEL_MODERATE, PRICE_LEVEL_EXPENSIVE, PRICE_LEVEL_VERY_EXPENSIVE]"
},
"openNow": {
"type": "boolean",
"description": "Describes whether a place is open for business at the time of the query."
}
},
"required": ["textQuery"]
}
}
],
"messages": [
{
"role": "user",
"content": "What are some affordable and good Italian restaurants that are open now in San Francisco?"
}
]
}
curl

Save the request body in a file named request.json, and execute the following command:
curl -X POST \
-H "Authorization: Bearer $(gcloud auth print-access-token)" \
-H "Content-Type: application/json; charset=utf-8" \
-d @request.json \
"https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/anthropic/models/MODEL:rawPredict"

PowerShell
Save the request body in a file named request.json, and execute the following command:
$cred = gcloud auth print-access-token
$headers = @{ "Authorization" = "Bearer $cred" }
Invoke-WebRequest `
-Method POST `
-Headers $headers `
-ContentType: "application/json; charset=utf-8" `
-InFile request.json `
-Uri "https://LOCATION-aiplatform.googleapis.com/v1/projects/PROJECT_ID/locations/LOCATION/publishers/anthropic/models/MODEL:rawPredict" | Select-Object -Expand Content
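When the model responds with a tool_use block, the conversation continues by sending the tool's output back in a tool_result block on the next user turn. The following sketch shows the shape of that second request body; the tool_use_id, tool input, and place data here are hypothetical values that would come from the model's response and from actually calling the search API.

```python
import json

# Hypothetical values: in a real flow, tool_use_id and the tool input come from
# the model's tool_use response block, and the result comes from the places API.
tool_use_id = "toolu_example_123"
tool_output = {"places": [{"name": "Trattoria Example", "openNow": True}]}

followup = {
    "anthropic_version": "vertex-2023-10-16",
    "max_tokens": 1024,
    "messages": [
        {
            "role": "user",
            "content": "What are some affordable and good Italian restaurants "
                       "that are open now in San Francisco?",
        },
        # Echo the assistant's tool_use turn...
        {
            "role": "assistant",
            "content": [{
                "type": "tool_use",
                "id": tool_use_id,
                "name": "text_search_places_api",
                "input": {"textQuery": "Italian restaurants San Francisco",
                          "openNow": True},
            }],
        },
        # ...then supply the tool result in the next user turn.
        {
            "role": "user",
            "content": [{
                "type": "tool_result",
                "tool_use_id": tool_use_id,
                "content": json.dumps(tool_output),
            }],
        },
    ],
}
```

Sending this body to the same rawPredict endpoint lets the model compose a final answer from the tool output.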
Use Vertex AI Studio
For some of the Anthropic Claude models, you can use Vertex AI Studio to prototype and test generative AI models in the Google Cloud console. As an example, you can use Vertex AI Studio to compare Claude model responses with other supported models such as Google Gemini.
For more information, see Quickstart: Send text prompts to Gemini using Vertex AI Studio.
Anthropic Claude quotas and region availability
Claude models have regional quotas and, for models that support a global endpoint, a global quota. The quota is specified in queries per minute (QPM) and tokens per minute (TPM). TPM includes both input and output tokens.
Maximum quotas can vary by account to maintain service performance and ensure acceptable use. In some cases, access might be restricted. To view your project's quotas, see the Quotas & System Limits page in the Google Cloud console. The following quotas also apply:
- online_prediction_requests_per_base_model and global_online_prediction_requests_per_base_model define your QPM quota.

For TPM, there are three quota values that apply to certain models:

- For models that count input and output tokens together, online_prediction_tokens_per_minute_per_base_model and global_online_prediction_tokens_per_minute_per_base_model define the model TPM quota.
- For models that count input and output tokens separately, online_prediction_input_tokens_per_minute_per_base_model and global_online_prediction_input_tokens_per_minute_per_base_model define the input TPM quota, and online_prediction_output_tokens_per_minute_per_base_model and global_online_prediction_output_tokens_per_minute_per_base_model define the output TPM quota.

To see which models count input and output tokens separately, see Quotas by model and region.
Input
The following list defines the input tokens that can count towards your input TPM quota. The input tokens that each model counts can vary. To see which input tokens a model counts, see Quotas by model and region.
- Input tokens include all input tokens, including cache read and cache write tokens.
- Uncached input tokens include only the input tokens that weren't read from a cache (that is, cache read tokens are excluded).
- Cache write tokens include tokens that were used to create or update a cache.
Quotas by model and region
The following table shows the default quotas and supported context length for each model in each region.
Model | Region | Quotas | Context length |
---|---|---|---|
Claude Opus 4.1 | us-east5 | | 200,000 |
Claude Opus 4.1 | global endpoint | | 200,000 |
Claude Opus 4 | us-east5 | | 200,000 |
Claude Opus 4 | global endpoint | | 200,000 |
Claude Sonnet 4 | us-east5 | | 200,000 |
Claude Sonnet 4 | europe-west1 | | 200,000 |
Claude Sonnet 4 | asia-east1 | | 200,000 |
Claude Sonnet 4 | global endpoint | | 200,000 |
Claude 3.7 Sonnet | us-east5 | | 200,000 |
Claude 3.7 Sonnet | europe-west1 | | 200,000 |
Claude 3.7 Sonnet | global endpoint | | 200,000 |
Claude 3.5 Sonnet v2 | us-east5 | | 200,000 |
Claude 3.5 Sonnet v2 | europe-west1 | | 200,000 |
Claude 3.5 Sonnet v2 | global endpoint | | 200,000 |
Claude 3.5 Haiku | us-east5 | | 200,000 |
Claude 3.5 Haiku | europe-west1 | | 200,000 |
Claude 3.5 Sonnet | us-east5 | | 200,000 |
Claude 3.5 Sonnet | europe-west1 | | 200,000 |
Claude 3.5 Sonnet | asia-southeast1 | | 200,000 |
Claude 3 Opus | us-east5 | | 200,000 |
Claude 3 Haiku | us-east5 | | 200,000 |
Claude 3 Haiku | europe-west1 | | 200,000 |
Claude 3 Haiku | asia-southeast1 | | 200,000 |
If you want to increase any of your quotas for Generative AI on Vertex AI, you can use the Google Cloud console to request a quota increase. To learn more about quotas, see Work with quotas.