This page explains how to format a dataset for supervised fine-tuning of a Gemini model. You can tune text, image, audio, and document data types.
This page covers the following topics:
- About supervised fine-tuning datasets: Learn about the purpose of training and validation datasets.
- Dataset format: Choose between using a managed Vertex AI dataset or the JSONL format.
- Dataset structure and fields: Review the required structure and parameters for your data.
- Best practices for data creation: Learn how to create high-quality tuning data.
- Upload the dataset: Upload your prepared dataset file to a Cloud Storage bucket.
About supervised fine-tuning datasets
A supervised fine-tuning dataset adapts a pre-trained model to a specific task or ___domain. The input data should be similar to what you expect the model to encounter in real-world use, and the output labels should represent the correct answers for each input.
Your dataset must include a training dataset, and we recommend that you also include a validation dataset:
- Training dataset: The data used to tune the model. For best results, start with at least 100 examples and increase the dataset size if needed. The quality of your data is more important than the quantity.
- Validation dataset: A dataset used to measure the effectiveness of a tuning job. We recommend that you provide a validation dataset.
For limitations on datasets, such as maximum token counts and file sizes, see About supervised fine-tuning for Gemini models.
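If you keep all of your examples in a single JSONL file, one way to produce separate training and validation files is to shuffle and split the examples yourself. The following sketch is illustrative; the file names and the 90/10 split ratio are assumptions, not requirements of the tuning service.

```python
import json
import random

# Illustrative only: the file names and the 90/10 split are assumptions.
random.seed(42)

with open("tuning_examples.jsonl") as f:
    examples = [json.loads(line) for line in f if line.strip()]

random.shuffle(examples)
split = int(len(examples) * 0.9)
train, validation = examples[:split], examples[split:]

for path, subset in [("train.jsonl", train), ("validation.jsonl", validation)]:
    with open(path, "w") as f:
        for example in subset:
            f.write(json.dumps(example) + "\n")
```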
Dataset format
You can provide your tuning dataset in one of the following formats:
| Option | Description | Use case |
|---|---|---|
| Multimodal dataset on Vertex AI (preview) | A managed dataset in Vertex AI that supports various data types and provides data management features. For more information, see Multimodal dataset on Vertex AI. | Recommended for managing large or complex multimodal datasets within the Google Cloud ecosystem. |
| JSON Lines (JSONL) | A text file where each line is a separate JSON object representing a single training example. The file must be uploaded to a Cloud Storage bucket. | A simple, flexible format suitable for text-based or simple multimodal tasks, especially when data is prepared outside of Google Cloud. |
Dataset structure and fields
This section describes the JSONL data format. Each line in the JSONL file is a single training example with the following structure:
{
  "systemInstruction": {
    "role": string,
    "parts": [
      {
        "text": string
      }
    ]
  },
  "contents": [
    {
      "role": string,
      "parts": [
        {
          // Union field data can be only one of the following:
          "text": string,
          "fileData": {
            "mimeType": string,
            "fileUri": string
          }
        }
      ]
    }
  ]
}
Each JSON object can contain the following fields:
- `systemInstruction`: (Optional) An instruction for the model to steer its behavior, such as "Answer as concisely as possible." See Supported models. The `text` strings count toward the token limit. The `role` field is ignored.
- `contents`: (Required) A conversation with the model. For single-turn queries, this is a single object. For multi-turn queries, this is a repeated field that contains the conversation history and the latest request. Each `contents` object contains the following:
  - `role`: (Optional) The author of the message. Supported values are:
    - `user`: The message is from the user.
    - `model`: The message is from the model. This is used to provide context in multi-turn conversations.
  - `parts`: (Required) A list of ordered parts that make up a single message. Each part can have a different IANA MIME type. A part can be one of the following types:
    - `text`: A text prompt or code snippet.
    - `fileData`: Data stored in a file, specified by a `mimeType` and a `fileUri` that points to a file in Cloud Storage.
    - `functionCall`: A call to a function, containing the function's name and parameters. See Function calling.
    - `functionResponse`: The result of a `functionCall`, used as context for the model. See Function calling.
- `tools`: (Optional) A set of tools the model can use to interact with external systems. See Function calling.
For limits on inputs, such as the maximum number of tokens or images, see the model specifications on the Google models page. To compute the number of tokens in your request, see Get token count.
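For a rough pre-check of text prompts, you can count tokens with the Vertex AI SDK for Python before you start a tuning job. The following is a minimal sketch; the project ID, ___location, and model name are assumptions that you would replace with your own values, and it counts only the text shown, not file parts.

```python
import vertexai
from vertexai.generative_models import GenerativeModel

# Placeholders: replace the project ID, ___location, and model name with your own values.
vertexai.init(project="your-project-id", ___location="us-central1")
model = GenerativeModel("gemini-1.5-flash-002")

prompt = "question: What is the return policy? context: Orders can be returned within 30 days."
response = model.count_tokens(prompt)
print(f"Input tokens: {response.total_tokens}")
```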
Dataset example
Each conversation example in a tuning dataset is composed of a required `messages` field and an optional `context` field.

The `messages` field consists of an array of role-content pairs:

- The `role` field refers to the author of the message and is set to either `system`, `user`, or `model`. The `system` role is optional and can occur only as the first element of the messages list. The `user` and `model` roles are required and can repeat in an alternating manner.
- The `content` field is the content of the message.

For each example, the maximum token length for `context` and `messages` combined is 131,072 tokens. Additionally, each `content` field for the `model` role shouldn't exceed 8,192 tokens.
{
  "messages": [
    {
      "role": string,
      "content": string
    }
  ]
}
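Before you upload a dataset in this format, you might want to verify the role ordering rules described above. The following sketch checks structure only; the file name is a placeholder, and the script doesn't measure token counts.

```python
import json

# Sketch of a structural check for the messages format described above.
# It verifies role order only; it does not count tokens. The file name is a placeholder.
VALID_ROLES = {"system", "user", "model"}

def check_example(example: dict) -> list[str]:
    problems = []
    messages = example.get("messages", [])
    for i, message in enumerate(messages):
        role = message.get("role")
        if role not in VALID_ROLES:
            problems.append(f"message {i}: unknown role {role!r}")
        if role == "system" and i != 0:
            problems.append(f"message {i}: 'system' is only allowed as the first message")
    non_system = [m.get("role") for m in messages if m.get("role") != "system"]
    for i in range(1, len(non_system)):
        if non_system[i] == non_system[i - 1]:
            problems.append(f"'user' and 'model' messages should alternate (position {i})")
    return problems

with open("tuning_examples.jsonl") as f:
    for line_number, line in enumerate(f, start=1):
        for problem in check_example(json.loads(line)):
            print(f"line {line_number}: {problem}")
```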
Best practices for data creation
Follow prompt design best practices
Your training data should follow the best practices for prompt design. Each example should provide a detailed description of the task and the desired output format.
Maintain consistency with production data
The examples in your datasets should match your expected production traffic. If your dataset contains specific formatting, keywords, instructions, or information, the production data should be formatted in the same way and contain the same instructions.
For example, if the examples in your dataset include a `"question:"` and a `"context:"`, production traffic should also be formatted to include a `"question:"` and a `"context:"` in the same order as they appear in the dataset examples. If you exclude the context, the model won't recognize the pattern, even if the exact question was in an example in the dataset.
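One way to keep training data and production traffic in sync is to build prompts with the same helper function in both places. The following is a hypothetical sketch; the helper name and field labels are illustrative.

```python
# A hypothetical helper that formats prompts the same way for tuning data and
# for production traffic, so the "question:" / "context:" pattern stays consistent.
def build_prompt(question: str, context: str) -> str:
    return f"question: {question}\ncontext: {context}"

# Used when creating training examples...
training_text = build_prompt(
    "What is the warranty period?",
    "All devices include a 2-year warranty.",
)

# ...and again, unchanged, when formatting live requests at serving time.
user_question = "How do I reset my device?"                           # placeholder runtime value
retrieved_context = "Hold the power button for 10 seconds to reset."  # placeholder runtime value
production_text = build_prompt(user_question, retrieved_context)
```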
Upload the dataset to Cloud Storage
To run a tuning job, you need to upload one or more datasets to a Cloud Storage bucket. You can either create a new Cloud Storage bucket or use an existing one to store dataset files. The region of the bucket doesn't matter, but we recommend that you use a bucket that's in the same Google Cloud project where you plan to tune your model.
After your bucket is ready, upload your dataset file to the bucket.
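For example, you can upload the files with the Cloud Storage client library for Python. This is a minimal sketch; the project ID, bucket name, and object paths are placeholders.

```python
from google.cloud import storage

# Placeholders: replace the project ID, bucket name, and file names with your own values.
client = storage.Client(project="your-project-id")
bucket = client.bucket("your-tuning-bucket")

for filename in ["train.jsonl", "validation.jsonl"]:
    blob = bucket.blob(f"tuning-data/{filename}")
    blob.upload_from_filename(filename)
    print(f"Uploaded gs://{bucket.name}/{blob.name}")
```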
What's next
- Choose a region to tune a model.
- To learn how supervised fine-tuning can be used in a solution that builds a generative AI knowledge base, see Jump Start Solution: Generative AI knowledge base.