Prepare supervised fine-tuning data for Gemini models

This page explains how to format a dataset for supervised fine-tuning of a Gemini model. You can tune text, image, audio, and document data types.

This page covers the following topics:

  • About supervised fine-tuning datasets
  • Dataset format
  • Dataset structure and fields
  • Dataset example
  • Best practices for data creation
  • Upload the dataset to Cloud Storage

About supervised fine-tuning datasets

A supervised fine-tuning dataset adapts a pre-trained model to a specific task or ___domain. The input data should be similar to what you expect the model to encounter in real-world use, and the output labels should represent the correct answers for each input.

A complete dataset includes a training dataset and, optionally, a validation dataset:

  • Training dataset: The data used to tune the model. For best results, start with at least 100 examples and increase the dataset size if needed. The quality of your data is more important than the quantity.
  • Validation dataset: The data used to measure the effectiveness of the tuning job. We recommend that you provide a validation dataset.

For limitations on datasets, such as maximum token counts and file sizes, see About supervised fine-tuning for Gemini models.

Dataset format

You can provide your tuning dataset in one of the following formats:

  • Multimodal dataset on Vertex AI (preview): A managed dataset in Vertex AI that supports various data types and provides data management features. Recommended for managing large or complex multimodal datasets within the Google Cloud ecosystem. For more information, see Multimodal dataset on Vertex AI.
  • JSON Lines (JSONL): A text file where each line is a separate JSON object that represents a single training example. The file must be uploaded to a Cloud Storage bucket. A simple, flexible format that's suitable for text-based or simple multimodal tasks, especially when the data is prepared outside of Google Cloud.
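
For example, a minimal text-only JSONL file with two training examples might look like the following. The prompts and responses are illustrative placeholders, and each example must occupy exactly one line. The structure of each JSON object is described in the next section.

{"contents": [{"role": "user", "parts": [{"text": "Summarize: The quick brown fox jumps over the lazy dog."}]}, {"role": "model", "parts": [{"text": "A fox jumps over a dog."}]}]}
{"contents": [{"role": "user", "parts": [{"text": "Translate to French: Good morning."}]}, {"role": "model", "parts": [{"text": "Bonjour."}]}]}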

Dataset structure and fields

This section describes the JSONL data format. Each line in the JSONL file is a single training example with the following structure:

{
  "systemInstruction": {
    "role": string,
    "parts": [
      {
        "text": string
      }
    ]
  },
  "contents": [
    {
      "role": string,
      "parts": [
        {
          // Union field data can be only one of the following:
          "text": string,
          "fileData": {
            "mimeType": string,
            "fileUri": string
          }
        }
      ]
    }
  ]
}

Each JSON object can contain the following fields:

  • systemInstruction: (Optional) An instruction that steers the model's behavior, such as "Answer as concisely as possible." For the list of models that support system instructions, see Supported models. The text strings count toward the token limit. The role field is ignored.
  • contents: (Required) A conversation with the model. For single-turn queries, this is a single object. For multi-turn queries, this is a repeated field that contains the conversation history and the latest request. Each content object contains the following:
    • role: (Optional) The author of the message. Supported values are:
      • user: The message is from the user.
      • model: The message is from the model. This is used to provide context in multi-turn conversations.
    • parts: (Required) An ordered list of parts that make up a single message. Each part can have a different IANA MIME type. A part can be one of the following types:
      • text: A text prompt or code snippet.
      • fileData: Data stored in a file, specified by mimeType and a fileUri pointing to a file in Cloud Storage.
      • functionCall: A call to a function, containing the function's name and parameters. See Function calling.
      • functionResponse: The result of a functionCall, used as context for the model. See Function calling.
  • tools: (Optional) A set of tools the model can use to interact with external systems. See Function calling.

For limits on inputs, such as the maximum number of tokens or images, see the model specifications on the Google models page. To compute the number of tokens in your request, see Get token count.
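
For reference, the following sketch shows what a single multimodal training example might look like with a system instruction and an image stored in Cloud Storage. The bucket path, MIME type, and all text are hypothetical placeholders:

{
  "systemInstruction": {
    "parts": [
      {"text": "You write concise, one-sentence image captions."}
    ]
  },
  "contents": [
    {
      "role": "user",
      "parts": [
        {"fileData": {"mimeType": "image/jpeg", "fileUri": "gs://my-bucket/images/example.jpg"}},
        {"text": "Caption this image."}
      ]
    },
    {
      "role": "model",
      "parts": [
        {"text": "A golden retriever catches a frisbee in a sunny park."}
      ]
    }
  ]
}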

Dataset example

Each conversation example in a tuning dataset is composed of a required contents field and an optional systemInstruction field.

The contents field consists of an array of role-parts pairs:

  • The role field refers to the author of the message and is set to either user or model. The user and model roles are required and can repeat in an alternating manner. System-level instructions belong in the separate systemInstruction field.
  • The parts field contains the content of the message.

For each example, the maximum token length for systemInstruction and contents combined is 131,072 tokens. Additionally, each model turn shouldn't exceed 8,192 tokens.

The minimal structure of an example looks like the following:

{
  "contents": [
    {
      "role": "user",
      "parts": [{"text": string}]
    },
    {
      "role": "model",
      "parts": [{"text": string}]
    }
  ]
}
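
For example, a hypothetical multi-turn example for a customer-support use case might look like the following. All text is an illustrative placeholder:

{
  "contents": [
    {
      "role": "user",
      "parts": [{"text": "My order hasn't arrived yet."}]
    },
    {
      "role": "model",
      "parts": [{"text": "I'm sorry to hear that. Could you share your order number?"}]
    },
    {
      "role": "user",
      "parts": [{"text": "It's ORDER-12345."}]
    },
    {
      "role": "model",
      "parts": [{"text": "Thank you. Order ORDER-12345 shipped yesterday and should arrive within two business days."}]
    }
  ]
}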

Best practices for data creation

Follow prompt design best practices

Your training data should follow the best practices for prompt design. Each example should provide a detailed description of the task and the desired output format.

Maintain consistency with production data

The examples in your datasets should match your expected production traffic. If your dataset contains specific formatting, keywords, instructions, or information, the production data should be formatted in the same way and contain the same instructions.

For example, if the examples in your dataset include a "question:" and a "context:", production traffic should also be formatted to include a "question:" and a "context:" in the same order as they appear in the dataset examples. If you exclude the context, the model won't recognize the pattern, even if the exact question appeared in a dataset example.
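
As a sketch, a training example that follows this "question:" and "context:" convention might look like the following. The question, context, and answer are illustrative placeholders:

{
  "contents": [
    {
      "role": "user",
      "parts": [{"text": "question: When was the Eiffel Tower built? context: The Eiffel Tower was constructed between 1887 and 1889 for the 1889 World's Fair in Paris."}]
    },
    {
      "role": "model",
      "parts": [{"text": "The Eiffel Tower was built between 1887 and 1889."}]
    }
  ]
}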

Upload the dataset to Cloud Storage

To run a tuning job, you need to upload one or more dataset files to a Cloud Storage bucket. You can either create a new Cloud Storage bucket or use an existing one to store dataset files. The region of the bucket doesn't matter, but we recommend that you use a bucket that's in the same Google Cloud project where you plan to tune your model.

After your bucket is ready, upload your dataset file to the bucket.
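
For example, assuming you have the Google Cloud CLI installed, you can upload a local dataset file with the gcloud storage cp command. The bucket name and paths are placeholders:

gcloud storage cp dataset.jsonl gs://BUCKET_NAME/tuning/dataset.jsonl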

What's next

  • Start a tuning job. See Tune Gemini models by using supervised fine-tuning.