
Multimodal

We've provided a few sample files that you can use to test out these features. All the examples below use these files.

  • Image: An image of some blueberry plants (image.jpg)
  • Audio: A recording of the original Gettysburg Address (gettysburg.wav)
  • PDF: A sample PDF file containing a fake invoice (invoice.pdf)

Instructor provides a unified, provider-agnostic interface for working with multimodal inputs like images, PDFs, and audio files.

With Instructor's multimodal objects, you can easily load media from URLs, local files, or base64 strings using a consistent API that works across different AI providers (OpenAI, Anthropic, Mistral, etc.).

Instructor handles all the provider-specific formatting requirements behind the scenes, ensuring your code remains clean and future-proof as provider APIs evolve. Let's see how to use the Image, Audio and PDF classes.
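
As a rough sketch, the shared constructors look like this (the URLs and local paths below are placeholders, not real endpoints):

from instructor.multimodal import Image, Audio, PDF

# Each class exposes the same loading interface, regardless of provider
image = Image.from_url("https://example.com/image.jpg")  # placeholder URL
audio = Audio.from_path("./gettysburg.wav")  # assumes a local copy of the sample file
pdf = PDF.from_url("https://example.com/invoice.pdf")  # placeholder URL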

Image

This class represents an image that can be loaded from a URL or file path. It provides a set of methods for creating Image instances from different sources (e.g. URLs, file paths, and base64 strings). The following table shows which methods are supported for each provider.

| Method         | OpenAI | Anthropic | Google GenAI |
| -------------- | ------ | --------- | ------------ |
| from_url()     | ✓      | ✓         | ✓            |
| from_path()    | ✓      | ✓         | ✓            |
| from_base64()  | ✓      | ✓         | ✓            |
| autodetect()   | ✓      | ✓         | ✓            |

We also support Anthropic Prompt Caching for images with the `ImageWithCacheControl` class, shown in the usage section below.

Usage

The Image class abstracts away the differences between provider formats, allowing you to work with a unified interface.

You can create an Image instance from a URL or file path using the from_url or from_path methods. The Image class will automatically convert the image to a base64-encoded string and include it in the API request.

import instructor
from instructor.multimodal import Image
import openai
from pydantic import BaseModel


class ImageDescription(BaseModel):
    description: str
    items: list[str]


# Use our sample image provided above.
url = "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/image.jpg"

client = instructor.from_openai(openai.OpenAI())

response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=ImageDescription,
    messages=[
        {
            "role": "user",
            "content": [
                "What is in this image?",
                Image.from_url(url),
            ],
        }
    ],
)

print(response)
# > description='A bush with numerous clusters of blueberries surrounded by green leaves, under a cloudy sky.' items=['blueberries', 'green leaves', 'cloudy sky']
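
If you've saved the image locally, from_path and from_base64 work the same way. A minimal sketch, assuming a local copy at ./image.jpg and that from_base64 accepts a data-URI string:

import base64
from instructor.multimodal import Image

# From a local file (assumes ./image.jpg exists)
image = Image.from_path("./image.jpg")

# From a base64 data URI, built here from the same local file
with open("./image.jpg", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")
image = Image.from_base64(f"data:image/jpeg;base64,{encoded}")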

We also provide an autodetect_images keyword argument that allows you to pass URLs or file paths as plain strings when you set it to True.

You can see an example below.

import instructor
from instructor.multimodal import Image
import openai
from pydantic import BaseModel


class ImageDescription(BaseModel):
    description: str
    items: list[str]


# Use our sample image provided above.
url = "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/image.jpg"

client = instructor.from_openai(openai.OpenAI())

response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=ImageDescription,
    autodetect_images=True,  # Set this to True
    messages=[
        {
            "role": "user",
            "content": ["What is in this image?", url],
        }
    ],
)

print(response)
# > description='A bush with numerous clusters of blueberries surrounded by green leaves, under a cloudy sky.' items=['blueberries', 'green leaves', 'cloudy sky']

If you'd like to use Anthropic prompt caching with images, we provide the ImageWithCacheControl object to do so. Simply use the from_image_params method and you'll be able to leverage Anthropic's prompt caching.

import instructor
from instructor.multimodal import ImageWithCacheControl
import anthropic
from pydantic import BaseModel


class ImageDescription(BaseModel):
    description: str
    items: list[str]


# Use our sample image provided above.
url = "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/image.jpg"

client = instructor.from_anthropic(anthropic.Anthropic())

response, completion = client.chat.completions.create_with_completion(
    model="claude-3-5-sonnet-20240620",
    response_model=ImageDescription,
    messages=[
        {
            "role": "user",
            "content": [
                "What is in this image?",
                ImageWithCacheControl.from_image_params(
                    {
                        "source": url,
                        "cache_control": {
                            "type": "ephemeral",
                        },
                    }
                ),
            ],
        }
    ],
    max_tokens=1000,
)

print(response)
# > description='A bush with numerous clusters of blueberries surrounded by green leaves, under a cloudy sky.' items=['blueberries', 'green leaves', 'cloudy sky']

print(completion.usage.cache_creation_input_tokens)
# > 1820

By leveraging Instructor's multimodal capabilities, you can focus on building your application logic without worrying about the intricacies of each provider's image handling format. This not only saves development time but also makes your code more maintainable and adaptable to future changes in AI provider APIs.

Audio

Note: Only OpenAI and Gemini support audio files at the moment. For Gemini, we pass in the raw bytes for this feature. If you'd like to use the Files API instead, we support that too; the Using Files section below shows how to work with the google-genai Files API.

Similar to the Image class, we provide methods to create Audio instances.

| Method      | OpenAI | Google GenAI |
| ----------- | ------ | ------------ |
| from_url()  | ✓      | ✓            |
| from_path() | ✓      | ✓            |

The Audio class represents an audio file that can be loaded from a URL or file path. It provides methods to create Audio instances using the from_path and from_url methods.

The Audio class will automatically convert the audio to the right format and include it in the API request.

from openai import OpenAI
from pydantic import BaseModel
import instructor
from instructor.multimodal import Audio

# Initialize the client
client = instructor.from_openai(OpenAI())


# Define our response model
class AudioDescription(BaseModel):
    summary: str
    transcript: str


url = "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/gettysburg.wav"

# Make the API call with the audio file
resp = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    response_model=AudioDescription,
    modalities=["text"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                "Extract the following information from the audio:",
                Audio.from_url(url),
            ],
        },
    ],
)

print(resp)
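
If you'd rather load the recording from disk, from_path works the same way. A minimal sketch, assuming you've saved the sample file locally as ./gettysburg.wav:

from openai import OpenAI
from pydantic import BaseModel
import instructor
from instructor.multimodal import Audio

# Initialize the client
client = instructor.from_openai(OpenAI())


# Define our response model
class AudioDescription(BaseModel):
    summary: str
    transcript: str


# Same request as before, but loading the audio from a local file
resp = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    response_model=AudioDescription,
    modalities=["text"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                "Extract the following information from the audio:",
                Audio.from_path("./gettysburg.wav"),
            ],
        },
    ],
)

print(resp)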

PDF

The PDF class represents a PDF file that can be loaded from a URL or file path.

It provides methods to create PDF instances and is currently supported for OpenAI, Mistral, GenAI and Anthropic client integrations.

| Method        | OpenAI | Anthropic | Google GenAI | Mistral |
| ------------- | ------ | --------- | ------------ | ------- |
| from_url()    | ✓      | ✓         | ✓            | ✓       |
| from_path()   | ✓      | ✓         | ✓            | ✓       |
| from_base64() | ✓      | ✓         | ✓            | ✓       |
| autodetect()  | ✓      | ✓         | ✓            | ✓       |

For Gemini, we also provide two additional methods, available on the PDFWithGenaiFile object, that make it easy to work with the google-genai files package.

For Anthropic, you can enable caching with the PDFWithCacheControl object. Note that this has caching configured by default for easy usage.

We provide examples of how to use all three object classes below.

Usage

from openai import OpenAI
import instructor
from pydantic import BaseModel
from instructor.multimodal import PDF

# Set up the client
url = "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/invoice.pdf"
client = instructor.from_openai(OpenAI())


# Create a model for analyzing PDFs
class Invoice(BaseModel):
    total: float
    items: list[str]


# Load and analyze a PDF
response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=Invoice,
    messages=[
        {
            "role": "user",
            "content": [
                "Analyze this document",
                PDF.from_url(url),
            ],
        }
    ],
)

print(response)
# > Total = 220, items = ['English Tea', 'Tofu']
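
The remaining constructors work the same way. A minimal sketch, assuming the invoice has been saved locally as ./invoice.pdf and that from_base64 accepts a raw base64 string:

import base64
from instructor.multimodal import PDF

# From a local file
pdf = PDF.from_path("./invoice.pdf")

# From a base64-encoded string, built here from the same file
with open("./invoice.pdf", "rb") as f:
    encoded = base64.b64encode(f.read()).decode("utf-8")
pdf = PDF.from_base64(encoded)

# autodetect() accepts a URL, a local path, or base64 data and picks the loader
pdf = PDF.autodetect("./invoice.pdf")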

Caching

If you'd like to cache the PDF for Anthropic, we provide the PDFWithCacheControl class which has caching configured by default.

from anthropic import Anthropic
import instructor
from pydantic import BaseModel
from instructor.multimodal import PDFWithCacheControl

# Set up the client
url = "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/invoice.pdf"
client = instructor.from_anthropic(Anthropic())


# Create a model for analyzing PDFs
class Invoice(BaseModel):
    total: float
    items: list[str]


# Load and analyze a PDF
response, completion = client.chat.completions.create_with_completion(
    model="claude-3-5-sonnet-20240620",
    response_model=Invoice,
    messages=[
        {
            "role": "user",
            "content": [
                "Analyze this document",
                PDFWithCacheControl.from_url(url),
            ],
        }
    ],
    max_tokens=1000,
)

print(response)
# > Total = 220, items = ['English Tea', 'Tofu']

print(completion.usage.cache_creation_input_tokens)
# > 2091
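
On a subsequent request within Anthropic's cache window, the cached PDF tokens should be read back rather than re-created. A quick way to verify this, reusing the client, url, and Invoice model from above (cache_read_input_tokens is Anthropic's usage field for cache hits):

# A second, identical request made shortly afterwards should hit the cache
_, completion2 = client.chat.completions.create_with_completion(
    model="claude-3-5-sonnet-20240620",
    response_model=Invoice,
    messages=[
        {
            "role": "user",
            "content": [
                "Analyze this document",
                PDFWithCacheControl.from_url(url),
            ],
        }
    ],
    max_tokens=1000,
)

print(completion2.usage.cache_read_input_tokens)
# Expect a non-zero value once the cache entry exists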

Using Files

We also provide a convenient wrapper around the Files API, allowing you to reference files you've already uploaded or to upload new files, blocking the main thread until the upload completes.

In the example below, we download the sample PDF and then upload it using the Files API provided by the google-genai SDK.

from google.genai import Client
import instructor
from pydantic import BaseModel
from instructor.multimodal import PDFWithGenaiFile
import requests

# Set up the client
url = "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/invoice.pdf"
client = instructor.from_genai(Client())

with requests.get(url) as response:
    pdf_data = response.content
    with open("./invoice.pdf", "wb") as f:
        f.write(pdf_data)


# Create a model for analyzing PDFs
class Invoice(BaseModel):
    total: float
    items: list[str]


# Load and analyze a PDF
response = client.chat.completions.create(
    model="gemini-2.0-flash",
    response_model=Invoice,
    messages=[
        {
            "role": "user",
            "content": [
                "Analyze this document",
                PDFWithGenaiFile.from_new_genai_file(
                    file_path="./invoice.pdf",
                    retry_delay=10,
                    max_retries=20,
                ),
            ],
        }
    ],
)

print(response)
# > Total = 220, items = ['English Tea', 'Tofu']

If you've already uploaded your file ahead of time, we support that too. Just provide us with the file name, as seen below.

from google.genai import Client
import instructor
from pydantic import BaseModel
from instructor.multimodal import PDFWithGenaiFile
import requests

# Set up the client
url = "https://raw.githubusercontent.com/instructor-ai/instructor/main/tests/assets/invoice.pdf"
client = instructor.from_genai(Client())

with requests.get(url) as response:
    pdf_data = response.content
    with open("./invoice.pdf", "wb") as f:
        f.write(pdf_data)

file = client.files.upload(
    file="invoice.pdf",
)


# Create a model for analyzing PDFs
class Invoice(BaseModel):
    total: float
    items: list[str]


# Load and analyze a PDF
response = client.chat.completions.create(
    model="gemini-2.0-flash",
    response_model=Invoice,
    messages=[
        {
            "role": "user",
            "content": [
                "Analyze this document",
                PDFWithGenaiFile.from_existing_genai_file(file_name=file.name),
            ],
        }
    ],
)

print(response)
# > Total = 220, items = ['English Tea', 'Tofu']

This way you have more granular control over how the file is uploaded, and you can even process multiple file uploads at once.
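
For example, you could upload several documents concurrently with a thread pool and then reference each one by name. This is a sketch, assuming the local PDF paths exist; the upload call mirrors the client.files.upload usage shown above:

from concurrent.futures import ThreadPoolExecutor

from google.genai import Client
from instructor.multimodal import PDFWithGenaiFile

genai_client = Client()

# Hypothetical local files, adjust the paths to your own documents
paths = ["./invoice.pdf", "./invoice_2.pdf", "./invoice_3.pdf"]

# Upload the files concurrently rather than one at a time
with ThreadPoolExecutor() as pool:
    files = list(pool.map(lambda p: genai_client.files.upload(file=p), paths))

# Each uploaded file can now be referenced by name in a request
pdfs = [PDFWithGenaiFile.from_existing_genai_file(file_name=f.name) for f in files]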