Skip to content

Audio Information Extraction with OpenAI

This example demonstrates how to use Instructor with OpenAI's audio capabilities to extract structured information from audio files. The example shows how to process audio input and extract specific fields into a Pydantic model.

Prerequisites

  • OpenAI API key with access to GPT-4 audio models
  • An audio file in WAV format
  • Instructor library installed with OpenAI support

Code Example

from openai import OpenAI
from pydantic import BaseModel
import instructor
from instructor.multimodal import Audio
import base64

# Initialize the OpenAI client with Instructor
client = instructor.from_openai(OpenAI())

# Define the structure for extracted information
class Person(BaseModel):
    name: str
    age: int

# Read and encode the audio file
with open("./output.wav", "rb") as f:
    encoded_string = base64.b64encode(f.read()).decode("utf-8")

# Extract information from the audio
resp = client.chat.completions.create(
    model="gpt-4-audio-preview",
    response_model=Person,
    modalities=["text"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[
        {
            "role": "user",
            "content": [
                "Extract the following information from the audio",
                Audio.from_path("./output.wav"),
            ],
        },
    ],
)

print(resp)
# Example output: Person(name='Jason', age=20)

How It Works

  1. First, we import the necessary libraries including the Audio class from instructor.multimodal.

  2. We define a Pydantic model Person that specifies the structure of the information we want to extract from the audio:

  3. name: The person's name
  4. age: The person's age

  5. The audio file is read and encoded in base64 format.

  6. We use OpenAI's audio-capable model to process the audio and extract the specified information:

  7. The model parameter specifies the GPT-4 audio model
  8. response_model tells Instructor to structure the output according to our Person model
  9. modalities specifies that we want text output
  10. The audio parameter configures audio-specific settings
  11. In the message, we use Audio.from_path() to include the audio file

  12. The response is automatically parsed into our Pydantic model, making the extracted information easily accessible in a structured format.

Use Cases

This pattern is particularly useful for:

  • Transcribing and extracting information from recorded interviews
  • Processing voice messages or audio notes
  • Automated form filling from voice input
  • Voice-based data entry systems

Tips

  • Ensure your audio file is in a supported format (WAV in this example)
  • The audio model works best with clear speech and minimal background noise
  • Consider the length of the audio file, as there may be model-specific limitations
  • Structure your Pydantic model to match the information you expect to extract