Audio Information Extraction with OpenAI¶
This example demonstrates how to use Instructor with OpenAI's audio capabilities to extract structured information from audio files. The example shows how to process audio input and extract specific fields into a Pydantic model.
Prerequisites¶
- OpenAI API key with access to GPT-4 audio models
- An audio file in WAV format
- Instructor library installed with OpenAI support
Code Example¶
from openai import OpenAI
from pydantic import BaseModel
import instructor
from instructor.multimodal import Audio
import base64
# Initialize the OpenAI client with Instructor
client = instructor.from_openai(OpenAI())
# Define the structure for extracted information
class Person(BaseModel):
name: str
age: int
# Read and encode the audio file
with open("./output.wav", "rb") as f:
encoded_string = base64.b64encode(f.read()).decode("utf-8")
# Extract information from the audio
resp = client.chat.completions.create(
model="gpt-4-audio-preview",
response_model=Person,
modalities=["text"],
audio={"voice": "alloy", "format": "wav"},
messages=[
{
"role": "user",
"content": [
"Extract the following information from the audio",
Audio.from_path("./output.wav"),
],
},
],
)
print(resp)
# Example output: Person(name='Jason', age=20)
How It Works¶
-
First, we import the necessary libraries including the
Audio
class frominstructor.multimodal
. -
We define a Pydantic model
Person
that specifies the structure of the information we want to extract from the audio: name
: The person's name-
age
: The person's age -
The audio file is read and encoded in base64 format.
-
We use OpenAI's audio-capable model to process the audio and extract the specified information:
- The
model
parameter specifies the GPT-4 audio model response_model
tells Instructor to structure the output according to ourPerson
modelmodalities
specifies that we want text output- The
audio
parameter configures audio-specific settings -
In the message, we use
Audio.from_path()
to include the audio file -
The response is automatically parsed into our Pydantic model, making the extracted information easily accessible in a structured format.
Use Cases¶
This pattern is particularly useful for:
- Transcribing and extracting information from recorded interviews
- Processing voice messages or audio notes
- Automated form filling from voice input
- Voice-based data entry systems
Tips¶
- Ensure your audio file is in a supported format (WAV in this example)
- The audio model works best with clear speech and minimal background noise
- Consider the length of the audio file, as there may be model-specific limitations
- Structure your Pydantic model to match the information you expect to extract