Getting Started with Structured Outputs

Large language models (LLMs) are powerful tools for generating text, but extracting structured data from their outputs can be challenging. Structured outputs solve this problem by having LLMs return data in consistent, machine-readable formats.

The Problem with Unstructured Outputs

Let's look at what happens when we ask an LLM to extract information without any structure:

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user",
            "content": "Extract customer: John Doe, age 35, email: [email protected]",
        }
    ],
)

print(response.choices[0].message.content)

The output might look like:

Customer Name: John Doe
Age: 35
Email: [email protected]

Or it could be:

I found the following customer information:
- Name: John Doe
- Age: 35
- Email address: [email protected]

This inconsistency makes it difficult to reliably parse the information in downstream applications.
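Without a guaranteed format, you end up writing brittle parsing code. Here is a rough sketch of what that looks like (it reuses the `response` object from the snippet above; the regular expressions are illustrative, not a recommended approach):

import re

text = response.choices[0].message.content

# These patterns only match the "Name: ..." layout; if the model replies with
# a bulleted list or different wording, they silently return None.
name_match = re.search(r"Name:\s*(.+)", text)
age_match = re.search(r"Age:\s*(\d+)", text)

name = name_match.group(1).strip() if name_match else None
age = int(age_match.group(1)) if age_match else None

Every new phrasing the model invents means another pattern to maintain, and silent failures are easy to miss.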

The Solution: Structured Outputs with Instructor

Instructor solves this problem by using Pydantic models to define the expected structure of the output:

import instructor
from openai import OpenAI
from pydantic import BaseModel, Field, EmailStr

class Customer(BaseModel):
    name: str = Field(description="Customer's full name")
    age: int = Field(description="Customer's age in years", ge=0, le=120)
    email: EmailStr = Field(description="Customer's email address")

client = instructor.from_openai(OpenAI())
customer = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user",
            "content": "Extract customer: John Doe, age 35, email: [email protected]",
        }
    ],
    response_model=Customer,  # This is the key part
)

print(customer)  # Customer(name='John Doe', age=35, email='[email protected]')
print(f"Name: {customer.name}, Age: {customer.age}, Email: {customer.email}")

The benefits of this approach include:

  1. Consistency: Always get data in the same format
  2. Validation: Age must be between 0 and 120, and the email must be a valid address (see the sketch after this list)
  3. Type Safety: age is always an integer, not a string
  4. Documentation: Model fields are self-documenting with descriptions
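To see the validation point in action, here is a small sketch that reuses the Customer model defined above and makes no API call at all; constructing the model with bad data raises a ValidationError:

from pydantic import ValidationError

try:
    Customer(name="John Doe", age=250, email="not-an-email")
except ValidationError as e:
    print(e)  # reports that age exceeds 120 and that the email is not valid

Instructor uses these same validation errors when a model response does not fit the schema, and can re-ask the model rather than failing immediately (see its `max_retries` option).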

Complex Example: Nested Structures

Instructor shines with complex data structures:

from typing import List, Optional
from pydantic import BaseModel, Field
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

class Address(BaseModel):
    street: str
    city: str
    state: str
    zip_code: str

class Contact(BaseModel):
    email: Optional[str] = None
    phone: Optional[str] = None

class Person(BaseModel):
    name: str
    age: int
    occupation: str
    address: Address
    contact: Contact
    skills: List[str] = Field(description="List of professional skills")

person = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {
            "role": "user",
            "content": """
        Extract detailed information for this person:
        John Smith is a 42-year-old software engineer living at 123 Main St, San Francisco, CA 94105.
        His email is [email protected] and phone is 555-123-4567.
        John is skilled in Python, JavaScript, and cloud architecture.
        """,
        }
    ],
    response_model=Person,
)

print(f"Name: {person.name}")
print(f"Location: {person.address.city}, {person.address.state}")
print(f"Skills: {', '.join(person.skills)}")

Installation

To get started with Instructor, install it via pip:

pip install instructor pydantic

If you use EmailStr (as in the example above), also install Pydantic's email extra:

pip install "pydantic[email]"

You'll also need to set up an API key for the LLM provider you're using.
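For OpenAI, the SDK reads the key from the OPENAI_API_KEY environment variable. A minimal sketch, assuming the key is already exported in your shell:

import os
from openai import OpenAI

# The OpenAI SDK picks up OPENAI_API_KEY automatically; passing it explicitly
# here just makes the dependency visible.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])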

Next Steps

In the next sections, you'll learn how to:

  1. Create your first extraction
  2. Understand the different response models you can create
  3. Set up clients for various LLM providers