List Extraction¶

This guide explains how to extract lists (arrays) of structured data using Instructor. Lists are one of the most useful patterns for extracting multiple similar items from text.

Basic List Extraction¶

To extract a list of items, you define a model for a single item and then use Python's typing system to specify you want a list of that type:

from typing import List
from pydantic import BaseModel, Field
import instructor
from openai import OpenAI

# Initialize the client
client = instructor.from_openai(OpenAI())

# Define a single item model
class Person(BaseModel):
    name: str = Field(..., description="The person's full name")
    age: int = Field(..., description="The person's age in years")

# Define a wrapper model for the list
class PeopleList(BaseModel):
    people: List[Person] = Field(..., description="List of people mentioned in the text")

# Extract the list
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": """
        Here's information about some people:
        - John Smith is 35 years old
        - Mary Johnson is 28 years old
        - Robert Davis is 42 years old
        """}
    ],
    response_model=PeopleList
)

# Access the extracted data
for i, person in enumerate(response.people):
    print(f"Person {i+1}: {person.name}, {person.age} years old")

This example shows how to: 1. Define a model for a single item (Person) 2. Create a wrapper model that contains a list of items (PeopleList) 3. Access each item in the list through the response

Direct List Extraction¶

You can also extract a list directly without a wrapper model:

from typing import List
from pydantic import BaseModel, Field
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

class Book(BaseModel):
    title: str
    author: str
    publication_year: int

# Extract a list directly
books = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": """
        Classic novels:
        1. To Kill a Mockingbird by Harper Lee (1960)
        2. 1984 by George Orwell (1949)
        3. The Great Gatsby by F. Scott Fitzgerald (1925)
        """}
    ],
    response_model=List[Book]  # Direct list extraction
)

# Access the extracted data
for book in books:
    print(f"{book.title} by {book.author} ({book.publication_year})")

Nested Lists¶

You can extract nested lists by combining list types:

from typing import List
from pydantic import BaseModel, Field
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

class Author(BaseModel):
    name: str
    nationality: str

class Book(BaseModel):
    title: str
    authors: List[Author]  # Nested list of authors
    publication_year: int

# Extract data with nested lists
books = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": """
        Book 1: "Good Omens" (1990)
        Authors: Terry Pratchett (British), Neil Gaiman (British)

        Book 2: "The Talisman" (1984)
        Authors: Stephen King (American), Peter Straub (American)
        """}
    ],
    response_model=List[Book]
)

# Access the nested data
for book in books:
    author_names = ", ".join([author.name for author in book.authors])
    print(f"{book.title} ({book.publication_year}) by {author_names}")

Using Streaming with Lists¶

You can stream list extraction results using Instructor's streaming capabilities:

from typing import List
import instructor
from openai import OpenAI
from pydantic import BaseModel, Field

client = instructor.from_openai(OpenAI())

class Task(BaseModel):
    description: str
    priority: str
    deadline: str

# Stream a list of tasks
for task in client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "Generate a list of 5 sample tasks for a project manager"}
    ],
    response_model=List[Task],
    stream=True
):
    print(f"Received task: {task.description} (Priority: {task.priority}, Deadline: {task.deadline})")

For more information on streaming, see the Streaming Basics and Streaming Lists guides.

List Validation¶

You can add validation for both individual items and the entire list:

from typing import List
from pydantic import BaseModel, Field, field_validator, model_validator
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

class Product(BaseModel):
    name: str
    price: float

    @field_validator('price')
    @classmethod
    def validate_price(cls, v):
        if v <= 0:
            raise ValueError("Price must be greater than zero")
        return v

class ProductList(BaseModel):
    products: List[Product] = Field(..., min_items=1)

    @model_validator(mode='after')
    def validate_unique_names(self):
        names = [p.name for p in self.products]
        if len(names) != len(set(names)):
            raise ValueError("All product names must be unique")
        return self

# Extract list with validation
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "List of products: Headphones ($50), Speakers ($80), Earbuds ($30)"}
    ],
    response_model=ProductList
)

For more on validation, see Field Validation and Validation Basics.

List Constraints¶

You can add constraints to lists using Pydantic's Field:

from typing import List
from pydantic import BaseModel, Field
import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

class Ingredient(BaseModel):
    name: str
    amount: str

class Recipe(BaseModel):
    title: str
    ingredients: List[Ingredient] = Field(
        ...,
        min_items=2,         # Minimum 2 ingredients
        max_items=10,        # Maximum 10 ingredients
        description="List of ingredients needed for the recipe"
    )
    steps: List[str] = Field(
        ...,
        min_items=1,
        description="Step-by-step instructions to prepare the recipe"
    )

Real-world Example: Task Extraction¶

Here's a more complete example for extracting a list of tasks from a meeting transcript:

from typing import List, Optional
from pydantic import BaseModel, Field
import instructor
from openai import OpenAI
from datetime import date

client = instructor.from_openai(OpenAI())

class Assignee(BaseModel):
    name: str
    email: Optional[str] = None

class ActionItem(BaseModel):
    description: str = Field(..., description="The task that needs to be completed")
    assignee: Assignee = Field(..., description="The person responsible for the task")
    due_date: Optional[date] = Field(None, description="The deadline for the task")
    priority: str = Field(..., description="Priority level: Low, Medium, or High")

# Extract action items from meeting notes
action_items = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "user", "content": """
        Meeting Notes - Project Kickoff
        Date: 2023-05-15

        Attendees: John (john@example.com), Sarah (sarah@example.com), Mike

        Discussion points:
        1. John will prepare the project timeline by next Friday. This is high priority.
        2. Sarah needs to contact the client for requirements clarification by Wednesday. Medium priority.
        3. Mike is responsible for setting up the development environment. Due by tomorrow, high priority.
        """}
    ],
    response_model=List[ActionItem]
)

# Process the extracted action items
for item in action_items:
    due_str = item.due_date.isoformat() if item.due_date else "Not specified"
    print(f"Task: {item.description}")
    print(f"Assignee: {item.assignee.name} ({item.assignee.email or 'No email'})")
    print(f"Due: {due_str}, Priority: {item.priority}")
    print("---")

For a more detailed example, see the Action Items Extraction example.

Simple Object Extraction - Extracting single objects
Nested Structure - Working with complex nested data
Streaming Lists - Streaming list results
Lists and Arrays - Concepts related to list extraction

Next Steps¶

Learn about Nested Structure for complex data
Explore Streaming Lists for handling large lists
Check out Field Validation for validation techniques