# Your First LLM Extraction: Structured Outputs Tutorial
Learn how to extract structured data from LLMs using Instructor in this hands-on tutorial. We'll build a simple yet powerful example that demonstrates how to transform unstructured text into validated Python objects using GPT-4, Claude, or any supported LLM.
## Quick Start: Extract Structured Data from LLMs
This LLM tutorial shows you how to extract structured information from natural language. We'll parse a person's name and age - a perfect starting point for understanding Instructor's power:
```python
from pydantic import BaseModel
import instructor
from openai import OpenAI

# 1. Define your data model for LLM extraction
class Person(BaseModel):
    name: str
    age: int

# 2. Initialize Instructor with your LLM provider
client = instructor.from_openai(OpenAI())

# 3. Extract structured data from the LLM
person = client.chat.completions.create(
    model="gpt-3.5-turbo",  # Works with GPT-4, Claude, Gemini, etc.
    response_model=Person,  # Type-safe extraction
    messages=[
        {"role": "user", "content": "John Doe is 30 years old"}
    ],
)

# 4. Use validated, structured data from the LLM
print(f"Name: {person.name}, Age: {person.age}")
# Output: Name: John Doe, Age: 30
```
## How Instructor LLM Extraction Works

```
┌─────────────┐    ┌──────────────┐    ┌─────────────┐
│   Define    │ -> │ Instruct LLM │ -> │  Get Typed  │
│  Structure  │    │  to Extract  │    │  Response   │
└─────────────┘    └──────────────┘    └─────────────┘
```
Understanding the LLM structured output pipeline:
### Step 1: Define Your LLM Output Schema

Pydantic models define the structure for LLM outputs:

- `name`: String field for the extracted name
- `age`: Integer field with automatic type validation
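
To see what "automatic type validation" means in practice, here is a minimal sketch that exercises the `Person` model directly with Pydantic, with no LLM call involved:

```python
from pydantic import BaseModel, ValidationError

class Person(BaseModel):
    name: str
    age: int

# Pydantic coerces compatible values ("30" -> 30)...
print(Person(name="John Doe", age="30"))  # name='John Doe' age=30

# ...and rejects incompatible ones with a descriptive error.
try:
    Person(name="John Doe", age="thirty")
except ValidationError as exc:
    print(exc.errors()[0]["type"])  # e.g. "int_parsing" in Pydantic v2
```

Instructor applies the same validation to whatever the LLM returns, so malformed output surfaces as a validation error instead of slipping through silently.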
### Step 2: Configure Your LLM Client
Instructor enhances your LLM client with structured output capabilities. Works with OpenAI, Anthropic, Google, and 15+ providers.
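
Swapping providers only changes how the client is constructed; the call shape stays the same. Here is a hedged sketch using Anthropic (it assumes the `anthropic` package is installed, `ANTHROPIC_API_KEY` is set, and the model name is current):

```python
import instructor
from anthropic import Anthropic
from pydantic import BaseModel

class Person(BaseModel):  # same model as in Step 1
    name: str
    age: int

# Same Instructor interface, different provider.
client = instructor.from_anthropic(Anthropic())

person = client.chat.completions.create(
    model="claude-3-5-sonnet-20241022",  # illustrative model name
    max_tokens=1024,  # Anthropic requires an explicit token limit
    response_model=Person,
    messages=[{"role": "user", "content": "John Doe is 30 years old"}],
)
```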
### Step 3: Execute LLM Extraction

```python
person = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=Person,
    messages=[
        {"role": "user", "content": "John Doe is 30 years old"}
    ],
)
```
Key parameters for structured LLM outputs:

- `model`: Your chosen LLM (GPT-4, Claude, Gemini, etc.)
- `response_model`: Pydantic model for type-safe extraction
- `messages`: Input text for the LLM to process
### Step 4: Work with Validated LLM Data
Get back a fully validated Python object from your LLM - no JSON parsing, no validation errors, just clean data ready to use.
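
Because the result is an ordinary Pydantic model, everything downstream is plain Python. A short sketch, continuing from the `person` object returned above:

```python
# `person` is the Person instance returned by the call in Step 3.
assert isinstance(person, Person)

# Typed attribute access - your editor and type checker both understand it.
summary = f"{person.name} turns {person.age + 1} next year"

# Serialize when you need to hand the data off (Pydantic v2 API).
as_dict = person.model_dump()       # {'name': 'John Doe', 'age': 30}
as_json = person.model_dump_json()  # '{"name":"John Doe","age":30}'
```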
## Enhance LLM Extraction with Field Descriptions
Improve LLM accuracy by providing clear field descriptions:
```python
from pydantic import BaseModel, Field

class Person(BaseModel):
    name: str = Field(description="Person's full name")
    age: int = Field(description="Person's age in years")
```
Field descriptions act as prompts, guiding the LLM to extract exactly what you need.
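
You can see how those descriptions reach the model by inspecting the JSON schema Pydantic generates, which is roughly what Instructor hands to the provider in its tool-calling or JSON modes:

```python
import json

print(json.dumps(Person.model_json_schema(), indent=2))
# The "description" strings appear alongside each property, e.g.:
#   "name": {"description": "Person's full name", "title": "Name", "type": "string"}
#   "age":  {"description": "Person's age in years", "title": "Age", "type": "integer"}
```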
## Handle Optional Data in LLM Responses
Real-world LLM extractions often have missing data. Handle it gracefully:
```python
from typing import Optional
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: Optional[int] = None  # Now age is optional
```
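
With that change, text that never mentions an age no longer fails validation; the field simply comes back empty. A sketch reusing the client from the Quick Start (the exact output depends on the model you call):

```python
person = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=Person,
    messages=[{"role": "user", "content": "John Doe works as a carpenter."}],
)

print(person.name)  # "John Doe"
print(person.age)   # Expected to be None, since the text gives no age
```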
## Continue Your LLM Tutorial Journey
You've successfully extracted structured data from an LLM! Next steps:
- Advanced Response Models - Complex schemas for LLM outputs
- Multi-Provider Setup - Use GPT-4, Claude, Gemini interchangeably
- Production Patterns - Real-world LLM extraction examples
## Common LLM Extraction Patterns
- Entity Extraction: Names, dates, locations from unstructured text
- Sentiment Analysis: Structured sentiment scores with reasoning (sketched below)
- Data Classification: Categorize text into predefined schemas
- Information Parsing: Convert documents into structured databases
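
As a quick illustration of the sentiment-analysis pattern, here is a hedged sketch. The `SentimentResult` schema and the review text are invented for illustration, but the call shape is exactly the one from the tutorial above:

```python
from typing import Literal

from pydantic import BaseModel, Field

class SentimentResult(BaseModel):
    """Hypothetical schema for structured sentiment output."""
    sentiment: Literal["positive", "neutral", "negative"]
    confidence: float = Field(ge=0.0, le=1.0, description="Confidence between 0 and 1")
    reasoning: str = Field(description="One-sentence justification for the label")

result = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=SentimentResult,
    messages=[
        {"role": "user", "content": "Review: 'Setup took five minutes and it just worked.'"}
    ],
)
print(result.sentiment, result.confidence, result.reasoning)
```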
Ready to build more complex LLM extractions? Continue to Response Models →