Semantic Validation with LLMs

This guide covers semantic validation in Instructor: using LLMs themselves to validate content against complex, subjective, or contextual criteria that would be difficult to implement with traditional rule-based approaches.

Overview

Semantic validation leverages the language understanding capabilities of LLMs to validate inputs against natural language criteria. While traditional validation uses explicit rules and patterns, semantic validation can understand nuance, context, and subjective qualities in data.

When to Use Semantic Validation

Semantic validation is particularly useful for:

  • Complex criteria that are difficult to express with rules
  • Subjective qualities like tone, style, or appropriateness
  • Contextual validation that requires understanding relationships between fields
  • Policy enforcement that involves nuanced understanding of guidelines
  • Content moderation for detecting harmful or inappropriate content

How It Works

In Instructor, semantic validation is implemented through the llm_validator function, which creates a validator that uses an LLM to check if values conform to specified requirements:

import instructor
from typing import Annotated
from pydantic import BaseModel, BeforeValidator
from instructor import llm_validator

# Initialize client
client = instructor.from_provider("openai/gpt-4o-mini")

class UserComment(BaseModel):
    username: str
    comment: Annotated[
        str,
        BeforeValidator(
            llm_validator(
                "Comment must be constructive, respectful, and not contain hate speech or profanity",
                client=client
            )
        )
    ]

The llm_validator function takes:

  1. A natural language description of the validation criteria
  2. An Instructor client instance to perform the validation
  3. Optional parameters for configuration

During validation, the LLM evaluates whether the input matches the specified criteria and either passes the value or raises a validation error with a detailed explanation.
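
For example, an input that violates the criteria raises a standard Pydantic ValidationError whose message includes the LLM's explanation (the username and comment below are illustrative):

from pydantic import ValidationError

try:
    UserComment(
        username="example_user",
        comment="This is worthless, and so is everyone who wrote it."
    )
except ValidationError as e:
    print(e)  # the error message includes the LLM's reasoning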

Validation Flow

The following diagram illustrates how semantic validation works in Instructor:

flowchart TD
    A[Input Data] --> B[Pydantic Validation Process]
    B --> C{Field has Semantic<br/>Validator?}
    C -->|No| D[Standard Validation]
    C -->|Yes| E[Call LLM with Validation Criteria]
    E --> F{LLM Determines<br/>Value is Valid?}
    F -->|Yes| G[Validation Passes]
    F -->|No| H[Validation Fails with LLM-Generated Error]
    H --> I{Auto-Retry Enabled?}
    I -->|Yes| J[Try Again with Error Context]
    I -->|No| K[Return Validation Error]
    J --> E

    classDef process fill:#e2f0fb,stroke:#b8daff,color:#004085;
    classDef decision fill:#fff3cd,stroke:#ffeeba,color:#856404;
    classDef success fill:#d4edda,stroke:#c3e6cb,color:#155724;
    classDef error fill:#f8d7da,stroke:#f5c6cb,color:#721c24;

    class A,B,E,J process
    class C,F,I decision
    class G,D success
    class H,K error

Basic Usage

Here's a complete example of semantic validation in action:

# Standard library imports
from typing import Annotated

# Third-party imports
from pydantic import BaseModel, BeforeValidator
import instructor
from instructor import llm_validator

# Initialize client
client = instructor.from_provider("openai/gpt-4o-mini")

class ProductDescription(BaseModel):
    """Model for validating product descriptions."""

    name: str
    description: Annotated[
        str,
        BeforeValidator(
            llm_validator(
                """The description must be:
                1. Professional and factual
                2. Free of excessive hyperbole or unsubstantiated claims
                3. Between 50-200 words in length
                4. Written in third person (no "you" or "your")
                5. Free of spelling and grammar errors""",
                client=client
            )
        )
    ]

# Example usage with Jinja templating
try:
    product = client.chat.completions.create(
        response_model=ProductDescription,
        messages=[
            {
                "role": "system",
                "content": "Generate a product description based on the product name."
            },
            {
                "role": "user", 
                "content": "Create a description for: {{ product_name }}"
            }
        ],
        context={"product_name": "UltraClean 9000 Washing Machine"}
    )
    print(product.model_dump_json(indent=2))
except Exception as e:
    print(f"Validation error: {e}")

Advanced Validation Patterns

Content Policy Enforcement

This example validates user-generated content against community guidelines:

import instructor
from typing import Annotated
from pydantic import BaseModel, BeforeValidator
from instructor import llm_validator

client = instructor.from_provider("openai/gpt-4o-mini")

class Comment(BaseModel):
    """Model representing a user comment with content moderation."""

    user_id: str
    content: Annotated[
        str,
        BeforeValidator(
            llm_validator(
                """Content must comply with community guidelines:
                - No hate speech, harassment, or discrimination
                - No explicit sexual or violent content
                - No promotion of illegal activities
                - No sharing of personal information
                - No spamming or excessive self-promotion""", 
                client=client
            )
        )
    ]
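
A hedged usage sketch (the user ID and comment text are illustrative); constructing the model directly runs the validator, which makes an LLM call:

from pydantic import ValidationError

try:
    comment = Comment(
        user_id="user_123",
        content="Great write-up, the step-by-step instructions were easy to follow."
    )
    print("Accepted:", comment.content)
except ValidationError as e:
    print("Rejected:", e)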

Topic Relevance Validation

This validator ensures that responses stay on topic:

import instructor
from typing import Annotated
from pydantic import BaseModel, BeforeValidator
from instructor import llm_validator

client = instructor.from_provider("openai/gpt-4o-mini")

class ForumPost(BaseModel):
    topic: str
    post: Annotated[
        str,
        BeforeValidator(
            llm_validator(
                "The post must be directly relevant to the specified topic and not drift to unrelated subjects",
                client=client
            )
        )
    ]

    # Using Jinja templating for validation against dynamic values
    @classmethod
    def validate_post(cls, topic_name: str, post_content: str) -> "ForumPost":
        return client.chat.completions.create(
            response_model=cls,
            messages=[
                {
                    "role": "system",
                    "content": """Validate that the forum post content stays relevant to the topic.
                    If it's not relevant, explain why in detail."""
                },
                {
                    "role": "user",
                    "content": """
                    Topic: {{ topic }}

                    Post content:
                    {{ post }}

                    Is this post relevant to the topic?
                    """
                }
            ],
            context={
                "topic": topic_name,
                "post": post_content
            }
        )
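
Calling the helper might look like this (the topic and post text are illustrative):

post = ForumPost.validate_post(
    topic_name="Home coffee brewing",
    post_content="Switching to a burr grinder noticeably improved my pour-overs."
)
print(post.post)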

Fact-Checking Validator

This complex validator assesses factual accuracy:

import instructor
from typing import Annotated, List
from pydantic import BaseModel, Field, BeforeValidator
from instructor import llm_validator

client = instructor.from_provider("openai/gpt-4o-mini")

class FactCheckedClaim(BaseModel):
    """Model for validating factual accuracy of claims."""

    claim: str
    is_accurate: bool = Field(description="Whether the claim is factually accurate")
    supporting_evidence: List[str] = Field(
        default_factory=list,
        description="Evidence supporting or refuting the claim"
    )

    @classmethod
    def validate_claim(cls, text: str) -> "FactCheckedClaim":
        return client.chat.completions.create(
            response_model=cls,
            messages=[
                {
                    "role": "system",
                    "content": "You are a fact-checking system. Assess the factual accuracy of the claim."
                },
                {
                    "role": "user",
                    "content": "Fact check this claim: {{ claim }}"
                }
            ],
            context={"claim": text}
        )
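
For example (the claim is illustrative):

result = FactCheckedClaim.validate_claim(
    "The Great Wall of China is visible from the Moon with the naked eye."
)
print(result.is_accurate)
for evidence in result.supporting_evidence:
    print("-", evidence)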

Complex Multi-Field Validation

For validation that needs to compare multiple fields, you can use model validators:

import instructor
from typing import List
from pydantic import BaseModel, model_validator
from instructor import Validator  # For response type

client = instructor.from_provider("openai/gpt-4o-mini")

class Report(BaseModel):
    """Model representing a report with related fields that need semantic validation."""

    title: str
    summary: str
    key_findings: List[str]

    @model_validator(mode='after')
    def validate_consistency(self):
        # Semantic validation at the model level using Jinja templating
        validation_result = client.chat.completions.create(
            response_model=Validator,
            messages=[
                {
                    "role": "system",
                    "content": "Validate that the summary accurately reflects the key findings."
                },
                {
                    "role": "user",
                    "content": """
                        Please validate if this summary accurately reflects the key findings:

                        Title: {{ title }}
                        Summary: {{ summary }}

                        Key findings:
                        {% for finding in findings %}
                        - {{ finding }}
                        {% endfor %}

                        Evaluate for consistency, completeness, and accuracy.
                    """
                }
            ],
            context={
                "title": self.title,
                "summary": self.summary,
                "findings": self.key_findings
            }
        )

        if not validation_result.is_valid:
            raise ValueError(f"Consistency error: {validation_result.reason}")

        return self
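
Because the model validator raises ValueError, failures surface as a standard Pydantic ValidationError (the report contents are illustrative):

from pydantic import ValidationError

try:
    report = Report(
        title="Q3 Customer Survey",
        summary="Customers are broadly satisfied, citing faster support responses.",
        key_findings=[
            "82% of respondents rated support 4/5 or higher",
            "Median first-response time fell from 6 hours to 2 hours",
        ],
    )
except ValidationError as e:
    print(f"Consistency check failed: {e}")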

Best Practices

  1. Be Specific in Criteria: Provide clear, detailed validation criteria in natural language
  2. Use Appropriate Models: Larger models tend to give better, more nuanced validation
  3. Balance Cost and Latency: Remember that each validation adds an LLM API call
  4. Provide Examples: Include examples of both valid and invalid content in your criteria
  5. Handle Retries: Configure retry logic for edge cases (see the sketch after this list)
  6. Use Jinja Templates: When validating against dynamic values, use Jinja templating
  7. Separate Concerns: Keep validation criteria focused on specific aspects
  8. Consider Context: Use model-level validation when comparing multiple fields
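
For point 5, Instructor's max_retries parameter re-sends the request with the validation error included as context, giving the model a chance to correct itself. A minimal sketch (the model choice and criteria are illustrative):

import instructor
from typing import Annotated
from pydantic import BaseModel, BeforeValidator
from instructor import llm_validator

client = instructor.from_provider("openai/gpt-4o-mini")

class Summary(BaseModel):
    text: Annotated[
        str,
        BeforeValidator(
            llm_validator("Must be a neutral, factual summary", client=client)
        )
    ]

# On validation failure, the request is retried up to 2 more times,
# with the error message passed back so the model can self-correct.
summary = client.chat.completions.create(
    response_model=Summary,
    messages=[{"role": "user", "content": "Summarize: remote work has grown steadily since 2020."}],
    max_retries=2,
)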

Advanced Configuration

The llm_validator function supports several configuration options:

from instructor import llm_validator
from pydantic import BaseModel, BeforeValidator
from typing import Annotated

# Configure the validator with options
validator = llm_validator(
    statement="Must be a professional, concise product description",
    client=client,  # Required Instructor client
    allow_override=True,  # Allow LLM to fix invalid values
    model="gpt-4o",  # Specify model to use for validation
    temperature=0.2  # Sampling temperature; default is 0 for deterministic validation
)

class Product(BaseModel):
    description: Annotated[str, BeforeValidator(validator)]
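
With allow_override=True, the validator can return a corrected value instead of raising an error; a hedged sketch (the input string is illustrative):

# The LLM may rewrite a non-compliant value rather than rejecting it.
product = Product(description="BEST BLENDER EVER!!! BUY NOW!!!")
print(product.description)  # potentially a rewritten, compliant description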

Performance Considerations

Semantic validation adds API calls to your workflow, which impacts:

  1. Latency: Each validation requires an additional API call
  2. Cost: More API calls mean higher usage costs
  3. Reliability: Depends on API availability and response quality

Consider these trade-offs when implementing semantic validation, especially for high-volume applications.
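
If the same values are validated repeatedly, memoizing results can reduce both latency and cost. A minimal sketch, assuming stable criteria and hashable string inputs (cached_validate and CachedComment are illustrative names):

from functools import lru_cache
from typing import Annotated

import instructor
from instructor import llm_validator
from pydantic import BaseModel, BeforeValidator

client = instructor.from_provider("openai/gpt-4o-mini")

_validate = llm_validator(
    "Comment must be constructive and respectful", client=client
)

@lru_cache(maxsize=1024)
def cached_validate(value: str) -> str:
    # llm_validator returns a callable that returns the (possibly fixed)
    # value or raises; repeated identical inputs skip the API call.
    return _validate(value)

class CachedComment(BaseModel):
    content: Annotated[str, BeforeValidator(cached_validate)]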

Comparison with Rule-Based Validation

| Aspect         | Rule-Based Validation            | Semantic Validation                                 |
|----------------|----------------------------------|-----------------------------------------------------|
| Implementation | Regular expressions, constraints | Natural language criteria                           |
| Complexity     | Simple rules, explicit patterns  | Can handle subjective criteria                      |
| Speed          | Fast, no external calls          | Slower, requires API calls                          |
| Cost           | No additional API costs          | Each validation costs tokens                        |
| Flexibility    | Limited to programmable rules    | Can validate against any natural language criteria |
| Maintenance    | Rules must be updated manually   | Criteria can be more adaptable                      |

Semantic validation expands what's possible with validation beyond traditional rule-based approaches. By using LLMs to validate content against natural language criteria, you can build more sophisticated validation systems that understand context, nuance, and complex relationships.