Generating Structured Output / JSON from LLMs¶

Large language models are now widely used, but getting reliable structured output from them often means adopting a complex framework. This post shows how Instructor simplifies that workflow by using Pydantic models to define and validate the output.

The Problem with Existing LLM Frameworks¶

Current frameworks for large language models (LLMs) often require complex setup and make it hard for developers to control how they interact with the model. Some also require hand-written JSON Schema definitions.

The OpenAI Function Calling Game-Changer¶

OpenAI's Function Calling feature constrains the model's output to arguments that match a declared schema. However, it brings its own complexity, mostly around writing JSON Schema by hand.
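
To make the JSON Schema burden concrete, here is a minimal sketch that compares a hand-written schema for a function-calling tool with the schema Pydantic derives from a class. It assumes Pydantic v2 (for model_json_schema). The tool name extract_user and the two fields are illustrative choices, not taken from a specific API call.

```python
from pydantic import BaseModel

# Hand-written JSON Schema, roughly as a function-calling tool definition needs it.
# The tool name "extract_user" is a hypothetical example.
handwritten_tool = {
    "type": "function",
    "function": {
        "name": "extract_user",
        "parameters": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer"},
            },
            "required": ["name", "age"],
        },
    },
}


class UserDetail(BaseModel):
    name: str
    age: int


# Pydantic v2 derives an equivalent schema from the type hints.
generated = UserDetail.model_json_schema()
handwritten = handwritten_tool["function"]["parameters"]

# Compare required fields and property types between the two schemas.
same_required = generated["required"] == handwritten["required"]
same_types = all(
    generated["properties"][field]["type"] == spec["type"]
    for field, spec in handwritten["properties"].items()
)
print(same_required, same_types)
```

The generated schema also carries extra keys such as titles, so the two dictionaries are not identical; the comparison only checks the parts that matter for typing.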

Why Pydantic?¶

Instructor uses Pydantic to simplify the interaction between the programmer and the language model.

Widespread Adoption: Pydantic is a popular tool among Python developers.
Simplicity: Pydantic models are defined as plain Python classes with type hints.
Framework Compatibility: Many Python frameworks already use Pydantic.

import pydantic
import instructor
from openai import OpenAI

# Enables the response_model
client = instructor.from_openai(OpenAI())


class UserDetail(pydantic.BaseModel):
    name: str
    age: int

    def introduce(self):
        return f"Hello I'm {self.name} and I'm {self.age} years old"


user: UserDetail = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=UserDetail,
    messages=[
        {"role": "user", "content": "Extract Jason is 25 years old"},
    ],
)
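
The call above needs an API key and network access. To see what the response_model step amounts to on the Pydantic side, the following sketch validates a JSON string directly. The JSON payloads are made-up stand-ins for a model reply, and Pydantic v2's model_validate_json is assumed; this is an illustration, not a description of Instructor's internals.

```python
from pydantic import BaseModel, ValidationError


class UserDetail(BaseModel):
    name: str
    age: int

    def introduce(self) -> str:
        return f"Hello I'm {self.name} and I'm {self.age} years old"


# Hypothetical raw reply from a model, already in JSON form.
raw_reply = '{"name": "Jason", "age": 25}'

user = UserDetail.model_validate_json(raw_reply)
greeting = user.introduce()

# A reply that does not match the schema raises an error instead of passing through.
try:
    UserDetail.model_validate_json('{"name": "Jason", "age": "unknown"}')
    bad_reply_rejected = False
except ValidationError:
    bad_reply_rejected = True

print(greeting, bad_reply_rejected)
```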

Simplifying Validation Flow with Pydantic¶

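Pydantic validators make features such as re-asking and self-critique easier to implement: a validator raises an error when the output is unacceptable, and that error message can be sent back to the model. The example below attaches an LLM-backed validator to the answer field.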

from typing_extensions import Annotated
from pydantic import BaseModel, BeforeValidator
from instructor import llm_validator


class QuestionAnswerNoEvil(BaseModel):
    question: str
    answer: Annotated[
        str,
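        # llm_validator needs an Instructor client; this reuses the client created earlier
        BeforeValidator(
            llm_validator("don't say objectionable things", client=client)
        ),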
    ]
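
The same pattern works with an ordinary rule-based validator, which is easier to run locally. The sketch below assumes Pydantic v2 and uses a hypothetical blocklist and helper name (no_blocked_words); it shows the error text that a re-asking loop could send back to the model. Instructor exposes a max_retries argument for this kind of loop, but the example does not call it.

```python
from typing import Annotated

from pydantic import AfterValidator, BaseModel, ValidationError

# Hypothetical blocklist, used only for illustration.
BLOCKLIST = {"evil"}


def no_blocked_words(value: str) -> str:
    """Reject answers that contain any blocked word (case-insensitive)."""
    lowered = value.lower()
    for word in sorted(BLOCKLIST):
        if word in lowered:
            raise ValueError(f"answer contains blocked word: {word!r}")
    return value


class QuestionAnswer(BaseModel):
    question: str
    answer: Annotated[str, AfterValidator(no_blocked_words)]


error_text = ""
try:
    QuestionAnswer(
        question="What is the meaning of life?",
        answer="Do something evil.",
    )
except ValidationError as exc:
    # This message is what a retry prompt could include.
    error_text = str(exc)

print(error_text)
```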

The Modular Approach¶

Pydantic allows for modular output schemas. This leads to more organized code.

Composition of Schemas¶

class UserDetails(BaseModel):
    name: str
    age: int


class UserWithAddress(UserDetails):
    address: str
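
A short sketch of what the subclass inherits, assuming Pydantic v2 (model_fields); the sample values are invented.

```python
from pydantic import BaseModel


class UserDetails(BaseModel):
    name: str
    age: int


class UserWithAddress(UserDetails):
    address: str


# The subclass carries the parent's fields first, then its own.
field_names = list(UserWithAddress.model_fields)

user = UserWithAddress(name="Jason", age=25, address="123 Main St")
print(field_names, isinstance(user, UserDetails))
```

Because UserWithAddress is still a UserDetails, code written against the base schema keeps working with the extended one.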

Defining Relationships¶

from typing import List


class UserDetail(BaseModel):
    id: int
    age: int
    name: str
    friends: List[int]


class UserRelationships(BaseModel):
    users: List[UserDetail]
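
Since friends holds ids rather than nested objects, the references have to be resolved after parsing. The following sketch, assuming Pydantic v2 and an invented two-user payload, validates the structure and then looks up each friend by id.

```python
from typing import List

from pydantic import BaseModel


class UserDetail(BaseModel):
    id: int
    age: int
    name: str
    friends: List[int]  # ids of other users in the same payload


class UserRelationships(BaseModel):
    users: List[UserDetail]


# Invented sample data standing in for a model reply.
payload = {
    "users": [
        {"id": 1, "age": 25, "name": "Jason", "friends": [2]},
        {"id": 2, "age": 30, "name": "Sarah", "friends": [1]},
    ]
}

graph = UserRelationships.model_validate(payload)

# Resolve id references into names.
by_id = {user.id: user for user in graph.users}
friend_names = {
    user.name: [by_id[friend_id].name for friend_id in user.friends]
    for user in graph.users
}
print(friend_names)
```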

Using Enums¶

from enum import Enum, auto


class Role(Enum):
    PRINCIPAL = auto()
    TEACHER = auto()
    STUDENT = auto()
    OTHER = auto()


class UserDetail(BaseModel):
    age: int
    name: str
    role: Role
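
One detail worth knowing: auto() assigns integer values, so the schema shown to the model lists 1 to 4 rather than the role names. The sketch below contrasts that with a string-valued enum. The string variant is a suggestion made here, not something the example above prescribes, and the schema lookup assumes Pydantic v2's "$defs" layout.

```python
from enum import Enum, auto

from pydantic import BaseModel


class AutoRole(Enum):
    PRINCIPAL = auto()
    TEACHER = auto()
    STUDENT = auto()
    OTHER = auto()


class Role(str, Enum):
    # String values keep the schema self-describing.
    PRINCIPAL = "PRINCIPAL"
    TEACHER = "TEACHER"
    STUDENT = "STUDENT"
    OTHER = "OTHER"


class UserDetail(BaseModel):
    age: int
    name: str
    role: Role


auto_values = [role.value for role in AutoRole]
schema_values = UserDetail.model_json_schema()["$defs"]["Role"]["enum"]

# Invented sample data; the role arrives as a plain string and becomes an enum member.
user = UserDetail.model_validate({"age": 41, "name": "Ms. Lee", "role": "TEACHER"})
print(auto_values, schema_values, user.role)
```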

Flexible Schemas¶

from typing import List


class Property(BaseModel):
    key: str
    value: str


class UserDetail(BaseModel):
    age: int
    name: str
    properties: List[Property]
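
A list of key/value pairs lets the model return attributes that were not known in advance. As a minimal sketch with invented data (Pydantic v2 assumed), the pairs can be folded into a dictionary once validated:

```python
from typing import List

from pydantic import BaseModel


class Property(BaseModel):
    key: str
    value: str


class UserDetail(BaseModel):
    age: int
    name: str
    properties: List[Property]


# Invented sample data standing in for a model reply.
payload = {
    "age": 25,
    "name": "Jason",
    "properties": [
        {"key": "city", "value": "Toronto"},
        {"key": "hobby", "value": "chess"},
    ],
}

user = UserDetail.model_validate(payload)

# Fold the open-ended pairs into a plain dict for downstream code.
as_dict = {prop.key: prop.value for prop in user.properties}
print(as_dict)
```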

Chain of Thought¶

class TimeRange(BaseModel):
    chain_of_thought: str
    start_time: int
    end_time: int


class UserDetail(BaseModel):
    id: int
    age: int
    name: str
    work_time: TimeRange
    leisure_time: TimeRange
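
Placing chain_of_thought first means the reasoning field comes before the values in the schema the model sees. The sketch below adds field descriptions and checks that ordering. It assumes Pydantic v2, and the description texts (including the 0-23 hour convention) are assumptions added for illustration.

```python
from pydantic import BaseModel, Field


class TimeRange(BaseModel):
    chain_of_thought: str = Field(
        description="Step-by-step reasoning used to work out the time range"
    )
    # The hour convention below is an assumption, not stated in the original example.
    start_time: int = Field(description="Start hour, 0-23")
    end_time: int = Field(description="End hour, 0-23")


schema = TimeRange.model_json_schema()

# Properties appear in definition order, so the reasoning field comes first.
property_order = list(schema["properties"])
cot_description = schema["properties"]["chain_of_thought"]["description"]
print(property_order)
```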

Language Models as Microservices¶

The architecture resembles FastAPI: most code can be written as ordinary Python functions that take and return Pydantic objects. In many cases this removes the need for explicit prompt chains.

FastAPI Stub¶

import fastapi
from pydantic import BaseModel

class UserDetails(BaseModel):
    name: str
    age: int

app = fastapi.FastAPI()

@app.get("/user/{user_id}", response_model=UserDetails)
async def get_user(user_id: int) -> UserDetails:
    return ...

Using Instructor as a Function¶

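def extract_user(text: str) -> UserDetails:
    # Send the input text as the user message; Instructor parses the reply into UserDetails
    return client.chat.completions.create(
        model="gpt-3.5-turbo",
        response_model=UserDetails,
        messages=[{"role": "user", "content": f"Extract {text}"}],
    )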

Response Modeling¶

from typing import Optional


class MaybeUser(BaseModel):
    result: Optional[UserDetail] = None
    error: bool = False
    message: Optional[str] = None
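
A wrapper like this lets the model report that nothing could be extracted instead of inventing a user. The sketch below shows one way calling code might branch on it. It assumes Pydantic v2, gives the optional fields defaults so they may be omitted, and uses a hypothetical helper named describe with invented sample data.

```python
from typing import Optional

from pydantic import BaseModel


class UserDetail(BaseModel):
    age: int
    name: str


class MaybeUser(BaseModel):
    result: Optional[UserDetail] = None
    error: bool = False
    message: Optional[str] = None


def describe(maybe: MaybeUser) -> str:
    """Hypothetical helper: turn either outcome into a short string."""
    if maybe.error or maybe.result is None:
        return f"no user: {maybe.message or 'unknown reason'}"
    return f"{maybe.result.name} ({maybe.result.age})"


# Invented sample data for the success and failure cases.
found = MaybeUser.model_validate({"result": {"age": 25, "name": "Jason"}})
missing = MaybeUser.model_validate(
    {"error": True, "message": "No user mentioned in the text"}
)

print(describe(found))
print(describe(missing))
```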

Conclusion¶

Instructor, with Pydantic, simplifies interaction with language models. It is usable for both experienced and new developers.

Getting Started Guide - Learn how to install and use Instructor
Model Providers - Explore supported LLM providers
Validation Context - Understand how to validate LLM outputs
Response Models - Deep dive into defining structured outputs