Working with structured outputs¶

If you've seen my talk on this topic, you can skip this chapter.

tl;dr

When we work with LLMs you find that many times we are not building chatbots, instead we're working with structured outputs in order to solve a problem by returning machine readable data. However the way we think about the problem is still very much influenced by the way we think about chatbots. This is a problem because it leads to a lot of confusion and frustration. In this chapter we'll try to understand why this happens and how we can fix it.

In [1]:

  Copied!     
 
import traceback
import traceback

In [2]:

  Copied!     
 
RED = "\033[91m"
RESET = "\033[0m"
RED = "\033[91m" RESET = "\033[0m"

The fundamental problem with JSON and Dictionaries¶

Lets say we have a simple JSON object, and we want to work with it. We can use the json module to load it into a dictionary, and then work with it. However, this is a bit of a pain, because we have to manually check the types of the data, and we have to manually check if the data is valid. For example, lets say we have a JSON object that looks like this:

In [3]:

  Copied!     
 
data = [{"first_name": "Jason", "age": 10}, {"firstName": "Jason", "age": "10"}]
data = [{"first_name": "Jason", "age": 10}, {"firstName": "Jason", "age": "10"}]

We have a name field, which is a string, and an age field, which is an integer. However, if we were to load this into a dictionary, we would have no way of knowing if the data is valid. For example, we could have a string for the age, or we could have a float for the age. We could also have a string for the name, or we could have a list for the name.

In [4]:

  Copied!     
 
for obj in data:
    name = obj.get("first_name")
    age = obj.get("age")
    print(f"{name} is {age}")

for obj in data:
    name = obj.get("first_name")
    age = obj.get("age")
    try:
        age_next_year = age + 1
        print(f"Next year {name} will be {age_next_year} years old")
    except TypeError:
        traceback.print_exc()
for obj in data: name = obj.get("first_name") age = obj.get("age") print(f"{name} is {age}") for obj in data: name = obj.get("first_name") age = obj.get("age") try: age_next_year = age + 1 print(f"Next year {name} will be {age_next_year} years old") except TypeError: traceback.print_exc()

Jason is 10
None is 10
Next year Jason will be 11 years old

Traceback (most recent call last):
  File "/var/folders/l2/jjqj299126j0gycr9kkkt9xm0000gn/T/ipykernel_24047/2607506000.py", line 10, in <module>
    age_next_year = age + 1
                    ~~~~^~~
TypeError: can only concatenate str (not "int") to str

You see that while we were able to program with a dictionary, we had issues with the data being valid. We would have had to manually check the types of the data, and we had to manually check if the data was valid. This is a pain, and we can do better.

Pydantic to the rescue¶

Pydantic is a library that allows us to define data structures, and then validate them.

In [5]:

  Copied!     
 
from pydantic import BaseModel, Field, ValidationError

class Person(BaseModel):
    name: str
    age: int


person = Person(name="Sam", age=30)
person
from pydantic import BaseModel, Field, ValidationError class Person(BaseModel): name: str age: int person = Person(name="Sam", age=30) person

Out[5]:

Person(name='Sam', age=30)

In [6]:

  Copied!     
 
# Data is correctly casted to the right type
person = Person.model_validate({"name": "Sam", "age": "30"})
person
# Data is correctly casted to the right type person = Person.model_validate({"name": "Sam", "age": "30"}) person

Out[6]:

Person(name='Sam', age=30)

In [7]:

  Copied!     
 
assert person.name == "Sam"
assert person.age == 30

try:
    assert person.age == 20
except AssertionError:
    traceback.print_exc()
assert person.name == "Sam" assert person.age == 30 try: assert person.age == 20 except AssertionError: traceback.print_exc()

Traceback (most recent call last):
  File "/var/folders/l2/jjqj299126j0gycr9kkkt9xm0000gn/T/ipykernel_24047/3040264600.py", line 5, in <module>
    assert person.age == 20
           ^^^^^^^^^^^^^^^^
AssertionError

In [8]:

  Copied!     
 
# Data is validated to get better error messages
try:
    person = Person.model_validate({"first_name": "Sam", "age": "30.2"})
except ValidationError as e:
    print("Validation Error:")
    for error in e.errors():
        print(f"Field: {error['loc'][0]}, Error: {error['msg']}")

    print(f"{RED}\nOriginal Traceback Below{RESET}")
    traceback.print_exc()
# Data is validated to get better error messages try: person = Person.model_validate({"first_name": "Sam", "age": "30.2"}) except ValidationError as e: print("Validation Error:") for error in e.errors(): print(f"Field: {error['loc'][0]}, Error: {error['msg']}") print(f"{RED}\nOriginal Traceback Below{RESET}") traceback.print_exc()

Validation Error:
Field: name, Error: Field required
Field: age, Error: Input should be a valid integer, unable to parse string as an integer

Original Traceback Below

Traceback (most recent call last):
  File "/var/folders/l2/jjqj299126j0gycr9kkkt9xm0000gn/T/ipykernel_24047/621989455.py", line 3, in <module>
    person = Person.model_validate({"first_name": "Sam", "age": "30.2"})
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/homebrew/Caskroom/miniconda/base/envs/instructor/lib/python3.11/site-packages/pydantic/main.py", line 509, in model_validate
    return cls.__pydantic_validator__.validate_python(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pydantic_core._pydantic_core.ValidationError: 2 validation errors for Person
name
  Field required [type=missing, input_value={'first_name': 'Sam', 'age': '30.2'}, input_type=dict]
    For further information visit https://errors.pydantic.dev/2.6/v/missing
age
  Input should be a valid integer, unable to parse string as an integer [type=int_parsing, input_value='30.2', input_type=str]
    For further information visit https://errors.pydantic.dev/2.6/v/int_parsing

By introducing pydantic into any python codebase you can get a lot of benefits. You can get type checking, you can get validation, and you can get autocomplete. This is a huge win, because it means you can catch errors before they happen. This is even more useful when we rely on language models to generate data for us.

You can also define validators that are run on the data. This is useful because it means you can catch errors before they happen. For example, you can define a validator that checks if the age is greater than 0. This is useful because it means you can catch errors before they happen.

Fundamental problem with asking for JSON from OpenAI¶

As we shall see below, the correct json format would be something of the format below:

{
    "name": "Jason",
    "age": 10
}

However, we get errorenous outputs like:

{
  "jason": 10
}

In [9]:

  Copied!     
 
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "user", "content": "Please give me jason is 10 as a json object ```json\n"},
    ],
    n=10,
    temperature=1,
)

print("json that we want:")
print("""
{
    "name": "Jason",
    "age": 10
}
""")

for choice in resp.choices:
    json = choice.message.content
    try:
        person = Person.model_validate_json(json)
        print(f"correctly parsed {person=}")
    except Exception as e:
        print("error!!")
        print(json)
from openai import OpenAI client = OpenAI() resp = client.chat.completions.create( model="gpt-3.5-turbo", messages=[ {"role": "user", "content": "Please give me jason is 10 as a json object ```json\n"}, ], n=10, temperature=1, ) print("json that we want:") print(""" { "name": "Jason", "age": 10 } """) for choice in resp.choices: json = choice.message.content try: person = Person.model_validate_json(json) print(f"correctly parsed {person=}") except Exception as e: print("error!!") print(json)

json that we want:

{
    "name": "Jason",
    "age": 10
}

error!!
{
  "jason": 10
}
correctly parsed person=Person(name='Jason', age=10)
correctly parsed person=Person(name='jason', age=10)
error!!
{
  "Jason": {
    "age": 10
  }
}
error!!
{
  "Jason": {
    "age": 10
  }
}
error!!
{
  "Jason": {
    "age": 10
  }
}
error!!
{
  "Jason": {
    "age": 10
  }
}
correctly parsed person=Person(name='Jason', age=10)
correctly parsed person=Person(name='Jason', age=10)
error!!
{
  "jason": 10
}

Introduction to Function Calling¶

The json could be anything! We could add more and more into a prompt and hope it works, or we can use something called function calling to directly specify the schema we want.

Function Calling

In an API call, you can describe functions and have the model intelligently choose to output a JSON object containing arguments to call one or many functions. The Chat Completions API does not call the function; instead, the model generates JSON that you can use to call the function in your code.

In [10]:

  Copied!     
 
import datetime


class PersonBirthday(BaseModel):
    name: str
    age: int
    birthday: datetime.date


schema = {
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "birthday": {"type": "string", "format": "YYYY-MM-DD"},
    },
    "required": ["name", "age"],
    "type": "object",
}

resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user",
            "content": f"Extract `Jason Liu is thirty years old his birthday is yesturday` into json today is {datetime.date.today()}",
        },
    ],
    functions=[{"name": "Person", "parameters": schema}],
    function_call="auto",
)

PersonBirthday.model_validate_json(resp.choices[0].message.function_call.arguments)
import datetime class PersonBirthday(BaseModel): name: str age: int birthday: datetime.date schema = { "properties": { "name": {"type": "string"}, "age": {"type": "integer"}, "birthday": {"type": "string", "format": "YYYY-MM-DD"}, }, "required": ["name", "age"], "type": "object", } resp = client.chat.completions.create( model="gpt-3.5-turbo", messages=[ { "role": "user", "content": f"Extract `Jason Liu is thirty years old his birthday is yesturday` into json today is {datetime.date.today()}", }, ], functions=[{"name": "Person", "parameters": schema}], function_call="auto", ) PersonBirthday.model_validate_json(resp.choices[0].message.function_call.arguments)

Out[10]:

PersonBirthday(name='Jason Liu', age=30, birthday=datetime.date(1994, 3, 26))

But it turns out, pydantic actually not only does our serialization, we can define the schema as well as add additional documentation!

In [11]:

  Copied!     
 
PersonBirthday.model_json_schema()
PersonBirthday.model_json_schema()

Out[11]:

{'properties': {'name': {'title': 'Name', 'type': 'string'},
  'age': {'title': 'Age', 'type': 'integer'},
  'birthday': {'format': 'date', 'title': 'Birthday', 'type': 'string'}},
 'required': ['name', 'age', 'birthday'],
 'title': 'PersonBirthday',
 'type': 'object'}

We can even define nested complex schemas, and documentation with ease.

In [12]:

  Copied!     
 
class Address(BaseModel):
    address: str = Field(description="Full street address")
    city: str
    state: str


class PersonAddress(Person):
    """A Person with an address"""

    address: Address


PersonAddress.model_json_schema()
class Address(BaseModel): address: str = Field(description="Full street address") city: str state: str class PersonAddress(Person): """A Person with an address""" address: Address PersonAddress.model_json_schema()

Out[12]:

{'$defs': {'Address': {'properties': {'address': {'description': 'Full street address',
     'title': 'Address',
     'type': 'string'},
    'city': {'title': 'City', 'type': 'string'},
    'state': {'title': 'State', 'type': 'string'}},
   'required': ['address', 'city', 'state'],
   'title': 'Address',
   'type': 'object'}},
 'description': 'A Person with an address',
 'properties': {'name': {'title': 'Name', 'type': 'string'},
  'age': {'title': 'Age', 'type': 'integer'},
  'address': {'$ref': '#/$defs/Address'}},
 'required': ['name', 'age', 'address'],
 'title': 'PersonAddress',
 'type': 'object'}

These simple concepts become what we built into instructor and most of the work has been around documenting how we can leverage schema engineering. Except now we use instructor.patch() to add a bunch more capabilities to the OpenAI SDK.

The core idea around Instructor¶

Using function calling allows us use a llm that is finetuned to use json_schema and output json.
Pydantic can be used to define the object, schema, and validation in one single class, allow us to encapsulate everything neatly
As a library with 100M downloads, we can leverage pydantic to do all the heavy lifting for us and fit nicely with the python ecosystem

In [13]:

  Copied!     
 
import instructor
import datetime

# patch the client to add `response_model` to the `create` method
client = instructor.patch(OpenAI(), mode=instructor.Mode.MD_JSON)

resp = client.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    messages=[
        {
            "role": "user",
            "content": f"""
            Today is {datetime.date.today()}

            Extract `Jason Liu is thirty years old his birthday is yesturday`
            he lives at 123 Main St, San Francisco, CA""",
        },
    ],
    response_model=PersonAddress,
)
resp
import instructor import datetime # patch the client to add `response_model` to the `create` method client = instructor.patch(OpenAI(), mode=instructor.Mode.MD_JSON) resp = client.chat.completions.create( model="gpt-3.5-turbo-1106", messages=[ { "role": "user", "content": f""" Today is {datetime.date.today()} Extract `Jason Liu is thirty years old his birthday is yesturday` he lives at 123 Main St, San Francisco, CA""", }, ], response_model=PersonAddress, ) resp

Out[13]:

PersonAddress(name='Jason Liu', age=30, address=Address(address='123 Main St', city='San Francisco', state='CA'))

By defining response_model we can leverage pydantic to do all the heavy lifting. Later we'll introduce the other features that instructor.patch() adds to the OpenAI SDK. but for now, this small change allows us to do a lot more with the API.

Is instructor the only way to do this?¶

No. Libraries like Marvin, Langchain, and Llamaindex all now leverage the Pydantic object in similar ways. The goal is to be as light weight as possible, get you as close as possible to the openai api, and then get out of your way.

More importantly, we've also added straight forward validation and reasking to the mix.

The goal of instructor is to show you how to think about structured prompting and provide examples and documentation that you can take with you to any framework.

For further exploration: