Working with structured outputs¶
If you've seen my talk on this topic, you can skip this chapter.
tl;dr
When we work with LLMs you find that many times we are not building chatbots, instead we're working with structured outputs in order to solve a problem by returning machine readable data. However the way we think about the problem is still very much influenced by the way we think about chatbots. This is a problem because it leads to a lot of confusion and frustration. In this chapter we'll try to understand why this happens and how we can fix it.
import traceback
RED = "\033[91m"
RESET = "\033[0m"
The fundamental problem with JSON and Dictionaries¶
Lets say we have a simple JSON object, and we want to work with it. We can use the json
module to load it into a dictionary, and then work with it. However, this is a bit of a pain, because we have to manually check the types of the data, and we have to manually check if the data is valid. For example, lets say we have a JSON object that looks like this:
data = [{"first_name": "Jason", "age": 10}, {"firstName": "Jason", "age": "10"}]
We have a name
field, which is a string, and an age
field, which is an integer. However, if we were to load this into a dictionary, we would have no way of knowing if the data is valid. For example, we could have a string for the age, or we could have a float for the age. We could also have a string for the name, or we could have a list for the name.
for obj in data:
name = obj.get("first_name")
age = obj.get("age")
print(f"{name} is {age}")
for obj in data:
name = obj.get("first_name")
age = obj.get("age")
try:
age_next_year = age + 1
print(f"Next year {name} will be {age_next_year} years old")
except TypeError:
traceback.print_exc()
Jason is 10 None is 10 Next year Jason will be 11 years old
Traceback (most recent call last): File "/var/folders/l2/jjqj299126j0gycr9kkkt9xm0000gn/T/ipykernel_24047/2607506000.py", line 10, in <module> age_next_year = age + 1 ~~~~^~~ TypeError: can only concatenate str (not "int") to str
You see that while we were able to program with a dictionary, we had issues with the data being valid. We would have had to manually check the types of the data, and we had to manually check if the data was valid. This is a pain, and we can do better.
Pydantic to the rescue¶
Pydantic is a library that allows us to define data structures, and then validate them.
from pydantic import BaseModel, Field, ValidationError
class Person(BaseModel):
name: str
age: int
person = Person(name="Sam", age=30)
person
Person(name='Sam', age=30)
# Data is correctly casted to the right type
person = Person.model_validate({"name": "Sam", "age": "30"})
person
Person(name='Sam', age=30)
assert person.name == "Sam"
assert person.age == 30
try:
assert person.age == 20
except AssertionError:
traceback.print_exc()
Traceback (most recent call last): File "/var/folders/l2/jjqj299126j0gycr9kkkt9xm0000gn/T/ipykernel_24047/3040264600.py", line 5, in <module> assert person.age == 20 ^^^^^^^^^^^^^^^^ AssertionError
# Data is validated to get better error messages
try:
person = Person.model_validate({"first_name": "Sam", "age": "30.2"})
except ValidationError as e:
print("Validation Error:")
for error in e.errors():
print(f"Field: {error['loc'][0]}, Error: {error['msg']}")
print(f"{RED}\nOriginal Traceback Below{RESET}")
traceback.print_exc()
Validation Error:
Field: name, Error: Field required
Field: age, Error: Input should be a valid integer, unable to parse string as an integer
Original Traceback Below
Traceback (most recent call last): File "/var/folders/l2/jjqj299126j0gycr9kkkt9xm0000gn/T/ipykernel_24047/621989455.py", line 3, in <module> person = Person.model_validate({"first_name": "Sam", "age": "30.2"}) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/opt/homebrew/Caskroom/miniconda/base/envs/instructor/lib/python3.11/site-packages/pydantic/main.py", line 509, in model_validate return cls.__pydantic_validator__.validate_python( ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ pydantic_core._pydantic_core.ValidationError: 2 validation errors for Person name Field required [type=missing, input_value={'first_name': 'Sam', 'age': '30.2'}, input_type=dict] For further information visit https://errors.pydantic.dev/2.6/v/missing age Input should be a valid integer, unable to parse string as an integer [type=int_parsing, input_value='30.2', input_type=str] For further information visit https://errors.pydantic.dev/2.6/v/int_parsing
By introducing pydantic into any python codebase you can get a lot of benefits. You can get type checking, you can get validation, and you can get autocomplete. This is a huge win, because it means you can catch errors before they happen. This is even more useful when we rely on language models to generate data for us.
You can also define validators that are run on the data. This is useful because it means you can catch errors before they happen. For example, you can define a validator that checks if the age is greater than 0. This is useful because it means you can catch errors before they happen.
Fundamental problem with asking for JSON from OpenAI¶
As we shall see below, the correct json format would be something of the format below:
{
"name": "Jason",
"age": 10
}
However, we get errorenous outputs like:
{
"jason": 10
}
from openai import OpenAI
client = OpenAI()
resp = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[
{"role": "user", "content": "Please give me jason is 10 as a json object ```json\n"},
],
n=10,
temperature=1,
)
print("json that we want:")
print("""
{
"name": "Jason",
"age": 10
}
""")
for choice in resp.choices:
json = choice.message.content
try:
person = Person.model_validate_json(json)
print(f"correctly parsed {person=}")
except Exception as e:
print("error!!")
print(json)
json that we want: { "name": "Jason", "age": 10 } error!! { "jason": 10 } correctly parsed person=Person(name='Jason', age=10) correctly parsed person=Person(name='jason', age=10) error!! { "Jason": { "age": 10 } } error!! { "Jason": { "age": 10 } } error!! { "Jason": { "age": 10 } } error!! { "Jason": { "age": 10 } } correctly parsed person=Person(name='Jason', age=10) correctly parsed person=Person(name='Jason', age=10) error!! { "jason": 10 }
Introduction to Function Calling¶
The json could be anything! We could add more and more into a prompt and hope it works, or we can use something called function calling to directly specify the schema we want.
Function Calling
In an API call, you can describe functions and have the model intelligently choose to output a JSON object containing arguments to call one or many functions. The Chat Completions API does not call the function; instead, the model generates JSON that you can use to call the function in your code.
import datetime
class PersonBirthday(BaseModel):
name: str
age: int
birthday: datetime.date
schema = {
"properties": {
"name": {"type": "string"},
"age": {"type": "integer"},
"birthday": {"type": "string", "format": "YYYY-MM-DD"},
},
"required": ["name", "age"],
"type": "object",
}
resp = client.chat.completions.create(
model="gpt-3.5-turbo",
messages=[
{
"role": "user",
"content": f"Extract `Jason Liu is thirty years old his birthday is yesturday` into json today is {datetime.date.today()}",
},
],
functions=[{"name": "Person", "parameters": schema}],
function_call="auto",
)
PersonBirthday.model_validate_json(resp.choices[0].message.function_call.arguments)
PersonBirthday(name='Jason Liu', age=30, birthday=datetime.date(1994, 3, 26))
But it turns out, pydantic actually not only does our serialization, we can define the schema as well as add additional documentation!
PersonBirthday.model_json_schema()
{'properties': {'name': {'title': 'Name', 'type': 'string'}, 'age': {'title': 'Age', 'type': 'integer'}, 'birthday': {'format': 'date', 'title': 'Birthday', 'type': 'string'}}, 'required': ['name', 'age', 'birthday'], 'title': 'PersonBirthday', 'type': 'object'}
We can even define nested complex schemas, and documentation with ease.
class Address(BaseModel):
address: str = Field(description="Full street address")
city: str
state: str
class PersonAddress(Person):
"""A Person with an address"""
address: Address
PersonAddress.model_json_schema()
{'$defs': {'Address': {'properties': {'address': {'description': 'Full street address', 'title': 'Address', 'type': 'string'}, 'city': {'title': 'City', 'type': 'string'}, 'state': {'title': 'State', 'type': 'string'}}, 'required': ['address', 'city', 'state'], 'title': 'Address', 'type': 'object'}}, 'description': 'A Person with an address', 'properties': {'name': {'title': 'Name', 'type': 'string'}, 'age': {'title': 'Age', 'type': 'integer'}, 'address': {'$ref': '#/$defs/Address'}}, 'required': ['name', 'age', 'address'], 'title': 'PersonAddress', 'type': 'object'}
These simple concepts become what we built into instructor
and most of the work has been around documenting how we can leverage schema engineering. Except now we use instructor.patch()
to add a bunch more capabilities to the OpenAI SDK.
The core idea around Instructor¶
- Using function calling allows us use a llm that is finetuned to use json_schema and output json.
- Pydantic can be used to define the object, schema, and validation in one single class, allow us to encapsulate everything neatly
- As a library with 100M downloads, we can leverage pydantic to do all the heavy lifting for us and fit nicely with the python ecosystem
import instructor
import datetime
# patch the client to add `response_model` to the `create` method
client = instructor.patch(OpenAI(), mode=instructor.Mode.MD_JSON)
resp = client.chat.completions.create(
model="gpt-3.5-turbo-1106",
messages=[
{
"role": "user",
"content": f"""
Today is {datetime.date.today()}
Extract `Jason Liu is thirty years old his birthday is yesturday`
he lives at 123 Main St, San Francisco, CA""",
},
],
response_model=PersonAddress,
)
resp
PersonAddress(name='Jason Liu', age=30, address=Address(address='123 Main St', city='San Francisco', state='CA'))
By defining response_model
we can leverage pydantic to do all the heavy lifting. Later we'll introduce the other features that instructor.patch()
adds to the OpenAI SDK. but for now, this small change allows us to do a lot more with the API.
Is instructor the only way to do this?¶
No. Libraries like Marvin, Langchain, and Llamaindex all now leverage the Pydantic object in similar ways. The goal is to be as light weight as possible, get you as close as possible to the openai api, and then get out of your way.
More importantly, we've also added straight forward validation and reasking to the mix.
The goal of instructor is to show you how to think about structured prompting and provide examples and documentation that you can take with you to any framework.
For further exploration: