
Structured outputs with llama-cpp-python, a complete guide with Instructor

This guide demonstrates how to use llama-cpp-python with Instructor to generate structured outputs. You'll learn how to use JSON schema mode and speculative decoding to create type-safe responses from local LLMs.

Open-source LLMs are gaining popularity, and llama-cpp-python makes it possible to obtain structured outputs from llama.cpp models using JSON schema, via a mixture of constrained sampling and speculative decoding.

llama-cpp-python also provides an OpenAI-compatible client, which can be used to obtain structured output entirely in process, avoiding any network dependency.
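
For example, the OpenAI-compatible client can be called directly, with no server to start. A minimal sketch (the model path is a placeholder for any local GGUF file):

import llama_cpp

# Runs in process: no server to start, no network calls to make.
llama = llama_cpp.Llama(
    model_path="path/to/model.gguf",  # placeholder: any local GGUF model
    chat_format="chatml",
    verbose=False,
)

# The *_openai_v1 method returns OpenAI-style response objects.
response = llama.create_chat_completion_openai_v1(
    messages=[{"role": "user", "content": "Say hello"}],
)
print(response.choices[0].message.content)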

Patching

Instructor's patch enhances the create call with the following features:

  • response_model in create calls that returns a Pydantic model
  • max_retries in create calls that retries the call with a backoff strategy if it fails (see the sketch after this list)
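
As a rough sketch of how max_retries interacts with validation (assuming a create patched as in the example further down; Answer and its validator are hypothetical):

from pydantic import BaseModel, field_validator


class Answer(BaseModel):  # hypothetical model for illustration
    value: int

    @field_validator("value")
    @classmethod
    def must_be_positive(cls, v: int) -> int:
        if v <= 0:
            raise ValueError("value must be positive")
        return v


# If validation fails, Instructor re-asks the model with the validation
# error appended, up to max_retries attempts.
answer = create(
    messages=[{"role": "user", "content": "What is 2 + 2?"}],
    response_model=Answer,
    max_retries=3,
)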

Learn More

To learn more, please refer to the docs. To understand the benefits of using Pydantic with Instructor, visit the tips and tricks section of the why use Pydantic page. If you want to check out examples of using Pydantic with Instructor, visit the examples page.

llama-cpp-python

Recently, llama-cpp-python added support for structured outputs via JSON schema mode. This is a time-saving alternative to extensive prompt engineering.
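
This support corresponds to llama-cpp-python's response_format parameter, which constrains sampling so the output matches a JSON schema. A minimal sketch of using it directly, without Instructor (the model path is a placeholder):

import llama_cpp

llama = llama_cpp.Llama(
    model_path="path/to/model.gguf",  # placeholder: any local GGUF model
    chat_format="chatml",
    verbose=False,
)

# Constrained sampling: the completion is guaranteed to match the schema.
completion = llama.create_chat_completion(
    messages=[{"role": "user", "content": "Extract `Jason is 30 years old`"}],
    response_format={
        "type": "json_object",
        "schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "age": {"type": "integer"},
            },
            "required": ["name", "age"],
        },
    },
)
print(completion["choices"][0]["message"]["content"])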

In this example we'll cover a more advanced use case of JSON_SCHEMA mode combined with speculative decoding, then stream out partial models. To learn more about partial streaming, check out the partial streaming docs.

import llama_cpp
import instructor
from llama_cpp.llama_speculative import LlamaPromptLookupDecoding
from pydantic import BaseModel


llama = llama_cpp.Llama(
    model_path="../../models/OpenHermes-2.5-Mistral-7B-GGUF/openhermes-2.5-mistral-7b.Q4_K_M.gguf",
    n_gpu_layers=-1,  # offload all layers to the GPU
    chat_format="chatml",
    n_ctx=2048,  # context window size
    draft_model=LlamaPromptLookupDecoding(num_pred_tokens=2),  # speculative decoding via prompt lookup
    logits_all=True,
    verbose=False,
)


create = instructor.patch(
    create=llama.create_chat_completion_openai_v1,
    mode=instructor.Mode.JSON_SCHEMA,  # constrain sampling to the response model's JSON schema
)


class UserDetail(BaseModel):
    name: str
    age: int


user = create(
    messages=[
        {
            "role": "user",
            "content": "Extract `Jason is 30 years old`",
        }
    ],
    response_model=UserDetail,
)

print(user)
#> name='Jason' age=30
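
The same patched create can stream out a partial model as tokens arrive. A sketch using Instructor's Partial wrapper (assuming the create and UserDetail defined above; the exact sequence of partial values depends on the model):

from instructor import Partial

# Each iteration yields a UserDetail whose fields fill in as tokens stream,
# e.g. name=None age=None -> name='Jason' age=None -> name='Jason' age=30
for partial_user in create(
    messages=[
        {
            "role": "user",
            "content": "Extract `Jason is 30 years old`",
        }
    ],
    response_model=Partial[UserDetail],
    stream=True,
):
    print(partial_user)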