
2023

Advanced Caching Strategies for Python LLM Applications (Validated & Tested ✅)

Instructor makes working with language models easy, but they are still computationally expensive. Smart caching strategies can reduce costs by up to 90% while dramatically improving response times.

Update (June 2025) – Instructor now ships native caching support out-of-the-box. Pass a cache adapter directly when you create a client:

from instructor import from_provider
from instructor.cache import AutoCache, RedisCache

client = from_provider(
    "openai/gpt-4o",  # or any other provider
    cache=AutoCache(maxsize=10_000),   # in-process LRU
    # or cache=RedisCache(host="localhost")
)

Under the hood this uses the very same techniques explained below, so you can still roll your own adapter if you need a bespoke backend. The remainder of the post walks through the design rationale in detail and is fully compatible with the built-in implementation.

Built-in cache – feature matrix

| Method / helper | Cached | What is stored | Notes |
| --- | --- | --- | --- |
| create(...) | ✅ Yes | Parsed Pydantic model + raw completion JSON | |
| create_with_completion(...) | ✅ Yes | Same as above | Second tuple element restored from cache |
| create_partial(...) | ❌ No | – | Streaming generators not cached (yet) |
| create_iterable(...) | ❌ No | – | Streaming generators not cached (yet) |
| Any call with stream=True | ❌ No | – | Provider always invoked |
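
With a cache adapter configured as above, repeated identical calls are served from the cache. A quick usage sketch (the response model and prompt are made up for illustration):

from pydantic import BaseModel

class User(BaseModel):
    name: str
    age: int

# First call reaches the provider and writes the result to the cache
user = client.chat.completions.create(
    response_model=User,
    messages=[{"role": "user", "content": "Extract: Jason is 25 years old"}],
)

# An identical second call is served from the cache; no provider request is made
user_again = client.chat.completions.create(
    response_model=User,
    messages=[{"role": "user", "content": "Extract: Jason is 25 years old"}],
)
assert user == user_again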

How serialization works

  1. Model – we call model_dump_json(), which produces a compact, lossless JSON string. On a cache hit we re-hydrate with model_validate_json(), so you get back the same BaseModel subclass instance (see the round-trip sketch after this list).
  2. Raw completion – Instructor attaches the original ChatCompletion (or provider-specific) object to the model as _raw_response. We serialise this object too (when possible with model_dump_json(), otherwise a plain str() fallback) and restore it on a cache hit so create_with_completion() behaves identically.
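
As a minimal illustration of step 1, the cached value is just the model's JSON dump, and a cache hit re-creates the exact model class (the model here is hypothetical):

from pydantic import BaseModel

class UserDetail(BaseModel):
    name: str
    age: int

original = UserDetail(name="Ada", age=36)

# What gets stored: a compact, lossless JSON string
cached_payload = original.model_dump_json()

# What a cache hit returns: the same BaseModel subclass instance
restored = UserDetail.model_validate_json(cached_payload)
assert restored == original
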
Raw Response Reconstruction

For raw completion objects, we use a SimpleNamespace trick to reconstruct the original object structure:

# When caching:
raw_json = completion.model_dump_json()  # Serialize to JSON

# When restoring from cache:
import json
from types import SimpleNamespace
restored = json.loads(raw_json, object_hook=lambda d: SimpleNamespace(**d))

This approach allows us to restore the original dot-notation access patterns (e.g., completion.usage.total_tokens) without requiring the original class definitions. The SimpleNamespace objects behave identically to the original completion objects for attribute access while being much simpler to reconstruct from JSON.
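
For example, attribute access on the restored completion mirrors the original object (the field values shown are purely illustrative):

print(restored.usage.total_tokens)          # e.g. 42
print(restored.choices[0].message.content)  # the cached assistant message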

Defensive Handling

The cache implementation includes multiple fallback strategies for different provider response types:

  1. Pydantic models (OpenAI, Anthropic) - Use model_dump_json() for perfect serialization
  2. Plain dictionaries - Use standard json.dumps() with default=str fallback
  3. Unpickleable objects - Fall back to string representation with a warning

This ensures the cache works reliably across all providers, even if they don't follow the same response object patterns.
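
A minimal sketch of that fallback chain (the function name and exact structure are illustrative, not the library's internals):

import json
import warnings

def serialize_completion(completion) -> str:
    # 1. Pydantic models (OpenAI, Anthropic): lossless JSON dump
    if hasattr(completion, "model_dump_json"):
        return completion.model_dump_json()
    # 2. Plain dictionaries: standard json.dumps, stringifying anything exotic
    if isinstance(completion, dict):
        return json.dumps(completion, default=str)
    # 3. Anything else: string representation, with a warning
    warnings.warn(f"Falling back to str() for {type(completion).__name__}")
    return str(completion)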

Streaming limitations

The current implementation opts not to cache streaming helpers (create_partial, create_iterable, or stream=True). Replaying a realistic token-stream requires a dedicated design which is coming in a future release. Until then, those calls always reach the provider.

Today, we're diving deep into optimizing instructor code while maintaining the excellent developer experience offered by Pydantic models. We'll tackle the challenges of caching Pydantic models, which are typically incompatible with pickle, and explore solutions using decorators like functools.cache. Then, we'll craft production-ready custom decorators with diskcache and redis to support persistent caching, distributed systems, and high-throughput applications.
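
As a preview of where we're headed, here is a hedged sketch of a diskcache-backed decorator for functions that return a Pydantic model (the decorator name, cache location, and key scheme are all illustrative):

import functools
import inspect
import diskcache

cache = diskcache.Cache("./my_cache_directory")

def instructor_cache(func):
    """Cache a function that returns a Pydantic model, keyed by its arguments."""
    return_type = inspect.signature(func).return_annotation

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        key = f"{func.__name__}:{args!r}:{kwargs!r}"
        cached = cache.get(key)
        if cached is not None:
            # Re-hydrate the exact model class from its JSON dump
            return return_type.model_validate_json(cached)
        result = func(*args, **kwargs)
        cache.set(key, result.model_dump_json())
        return result

    return wrapper

Any extraction function annotated with its response model, such as def extract(text: str) -> UserDetail, can then be wrapped with @instructor_cache so repeated calls with the same arguments skip the LLM entirely.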

Generators and LLM Streaming

Latency is crucial, especially in e-commerce and newer chat applications like ChatGPT. Streaming lets us improve the perceived experience without actually needing faster response times.

And what makes streaming possible? Generators!
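
As a quick refresher, a generator yields values lazily, one at a time, which is exactly how streamed tokens or partial objects are consumed:

def stream_tokens():
    # Yield one piece of the response at a time instead of building it all up front
    for token in ["Hello", ",", " world", "!"]:
        yield token

for token in stream_tokens():
    print(token, end="")  # each token is printed as soon as it is produced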

Verifying LLM Citations with Pydantic

Ensuring the accuracy of information is crucial. This blog post explores how Pydantic's powerful and flexible validators can enhance data accuracy through citation verification.

We'll start with a simple substring check to verify citations. Then we'll use Instructor itself to have an LLM verify citations and align answers with the given citations. Finally, we'll explore how these techniques can be used to generate a dataset of accurate responses.
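
A hedged sketch of the first step, the substring check, using a Pydantic validator that receives the source text through validation context (the model and field names here are made up):

from pydantic import BaseModel, ValidationInfo, field_validator

class AnswerWithCitation(BaseModel):
    answer: str
    citation: str

    @field_validator("citation")
    @classmethod
    def citation_exists_in_context(cls, v: str, info: ValidationInfo) -> str:
        context = info.context or {}
        if v not in context.get("text_chunk", ""):
            raise ValueError(f"Citation '{v}' not found in the source text")
        return v

# Usage: pass the source document in as validation context
AnswerWithCitation.model_validate(
    {"answer": "Jason is 25.", "citation": "Jason is 25 years old."},
    context={"text_chunk": "Jason is 25 years old and lives in Toronto."},
)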

Smarter Summaries w/ Finetuning GPT-3.5 and Chain of Density

Discover how to distil an iterative method like Chain of Density into a single fine-tuned model using Instructor.

In this article, we'll guide you through implementing the original Chain of Density method using Instructor, then show how to distil GPT-3.5 into a model that matches GPT-4's iterative summarization capabilities. Using these methods, we were able to decrease latency by 20x, reduce costs by 50x, and maintain entity density.

By the end, you'll have a GPT-3.5 model, fine-tuned using Instructor's tooling, capable of producing summaries that rival the effectiveness of Chain of Density [Adams et al. (2023)]. As always, all the code is readily available in the examples/chain-of-density folder in our repo for your reference.

Good LLM Validation is Just Good Validation

What if your validation logic could learn and adapt like a human, but operate at the speed of software? This is the future of validation and it's already here.

Validation is the backbone of reliable software. But traditional methods are static, rule-based, and can't adapt to new challenges. This post looks at how to bring dynamic, machine learning-driven validation into your software stack using Python libraries like Pydantic and Instructor. We validate these outputs using a validation function which conforms to the structure seen below.

def validation_function(value):
    if condition(value):
        raise ValueError("Value is not valid")
    return mutation(value)
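
In practice, a function of this shape can be attached directly to a field. A minimal sketch using Pydantic's AfterValidator (the field and the check are made up):

from typing import Annotated
from pydantic import AfterValidator, BaseModel

def no_profanity(value: str) -> str:
    # Raise when the value is invalid, otherwise return the (possibly mutated) value
    if "darn" in value.lower():
        raise ValueError("Value contains profanity")
    return value.strip()

class Comment(BaseModel):
    body: Annotated[str, AfterValidator(no_profanity)]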

Enhancing Python Functions with Instructor: A Guide to Fine-Tuning and Distillation

Introduction

Get ready to dive deep into the world of fine-tuning task-specific language models with Python functions. We'll explore how instructor.instructions streamlines this process, making the task you want to distil more efficient and powerful while preserving its original functionality and backwards compatibility.

If you want to see the full example, check out examples/distillation.
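
The core pattern looks roughly like this. Treat it as a sketch: the exact class names and signatures may differ between Instructor versions, so consult the distillation docs for the current API.

import logging
import instructor
from pydantic import BaseModel

logging.basicConfig(level=logging.INFO)

# Collect a fine-tuning example for every call to a decorated function
instructions = instructor.Instructions(
    name="three_digit_multiply",
    log_handlers=[logging.FileHandler("math_finetunes.jsonl")],
)

class Multiplication(BaseModel):
    a: int
    b: int
    result: int

@instructions.distil
def fn(a: int, b: int) -> Multiplication:
    # A plain Python implementation; each call is logged as training data
    return Multiplication(a=a, b=b, result=a * b)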

With the advent of large language models (LLMs), retrieval augmented generation (RAG) has become a hot topic. However, throughout the past year of helping startups integrate LLMs into their stack, I've noticed that the pattern of taking user queries, embedding them, and directly searching a vector store is effectively demoware.

What is RAG?

Retrieval augmented generation (RAG) is a technique that uses an LLM to generate responses, but uses a search backend to augment the generation. In the past year, using text embeddings with a vector database has been the most popular approach I've seen socialized.

Figure: Simple RAG that embeds the user query and makes a search.

So let's kick things off by examining what I like to call the 'Dumb' RAG model: a basic setup that's more common than you'd think.
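
To make the pattern concrete, this is roughly what that naive pipeline looks like (the embedding model, the vector_store.search interface, and the prompt are stand-ins, not a specific library's API):

from openai import OpenAI

client = OpenAI()

def dumb_rag(query: str, vector_store) -> str:
    # 1. Embed the raw user query as-is
    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=query,
    ).data[0].embedding

    # 2. Search the vector store with that single embedding (hypothetical interface)
    chunks = vector_store.search(embedding, top_k=5)

    # 3. Stuff the retrieved chunks into the prompt and generate an answer
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{chunks}\n\nQuestion: {query}"},
        ],
    )
    return response.choices[0].message.content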