Performance Optimization

2025/01/08
in Performance Optimization, Cost Reduction, API Efficiency, Python Development
4 min read

Native Caching in Instructor v1.9.1: Zero-Configuration Performance Boost

New in v1.9.1: Instructor now ships with built-in caching support for all providers. Simply pass a cache adapter when creating your client to dramatically reduce API costs and improve response times.

Starting with Instructor v1.9.1, we've introduced native caching support that makes optimization effortless. Instead of implementing complex caching decorators or wrapper functions, you can now pass a cache adapter directly to from_provider() and automatically cache all your structured LLM calls.

The Game Changer: Built-in Caching

Before v1.9.1, caching required custom decorators and manual implementation. Now, it's as simple as:

from instructor import from_provider
from instructor.cache import AutoCache

# Works with any provider - caching flows through automatically
client = from_provider(
    "openai/gpt-4o",
    cache=AutoCache(maxsize=1000)
)

# Your normal calls are now cached automatically
from pydantic import BaseModel

class User(BaseModel):
    name: str
    age: int

first = client.create(
    messages=[{"role": "user", "content": "Extract: John is 25"}],
    response_model=User
)

second = client.create(
    messages=[{"role": "user", "content": "Extract: John is 25"}],
    response_model=User
)

# second call was served from cache - same result, zero cost!
assert first.name == second.name

Universal Provider Support

The beauty of native caching is that it works with every provider through the same simple API:

from instructor.cache import AutoCache, DiskCache

# Works with OpenAI
openai_client = from_provider("openai/gpt-3.5-turbo", cache=AutoCache())

# Works with Anthropic  
anthropic_client = from_provider("anthropic/claude-3-haiku", cache=AutoCache())

# Works with Google
google_client = from_provider("google/gemini-pro", cache=DiskCache())

# Works with any provider in the ecosystem
groq_client = from_provider("groq/llama-3.1-8b", cache=AutoCache())

No provider-specific configuration needed. The cache parameter flows through **kwargs to all underlying implementations automatically.

Built-in Cache Adapters

Instructor v1.9.1 ships with two production-ready cache implementations:

1. AutoCache - In-Process LRU Cache

Perfect for single-process applications and development:

from instructor.cache import AutoCache

# Thread-safe in-memory cache with LRU eviction
cache = AutoCache(maxsize=1000)
client = from_provider("openai/gpt-4o", cache=cache)

When to use:
- Development and testing
- Single-process applications
- When you need maximum speed (200,000x+ faster cache hits)
- Applications where cache persistence isn't required
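To make "thread-safe in-memory cache with LRU eviction" concrete, here is a minimal sketch of the idea using only the standard library. The class name `TinyLRUCache` is hypothetical and this is not Instructor's actual implementation; it only illustrates the eviction behaviour you can expect from a cache of this kind.

```python
import threading
from collections import OrderedDict
from typing import Any, Optional


class TinyLRUCache:
    """Hypothetical stand-in for an in-process LRU cache (not Instructor's code)."""

    def __init__(self, maxsize: int = 1000) -> None:
        self._maxsize = maxsize
        self._data: "OrderedDict[str, Any]" = OrderedDict()
        self._lock = threading.Lock()  # guards every read and write

    def get(self, key: str) -> Optional[Any]:
        with self._lock:
            if key not in self._data:
                return None
            self._data.move_to_end(key)  # mark as most recently used
            return self._data[key]

    def set(self, key: str, value: Any, ttl: Optional[int] = None) -> None:
        # ttl is accepted for interface parity but ignored in this sketch.
        with self._lock:
            self._data[key] = value
            self._data.move_to_end(key)
            if len(self._data) > self._maxsize:
                self._data.popitem(last=False)  # evict least recently used


cache = TinyLRUCache(maxsize=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")      # touching "a" makes "b" the eviction candidate
cache.set("c", 3)   # exceeds maxsize, so "b" is dropped
```

The practical consequence is that `maxsize` bounds memory use: once the limit is reached, the entries you have not read recently are the first to go.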

2. DiskCache - Persistent Storage

Ideal when you need cache persistence across sessions:

from instructor.cache import DiskCache

# Persistent disk-based cache
cache = DiskCache(directory=".instructor_cache")
client = from_provider("anthropic/claude-3-sonnet", cache=cache)

When to use:
- Applications that restart frequently
- Development workflows where you want to preserve cache between sessions
- When working with expensive or time-intensive API calls
- Local applications with moderate performance requirements

Smart Cache Key Generation

Instructor automatically generates intelligent cache keys that include:

Provider/model name - Different models get different cache entries
Complete message history - Full conversation context is hashed
Response model schema - Any changes to your Pydantic model automatically bust the cache
Mode configuration - JSON vs Tools mode changes are tracked

This means when you update your Pydantic model (adding fields, changing descriptions, etc.), the cache automatically invalidates old entries - no stale data!

from instructor.cache import make_cache_key

# Generate deterministic cache key
key = make_cache_key(
    messages=[{"role": "user", "content": "hello"}],
    model="gpt-3.5-turbo", 
    response_model=User,
    mode="TOOLS"
)
print(key)  # SHA-256 hash: 9b8f5e2c8c9e...
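If you want intuition for why a schema change busts the cache, the sketch below hashes the same four ingredients with the standard library. It is an illustration under assumptions, not the body of `make_cache_key`: the function name `sketch_cache_key` is hypothetical, and the schema is passed as a plain dict (in practice it could come from something like `User.model_json_schema()`).

```python
import hashlib
import json
from typing import Any, Dict, List


def sketch_cache_key(
    messages: List[Dict[str, Any]],
    model: str,
    schema: Dict[str, Any],
    mode: str,
) -> str:
    """Hash every input that should influence the cached response."""
    payload = {
        "messages": messages,
        "model": model,
        "schema": schema,
        "mode": mode,
    }
    # sort_keys gives a canonical string regardless of dict insertion order.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


messages = [{"role": "user", "content": "hello"}]
schema_v1 = {"properties": {"name": {"type": "string"}}}
schema_v2 = {"properties": {"name": {"type": "string"}, "age": {"type": "integer"}}}

key_v1 = sketch_cache_key(messages, "gpt-3.5-turbo", schema_v1, "TOOLS")
key_v1_again = sketch_cache_key(messages, "gpt-3.5-turbo", schema_v1, "TOOLS")
key_v2 = sketch_cache_key(messages, "gpt-3.5-turbo", schema_v2, "TOOLS")
```

Identical inputs always yield the same key, while adding the `age` field produces a different one, so entries written under the old schema are simply never looked up again.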

Custom Cache Implementations

Want Redis, Memcached, or a custom backend? Simply inherit from BaseCache:

from instructor.cache import BaseCache
import redis

class RedisCache(BaseCache):
    def __init__(self, host="localhost", port=6379, **kwargs):
        self.redis = redis.Redis(host=host, port=port, **kwargs)

    def get(self, key: str):
        value = self.redis.get(key)
        # Compare against None so an empty stored value still counts as a hit
        return value.decode() if value is not None else None

    def set(self, key: str, value, ttl: int | None = None):
        if ttl:
            self.redis.setex(key, ttl, value)
        else:
            self.redis.set(key, value)

# Use your custom cache
redis_cache = RedisCache(host="my-redis-server")
client = from_provider("openai/gpt-4o", cache=redis_cache)

The BaseCache interface is intentionally minimal - just implement get() and set() methods and you're ready to go.

Time-to-Live (TTL) Support

Control cache expiration with per-call TTL overrides:

# Cache this result for 1 hour (Report is any Pydantic model you define)
result = client.create(
    messages=[{"role": "user", "content": "Generate daily report"}],
    response_model=Report,
    cache_ttl=3600  # 1 hour in seconds
)

TTL support depends on your cache backend:
- AutoCache: TTL is ignored (no expiration)
- DiskCache: Full TTL support with automatic expiration
- Custom backends: Implement TTL handling in your set() method
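For custom backends, one way to handle TTL in `set()` is to store an expiry timestamp next to each value and check it on read. The sketch below assumes a plain in-memory dict; `ExpiringDictCache` and its injectable `clock` parameter are hypothetical names, and in real use the class would inherit from BaseCache.

```python
import time
from typing import Any, Callable, Dict, Optional, Tuple


class ExpiringDictCache:
    """Hypothetical backend with get()/set(); in real use it would subclass BaseCache."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._data: Dict[str, Tuple[Any, Optional[float]]] = {}

    def get(self, key: str) -> Optional[Any]:
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # lazily drop expired entries on read
            return None
        return value

    def set(self, key: str, value: Any, ttl: Optional[int] = None) -> None:
        # No ttl means the entry never expires.
        expires_at = None if ttl is None else self._clock() + ttl
        self._data[key] = (value, expires_at)


# Drive the cache with a fake clock to show expiry deterministically.
now = [0.0]
cache = ExpiringDictCache(clock=lambda: now[0])
cache.set("report", "daily numbers", ttl=3600)
fresh = cache.get("report")   # read inside the one-hour window
now[0] = 3601.0
stale = cache.get("report")   # read after the window has passed
```

Expiring lazily on read keeps the sketch short; a long-running process would probably also want periodic cleanup so that entries nobody reads again do not accumulate.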

Migration from Manual Caching

If you were using custom caching decorators, migrating is straightforward:

Before v1.9.1:

import functools

@functools.cache
def extract_user(text: str) -> User:
    return client.create(
        messages=[{"role": "user", "content": text}],
        response_model=User
    )

With v1.9.1:

# Remove decorator, add cache to client
client = from_provider("openai/gpt-4o", cache=AutoCache())

def extract_user(text: str) -> User:
    return client.create(
        messages=[{"role": "user", "content": text}],
        response_model=User
    )

No more function-level caching logic - just create your client with caching enabled and all calls benefit automatically.

Real-World Performance Impact

Native caching delivers the same dramatic performance improvements you'd expect:

AutoCache: 200,000x+ speed improvement for cache hits
DiskCache: 5-10x improvement with persistence benefits
Cost Reduction: 50-90% API cost savings depending on cache hit rate
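The link between hit rate and savings is simple arithmetic if you assume a cache hit costs nothing and every miss costs the same. The request volume and per-call price below are made-up numbers for illustration only, and `estimated_api_cost` is a hypothetical helper.

```python
def estimated_api_cost(requests: int, cost_per_call: float, hit_rate: float) -> float:
    """Cost of the calls that miss the cache, assuming hits are free."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    misses = requests * (1.0 - hit_rate)
    return misses * cost_per_call


# Hypothetical workload: 10,000 requests at $0.002 each.
baseline = estimated_api_cost(10_000, 0.002, hit_rate=0.0)
with_cache = estimated_api_cost(10_000, 0.002, hit_rate=0.7)
savings = 1.0 - with_cache / baseline  # fraction of spend avoided
```

Under these assumptions the savings fraction equals the hit rate, which is why the quoted range tracks how repetitive your prompts are.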

For a comprehensive deep-dive into caching strategies and performance analysis, check out our complete caching guide.

Getting Started

Ready to enable native caching? Here's your quick start:

Upgrade to v1.9.1+:
```
pip install "instructor>=1.9.1"
```

Choose your cache backend:

from instructor.cache import AutoCache, DiskCache

# For development/single-process
cache = AutoCache(maxsize=1000)

# For persistence
cache = DiskCache(directory=".cache")

Add cache to your client:

from instructor import from_provider

client = from_provider("your/favorite/model", cache=cache)

Use normally - caching happens automatically:

result = client.create(
    messages=[{"role": "user", "content": "your prompt"}],
    response_model=YourModel
)

Learn More

For detailed information about cache design, custom implementations, and advanced patterns, visit our Caching Concepts documentation.

The native caching feature represents our commitment to making high-performance LLM applications simple and accessible. No more complex caching logic - just fast, cost-effective structured outputs out of the box.

Have questions about native caching or want to share your use case? Join the discussion in our GitHub repository or check out the complete documentation.

2024/10/15
in API Development, Pydantic, Performance Optimization
2 min read

Introducing structured outputs with Cerebras Inference

What's Cerebras?

Cerebras offers the fastest inference on the market, 20x faster than on GPUs.

Sign up for a Cerebras Inference API key here at cloud.cerebras.ai.

Basic Usage

To get guaranteed structured outputs with Cerebras Inference, you

2023/11/26
in Performance Optimization, Cost Reduction, API Efficiency, Python Development
16 min read

Advanced Caching Strategies for Python LLM Applications (Validated & Tested ✅)

Instructor makes working with language models easy, but they are still computationally expensive. Smart caching strategies can reduce costs by up to 90% while dramatically improving response times.

Update (June 2025) – Instructor now ships native caching support out-of-the-box. Pass a cache adapter directly when you create a client:
from instructor import from_provider
from instructor.cache import AutoCache, RedisCache

client = from_provider(
    "openai/gpt-4o",  # or any other provider
    cache=AutoCache(maxsize=10_000),   # in-process LRU
    # or cache=RedisCache(host="localhost")
)
Under the hood this uses the very same techniques explained below, so you can still roll your own adapter if you need a bespoke backend. The remainder of the post walks through the design rationale in detail and is fully compatible with the built-in implementation.

Built-in cache – feature matrix

| Method / helper | Cached | What is stored | Notes |
| --- | --- | --- | --- |
| `create(...)` | ✅ Yes | Parsed Pydantic model + raw completion JSON | |
| `create_with_completion(...)` | ✅ Yes | Same as above – second tuple element restored from cache | |
| `create_partial(...)` | ❌ No | – | Streaming generators not cached (yet) |
| `create_iterable(...)` | ❌ No | – | Streaming generators not cached (yet) |
| Any call with `stream=True` | ❌ No | – | Provider always invoked |

How serialization works

Model – we call model_dump_json(), which produces a compact, lossless JSON string. On a cache hit we rehydrate with model_validate_json(), so you get an instance of the same BaseModel subclass.
Raw completion – Instructor attaches the original ChatCompletion (or provider-specific) object to the model as _raw_response. We serialise this object too (when possible with model_dump_json(), otherwise a plain str() fallback) and restore it on a cache hit so create_with_completion() behaves identically.

Raw Response Reconstruction

For raw completion objects, we use a SimpleNamespace trick to reconstruct the original object structure:

# When caching:
raw_json = completion.model_dump_json()  # Serialize to JSON

# When restoring from cache:
import json
from types import SimpleNamespace
restored = json.loads(raw_json, object_hook=lambda d: SimpleNamespace(**d))

This approach allows us to restore the original dot-notation access patterns (e.g., completion.usage.total_tokens) without requiring the original class definitions. The SimpleNamespace objects behave identically to the original completion objects for attribute access while being much simpler to reconstruct from JSON.
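Here is the same trick as a runnable round trip. The payload is a hypothetical, hand-written dict shaped loosely like a chat completion; the field names are illustrative rather than any provider's exact schema.

```python
import json
from types import SimpleNamespace

# Hypothetical payload shaped like a chat completion.
raw_json = json.dumps(
    {
        "id": "cmpl-123",
        "choices": [{"index": 0, "message": {"role": "assistant", "content": "hi"}}],
        "usage": {"prompt_tokens": 9, "completion_tokens": 3, "total_tokens": 12},
    }
)

# object_hook runs for every decoded JSON object, so nested dicts
# (including those inside lists) all become SimpleNamespace instances.
restored = json.loads(raw_json, object_hook=lambda d: SimpleNamespace(**d))

total = restored.usage.total_tokens
content = restored.choices[0].message.content
```

Note that the restored object only mirrors the data: it is not an instance of the original completion class, so methods defined on that class (for example `model_dump`) are not available on it.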

Defensive Handling

The cache implementation includes multiple fallback strategies for different provider response types:

Pydantic models (OpenAI, Anthropic) - Use model_dump_json() for perfect serialization
Plain dictionaries - Use standard json.dumps() with default=str fallback
Unpickleable objects - Fall back to string representation with a warning

This ensures the cache works reliably across all providers, even if they don't follow the same response object patterns.
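The fallback order can be sketched as a single function that tries the richest serialization first. This is an illustration of the strategy, not the library's internal code; `serialize_raw` and `FakeCompletion` are hypothetical names, and the Pydantic case is detected by duck typing so the example runs with the standard library alone.

```python
import datetime
import json
import warnings
from typing import Any


def serialize_raw(obj: Any) -> str:
    """Try the richest serialization first, then degrade gracefully."""
    dump = getattr(obj, "model_dump_json", None)
    if callable(dump):
        return dump()  # Pydantic-style objects
    if isinstance(obj, dict):
        # default=str stringifies values json cannot encode natively
        return json.dumps(obj, default=str)
    warnings.warn(f"Falling back to str() for {type(obj).__name__}")
    return str(obj)


class FakeCompletion:
    """Stand-in for a Pydantic response object (hypothetical)."""

    def model_dump_json(self) -> str:
        return '{"id": "cmpl-123"}'


from_model = serialize_raw(FakeCompletion())
from_dict = serialize_raw({"created": datetime.date(2025, 1, 8)})
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence the expected fallback warning
    from_other = serialize_raw(object())
```

The last branch is lossy by design: a string representation cannot be turned back into the original object, which is why it comes with a warning.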

Streaming limitations

The current implementation does not cache the streaming helpers (create_partial, create_iterable, or any call with stream=True). Replaying a realistic token stream requires a dedicated design, which is coming in a future release. Until then, those calls always reach the provider.
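The resulting control flow can be pictured with a small wrapper: streaming calls skip the cache entirely, while everything else follows the usual lookup-then-store path. `call_with_cache` and `fake_provider` are hypothetical names used only for this sketch, and a plain dict stands in for the cache adapter.

```python
from typing import Any, Callable, Dict, List


def call_with_cache(
    cache: Dict[str, Any],
    key: str,
    call_provider: Callable[[], Any],
    stream: bool = False,
) -> Any:
    if stream:
        return call_provider()  # streaming: never read or write the cache
    if key in cache:
        return cache[key]       # hit: provider is not invoked
    result = call_provider()
    cache[key] = result
    return result


calls: List[int] = []


def fake_provider() -> str:
    calls.append(1)  # record each time the "provider" is actually invoked
    return "response"


store: Dict[str, Any] = {}
call_with_cache(store, "k", fake_provider)               # miss: provider called
call_with_cache(store, "k", fake_provider)               # hit: provider skipped
call_with_cache(store, "k", fake_provider, stream=True)  # stream: provider called
```

Even though a cached value exists for the key, the streaming call still reaches the provider, which matches the behaviour in the feature matrix.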

Today, we're diving deep into optimizing instructor code while maintaining the excellent developer experience offered by Pydantic models. We'll tackle the challenges of caching Pydantic models, typically incompatible with pickle, and explore comprehensive solutions using decorators like functools.cache. Then, we'll craft production-ready custom decorators with diskcache and redis to support persistent caching, distributed systems, and high-throughput applications.