Skip to content

Subscribe to our Newsletter for Updates and Tips

If you want to get updates on new features and tips on how to use Instructor, you can subscribe to our newsletter below to get notified when we publish new content.

Advanced Topics

  1. Query Understanding: Beyond Embeddings
  2. Achieving GPT-4 Level Summaries with GPT-3.5-turbo
  3. Basics of Guardrails and Validation in AI Models
  4. Validating Citations in AI-Generated Content
  5. Fine-tuning and Distillation in AI Models
  6. Enhancing OpenAI Client Observability with LangSmith
  7. Logfire Integration with Pydantic

AI Development and Optimization

Language Models and Prompting Techniques

Integrations and Tools

Media and Resources

Audio Support in OpenAI's Chat Completions API

OpenAI has recently introduced audio support in their Chat Completions API, opening up exciting new possibilities for developers working with audio and text interactions. This feature is powered by the new gpt-4o-audio-preview model, which brings advanced voice capabilities to the familiar Chat Completions API interface.

Key Features

The new audio support in the Chat Completions API offers several compelling features:

  1. Flexible Input Handling: The API can now process any combination of text and audio inputs, allowing for more versatile applications.

  2. Natural, Steerable Voices: Similar to the Realtime API, developers can use prompting to shape various aspects of the generated audio, including language, pronunciation, and emotional range.

  3. Tool Calling Integration: The audio support seamlessly integrates with existing tool calling functionality, enabling complex workflows that combine audio, text, and external tools.

Practical Example

To demonstrate how to use this new functionality, let's look at a simple example using the instructor library:

"""python from openai import OpenAI from pydantic import BaseModel import instructor from instructor.multimodal import Audio import base64

client = instructor.from_openai(OpenAI())

class Person(BaseModel): name: str age: int

resp = client.chat.completions.create( model="gpt-4o-audio-preview", response_model=Person, modalities=["text"], audio={"voice": "alloy", "format": "wav"}, messages=[ { "role": "user", "content": [ "Extract the following information from the audio", Audio.from_path("./output.wav"), ], }, ], )

print(resp)

Expected output: Person(name='Jason', age=20)

"""

In this example, we're using the gpt-4o-audio-preview model to extract information from an audio file. The API processes the audio input and returns structured data (a Person object with name and age) based on the content of the audio.

Use Cases

The addition of audio support to the Chat Completions API enables a wide range of applications:

  1. Voice-based Personal Assistants: Create more natural and context-aware voice interfaces for various applications.

  2. Audio Content Analysis: Automatically extract information, sentiments, or key points from audio recordings or podcasts.

  3. Language Learning Tools: Develop interactive language learning applications that can process and respond to spoken language.

  4. Accessibility Features: Improve accessibility in applications by providing audio-based interactions and text-to-speech capabilities.

Considerations

While this new feature is exciting, it's important to note that it's best suited for asynchronous use cases that don't require extremely low latencies. For more dynamic and real-time interactions, OpenAI recommends using their Realtime API.

As with any AI-powered feature, it's crucial to consider ethical implications and potential biases in audio processing and generation. Always test thoroughly and consider the diversity of your user base when implementing these features.

Flashcard generator with Instructor + Burr

Flashcards help break down complex topics and learn anything from biology to a new language or lines for a play. This blog will show how to use LLMs to generate flashcards and kickstart your learning!

Instructor lets us get structured outputs from LLMs reliably, and Burr helps create an LLM application that's easy to understand and debug. It comes with Burr UI, a free, open-source, and local-first tool for observability, annotations, and more!

Building a Pairwise LLM Judge with Instructor and Pydantic

In this blog post, we'll explore how to create a pairwise LLM judge using Instructor and Pydantic. This judge will evaluate the relevance between a question and a piece of text, demonstrating a practical application of structured outputs in language model interactions.

Introduction

Evaluating text relevance is a common task in natural language processing and information retrieval. By leveraging large language models (LLMs) and structured outputs, we can create a system that judges the similarity or relevance between a question and a given text.

Setting Up the Environment

First, let's set up our environment with the necessary imports:

import instructor
import openai
from pydantic import BaseModel, Field

client = instructor.from_openai(openai.OpenAI())

Here, we're using the instructor library, which integrates seamlessly with OpenAI's API and Pydantic for structured outputs.

Defining the Judgment Model

We'll use Pydantic to define a Judgment model that structures the output of our LLM:

class Judgment(BaseModel):
    thought: str = Field(
        description="The step-by-step reasoning process used to analyze the question and text"
    )
    justification: str = Field(
        description="Explanation for the similarity judgment, detailing key factors that led to the conclusion"
    )
    similarity: bool = Field(
        description="Boolean judgment indicating whether the question and text are similar or relevant (True) or not (False)"
    )

This model ensures that our LLM's output is structured and includes a thought process, justification, and a boolean similarity judgment.

Creating the Judge Function

Next, we'll create a function that uses our LLM to judge the relevance between a question and a text:

def judge_relevance(question: str, text: str) -> Judgment:
    return client.chat.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": """
                    You are tasked with comparing a question and a piece of text to determine if they are relevant to each other or similar in some way. Your goal is to analyze the content, context, and potential connections between the two.

                    To determine if the question and text are relevant or similar, please follow these steps:

                    1. Carefully read and understand both the question and the text.
                    2. Identify the main topic, keywords, and concepts in the question.
                    3. Analyze the text for any mention of these topics, keywords, or concepts.
                    4. Consider any potential indirect connections or implications that might link the question and text.
                    5. Evaluate the overall context and purpose of both the question and the text.

                    As you go through this process, please use a chain of thought approach. Write out your reasoning for each step inside <thought> tags.

                    After your analysis, provide a boolean judgment on whether the question and text are similar or relevant to each other. Use "true" if they are similar or relevant, and "false" if they are not.

                    Before giving your final judgment, provide a justification for your decision. Explain the key factors that led to your conclusion.

                    Please ensure your analysis is thorough, impartial, and based on the content provided.
                """
            },
            {
                "role": "user",
                "content": """
                    Here is the question:

                    <question>
                    {{question}}
                    </question>

                    Here is the text:
                    <text>
                    {{text}}
                    </text>
                """
            },
            },
        ],
        response_model=Judgment,
        context={"question": question, "text": text},
    )

This function takes a question and a text as input, sends them to the LLM with a predefined prompt, and returns a structured Judgment object.

Testing the Judge

To test our pairwise LLM judge, we can create a set of test pairs and evaluate the judge's performance:

if __name__ == "__main__":
    test_pairs = [
        {
            "question": "What are the main causes of climate change?",
            "text": "Global warming is primarily caused by human activities, such as burning fossil fuels, deforestation, and industrial processes. These activities release greenhouse gases into the atmosphere, trapping heat and leading to a rise in global temperatures.",
            "is_similar": True,
        },
        # ... (other test pairs)
    ]

    score = 0
    for pair in test_pairs:
        result = judge_relevance(pair["question"], pair["text"])
        if result.similarity == pair["is_similar"]:
            score += 1

    print(f"Score: {score}/{len(test_pairs)}")
    # > Score 9/10

This test loop runs the judge on each pair and compares the result to a predetermined similarity value, calculating an overall score.

Conclusion

By combining Instructor, Pydantic, and OpenAI's language models, we've created a powerful tool for judging text relevance. This approach demonstrates the flexibility and power of structured outputs in LLM applications.

The pairwise LLM judge we've built can be used in various scenarios, such as:

  1. Improving search relevance in information retrieval systems
  2. Evaluating the quality of question-answering systems
  3. Assisting in content recommendation algorithms
  4. Automating parts of the content moderation process

As you explore this technique, consider how you might extend or adapt it for your specific use cases. The combination of structured outputs and large language models opens up a world of possibilities for creating intelligent, interpretable AI systems.

Introducing structured outputs with Cerebras Inference

What's Cerebras?

Cerebras offers the fastest inference on the market, 20x faster than on GPUs.

Sign up for a Cerebras Inference API key here at cloud.cerebras.ai.

Basic Usage

To get guaranteed structured outputs with Cerebras Inference, you

  1. Create a new Instructor client with the from_cerebras method
  2. Define a Pydantic model to pass into the response_model parameter
  3. Get back a validated response exactly as you would expect

You'll also need to install the Cerebras SDK to use the client. You can install it with the command below.

OpenAI API Model Distillation with Instructor

OpenAI has recently introduced a new feature called API Model Distillation, which allows developers to create custom models tailored to their specific use cases. This feature is particularly powerful when combined with Instructor's structured output capabilities. In this post, we'll explore how to leverage API Model Distillation with Instructor to create more efficient and specialized models.

Bad Schemas could break your LLM Structured Outputs

You might be leaving up to 60% performance gains on the table with the wrong response model. Response Models impact model performance massively with Claude and GPT-4o, irregardless of you’re using JSON mode or Tool Calling.

Using the right response model can help ensure your models respond in the right language or prevent hallucinations when extracting video timestamps.

We decided to investigate this by benchmarking Claude and GPT-4o on the GSM8k dataset and found that

  1. Field Naming drastically impacts performance - Changing a single field name from final_choice to answer improved model accuracy from 4.5% to 95%. The way we structure and name fields in our response models can fundamentally alter how the model interprets and responds to queries.
  2. Chain Of Thought significantly boosts performance - Adding a reasoning field increased model accuracy by 60% on the GSM8k dataset. Models perform significantly better when they explain their logic step-by-step.
  3. Be careful with JSON mode - JSON mode exhibited 50% more performance variation than Tool Calling when renaming fields. Different response models showed varying levels of performance between JSON mode and Tool Calling, indicating that JSON mode requires more careful optimisation.

Ensuring Consistent Timestamp Formats with Language Models

Gemini can Understand timestamps in language model outputs, but they can be inconsistent. Video content timestamps vary between HH:MM:SS and MM:SS formats, causing parsing errors and calculations. This post presents a technique to handle timestamps for clips and films without formatting issues.

We combine Pydantic's data validation with custom parsing for consistent timestamp handling. You'll learn to process timestamps in any format, reducing errors in video content workflows. Kinda like how we ensured matching language in multilingal summarization by adding a simple field.

The post provides a solution using Pydantic to improve timestamp handling in language model projects. This method addresses format inconsistencies and enables timestamp processing.

Instructor Proposal: Integrating Jinja Templating

As the creator of Instructor, I've always aimed to keep our product development streamlined and avoid unnecessary complexity. However, I'm now convinced that it's time to incorporate better templating into our data structure, specifically by integrating Jinja.

This decision serves multiple purposes:

  1. It addresses the growing complexity in my prompt formatting needs
  2. It allows us to differentiate ourselves from the standard library while adding proven utility.
  3. It aligns with the practices I've consistently employed in both production and client code.
  4. It provides an opportunity to introduce API changes that have been tested in private versions of Instructor.

Why Jinja is the Right Choice

  1. Formatting Capabilities
  2. Prompt formatting complexity has increased.
  3. List iteration and conditional implementation are necessary for formatting.
  4. This improves chunk generation, few shots, and dynamic rules.

  5. Validation

  6. Jinja template variables serve rendering and validation purposes.
  7. Pydantic's validation context allows access to template variables in validation functions.

  8. Versioning and Logging

  9. Render variable separation enhances prompt versioning and logging.
  10. Template variable diffing simplifies prompt change comparisons.

By integrating Jinja into Instructor, we're not just adding a feature; we're enhancing our ability to handle complex formatting, improve validation processes, and streamline our versioning and logging capabilities. This addition will significantly boost the power and flexibility of Instructor, making it an even more robust tool for our users.

Enhancing Formatting Capabilities

In Instructor, we propose implementing a new context keyword in our create methods. This addition will allow users to render the prompt using a provided context, leveraging Jinja's templating capabilities. Here's how it would work:

  1. Users pass a context dictionary to the create method.
  2. The prompt template, written in Jinja syntax, is defined in the content field of the message.
  3. Instructor renders the prompt using the provided context, filling in the template variables.

This approach offers these benefits:

  • Separation of prompt structure and dynamic content
  • Management of complex prompts with conditionals and loops
  • Reusability of prompt templates across different contexts

Let's look at an example to illustrate this feature:

client.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user", 
            "content": """
                You are a {{ role }} tasks with the following question 

                <question>
                {{ question }}
                </question>

                Use the following context to answer the question, make sure to return [id] for every citation:

                <context>
                {% for chunk in context %}
                  <context_chunk>
                    <id>{{ chunk.id }}</id>
                    <text>{{ chunk.text }}</text>
                  </context_chunk>
                {% endfor %}
                </context>

                {% if rules %}
                Make sure to follow these rules:

                {% for rule in rules %}
                  * {{ rule }}
                {% endfor %}
                {% endif %}
            """
        },
    ],
    context={
        "role": "professional educator", 
        "question": "What is the capital of France?", 
        "context": [
            {"id": 1, "text": "Paris is the capital of France."}, 
            {"id": 2, "text": "France is a country in Europe."}
        ], 
        "rules": ["Use markdown."]
    }
)

Validation

Let's consider a scenario where we redact words from text. By using ValidationInfo to access context and passing it to the validator and template, we can implement a system for handling sensitive information. This approach allows us to:

  1. Validate input to ensure it doesn't contain banned words.
  2. Redact patterns using regular expressions.
  3. Provide instructions to the language model about word usage restrictions.

Here's an example demonstrating this concept using Pydantic validators:

from pydantic import BaseModel, ValidationInfo, field_validator

class Response(BaseModel):
    text: str

    @field_validator('text')
    @classmethod
    def no_banned_words(cls, v: str, info: ValidationInfo):
        context = info.context
        if context:
            banned_words = context.get('banned_words', set())
            banned_words_found = [word for word in banned_words if word.lower() in v.lower()]
            if banned_words_found:
                raise ValueError(f"Banned words found in text: {', '.join(banned_words_found)}, rewrite it but just without the banned words")
        return v

    @field_validator('text')
    @classmethod
    def redact_regex(cls, v: str, info: ValidationInfo):
        context = info.context
        if context:
            redact_patterns = context.get('redact_patterns', [])
            for pattern in redact_patterns:
                v = re.sub(pattern, '****', v)
        return v

response = client.create(
    model="gpt-4o",
    response_model=Response,
    messages=[
        {
            "role": "user", 
            "content": """
                Write about a {{ topic }}

                {% if banned_words %}
                You must not use the following banned words:

                <banned_words>
                {% for word in banned_words %}
                * {{ word }}
                {% endfor %}
                </banned_words>
                {% endif %}
              """
        },
    ],
    context={
        "topic": "jason and now his phone number is 123-456-7890"
        "banned_words": ["jason"],
        "redact_patterns": [
            r"\b\d{3}[-.]?\d{3}[-.]?\d{4}\b",  # Phone number pattern
            r"\b\d{3}-\d{2}-\d{4}\b",          # SSN pattern
        ],
    },
    max_retries=3,
)

print(response.text)
# > While i can't say his name anymore, his phone number is ****

Better Versioning and Logging

With the separation of prompt templates and variables, we gain several advantages:

  1. Version Control: We can now version the templates and retrieve the appropriate one for a given prompt. This allows for better management of template history, diffing and comparison.

  2. Enhanced Logging: The separation facilitates structured logging, enabling easier debugging and integration with various logging sinks, databases, and observability tools like OpenTelemetry.

  3. Security: Sensitive information in variables can be handled separately from the templates, allowing for better access control and data protection.

This separation of concerns adheres to best practices in software design, resulting in a more maintainable, scalable, and robust system for managing prompts and their associated data.

Side effect of Context also being Pydantic Models

Since they are just python objects we can use Pydantic models to validate the context and also control how they are rendered, so even secret information can be dynamically rendered! Consider using secret string to pass in sensitive information to the llm.

from pydantic import BaseModel, SecretStr

class UserContext(BaseModel):
    name: str
    address: SecretStr

class Address(BaseModel):
    street: SecretStr
    city: str
    state: str
    zipcode: str

def normalize_address(address: Address):
  context = UserContext(username="scolvin", address=address)
  address = client.create(
      model="gpt-4o",
      messages=[
          {
              "role": "user", 
              "content": "{{ user.name }} is `{{ user.address.get_secret_value() }}`, normalize it to an address object"
          },
      ],
      context={"user": context},
  )
  print(context)
  # > UserContext(username='jliu', address="******")
  print(address)
  # > Address(street='******', city="Toronto", state="Ontario", zipcode="M5A 0J3")
  logger.info(f"Normalized address: {address}", extra={"user_context": context, "address": address})
  return address

This approach offers several advantages:

  1. Secure logging: You can confidently log your template variables without risking the exposure of sensitive information.
  2. Type safety: Pydantic models provide type checking and validation, reducing the risk of errors.
  3. Flexibility: You can easily control how different types of data are displayed or used in templates.

Why should I use prompt caching?

Developers often face two key challenges when working with large context - Slow response times and high costs. This is especially true when we're making multiple of these calls over time, severely impacting the cost and latency of our applications. With Anthropic's new prompt caching feature, we can easily solve both of these issues.

Since the new feature is still in beta, we're going to wait for it to be generally avaliable before we integrate it into instructor. In the meantime, we've put together a quickstart guide on how to use the feature in your own applications.