Analyzing YouTube Transcripts with Instructor

Extracting Chapter Information

Code Snippets

As always, the code is available for reference in the run.py file in the examples/youtube folder of our repo.

In this post, we'll show you how to summarise YouTube video transcripts into distinct chapters using Instructor, then explore some ways you can adapt the code to different applications.

By the end of this article, you'll be able to build an application as per the video below.

Why Instructor is the best way to get JSON from LLMs

Large Language Models (LLMs) like GPT are incredibly powerful, but getting them to return well-formatted JSON can be challenging. This is where the Instructor library shines. Instructor allows you to easily map LLM outputs to JSON data using Python type annotations and Pydantic models.

Instructor makes it easy to get structured data like JSON from LLMs like GPT-3.5, GPT-4, GPT-4-Vision, and open-source models including Mistral/Mixtral, Anyscale, Ollama, and llama-cpp-python.

It stands out for its simplicity, transparency, and user-centric design, built on top of Pydantic. Instructor helps you manage validation context, retries with Tenacity, and streaming Lists and Partial responses.

The Simple Patch for JSON LLM Outputs

Instructor works as a lightweight patch over the OpenAI Python SDK. To use it, you simply apply the patch to your OpenAI client:

import instructor
import openai

client = instructor.from_openai(openai.OpenAI())

Then, you can pass a response_model parameter to the completions.create or chat.completions.create methods. This parameter takes a Pydantic model class that defines the JSON structure you want the LLM output mapped to, much like response_model in FastAPI.

Here's an example of a response_model for a simple user profile:

from pydantic import BaseModel

class User(BaseModel):
    name: str
    age: int
    email: str

client = instructor.from_openai(openai.OpenAI())

user = client.chat.completions.create(
    model="gpt-3.5-turbo",
    response_model=User,
    messages=[
        {
            "role": "user",
            "content": "Extract the user's name, age, and email from this: John Doe is 25 years old. His email is [email protected]"
        }
    ]
)

print(user.model_dump())
# > { 
#     "name": "John Doe",
#     "age": 25,
#     "email": "[email protected]"
#   }

Instructor extracts the JSON data from the LLM output and returns an instance of your specified Pydantic model. You can then use the model_dump() method to convert the model instance to a Python dictionary, or model_dump_json() to serialize it to a JSON string.
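
Conceptually, the last step resembles validating a JSON string against the model. The sketch below uses only Pydantic, with a hard-coded raw_llm_output string standing in for a model response (an assumption for illustration, not Instructor's internal code path). It also shows the difference between model_dump() and model_dump_json().

```python
import json

from pydantic import BaseModel


class User(BaseModel):
    name: str
    age: int
    email: str


# Hypothetical raw text standing in for an LLM response.
raw_llm_output = '{"name": "John Doe", "age": 25, "email": "john@example.com"}'

# Parse and validate the JSON string in one step.
user = User.model_validate_json(raw_llm_output)

as_dict = user.model_dump()       # plain Python dict
as_json = user.model_dump_json()  # JSON-encoded string

print(type(as_dict).__name__, type(as_json).__name__)
print(json.loads(as_json) == as_dict)
```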

Some key benefits of Instructor:

  • Zero new syntax to learn - it builds on standard Python type hints
  • Seamless integration with existing OpenAI SDK code
  • Incremental, zero-overhead adoption path
  • Direct access to the messages parameter for flexible prompt engineering
  • Broad compatibility with any OpenAI SDK-compatible platform or provider

Pydantic: More Powerful than Plain Dictionaries

You might be wondering, why use Pydantic models instead of just returning a dictionary of key-value pairs? While a dictionary could hold JSON data, Pydantic models provide several powerful advantages:

  1. Type validation: Pydantic models enforce the types of the fields. If the LLM returns an incorrect type (e.g. a string for an int field), it will raise a validation error.

  2. Field requirements: You can mark fields as required or optional. Pydantic will raise an error if a required field is missing.

  3. Default values: You can specify default values for fields that aren't always present.

  4. Advanced types: Pydantic supports more advanced field types like dates, UUIDs, URLs, lists, nested models, and more.

  5. Serialization: Pydantic models can be easily serialized to JSON, which is helpful for saving results or passing them to other systems.

  6. IDE support: Because Pydantic models are defined as classes, IDEs can provide autocompletion, type checking, and other helpful features when working with the JSON data.

So while dictionaries can work for very simple JSON structures, Pydantic models are far more powerful for working with complex, validated JSON in a maintainable way.
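
The first three points can be seen in a few lines. This is a minimal sketch using plain Pydantic; the nickname field and the payloads are made up for illustration.

```python
from typing import Optional

from pydantic import BaseModel, ValidationError


class User(BaseModel):
    name: str                       # required
    age: int                        # required, must be parseable as an int
    nickname: Optional[str] = None  # optional, with a default value


# A well-formed payload validates and the default is filled in.
ok = User.model_validate({"name": "John Doe", "age": 25})
print(ok.nickname)

# A wrong type and a missing required field are both reported at once.
failed_fields = []
try:
    User.model_validate({"age": "twenty-five"})
except ValidationError as exc:
    failed_fields = sorted(err["loc"][0] for err in exc.errors())

print(failed_fields)
```

Note that in its default (lax) mode Pydantic coerces a numeric string such as "25" into an int, so only values that cannot be parsed as integers fail the age check.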

JSON from LLMs Made Easy

Instructor and Pydantic together provide a fantastic way to extract and work with JSON data from LLMs. The lightweight patching of Instructor combined with the powerful validation and typing of Pydantic models makes it easy to integrate JSON outputs into your LLM-powered applications. Give Instructor a try and see how much easier it makes getting JSON from LLMs!

Enhancing RAG with Time Filters Using Instructor

Retrieval-augmented generation (RAG) systems often need to handle queries with time-based constraints, like "What new features were released last quarter?" or "Show me support tickets from the past week." Effective time filtering is crucial for providing accurate, relevant responses.

Instructor is a Python library that simplifies integrating large language models (LLMs) with data sources and APIs. It allows defining structured output models using Pydantic, which can be used as prompts or to parse LLM outputs.

Modeling Time Filters

To handle time filters, we can define a Pydantic model representing a time range:

from datetime import datetime
from typing import Optional
from pydantic import BaseModel

class TimeFilter(BaseModel):
    start_date: Optional[datetime] = None
    end_date: Optional[datetime] = None

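The TimeFilter model stores an absolute date range. A relative phrase like "last week" or "previous month" is resolved by the LLM into concrete start and end dates, and either bound can be left as None for an open-ended range.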

We can then combine this with a search query string:

class SearchQuery(BaseModel):
    query: str
    time_filter: TimeFilter
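
To see what these models accept before involving an LLM, here is a small sketch that validates a hand-written payload. The payload is hypothetical. It shows that ISO 8601 strings are parsed into datetime objects and that a missing bound stays None.

```python
from datetime import datetime
from typing import Optional

from pydantic import BaseModel


class TimeFilter(BaseModel):
    start_date: Optional[datetime] = None
    end_date: Optional[datetime] = None


class SearchQuery(BaseModel):
    query: str
    time_filter: TimeFilter


# Hypothetical payload with an open-ended range ("since 10 February").
payload = {
    "query": "support tickets",
    "time_filter": {"start_date": "2024-02-10T00:00:00", "end_date": None},
}
search = SearchQuery.model_validate(payload)

print(search.time_filter.start_date.year)
print(search.time_filter.end_date)
```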

Prompting the LLM

Using Instructor, we can prompt the LLM to generate a SearchQuery object based on the user's query:

import instructor
from openai import OpenAI

client = instructor.from_openai(OpenAI())

response = client.chat.completions.create(
    model="gpt-4o",
    response_model=SearchQuery,
    messages=[
        {
            "role": "system",
            "content": "You are a query generator for customer support tickets. The current date is 2024-02-17",
        },
        {
            "role": "user", 
            "content": "Show me customer support tickets opened in the past week."
        },
    ],
)

Serialized to JSON, the resulting SearchQuery looks like this:

{
    "query": "Show me customer support tickets opened in the past week.",
    "time_filter": {
        "start_date": "2024-02-10T00:00:00",
        "end_date": "2024-02-17T00:00:00"
    }
}

Nuances in dates and timezones

When working with time-based queries, it's important to consider the nuances of dates, timezones, and publication times. Depending on the data source, the user's location, and when the content was originally published, the definition of "past week" or "last month" may vary.

To handle this, you'll want to design your TimeFilter model to intelligently reason about these relative time periods. This could involve:

  • Defaulting to the user's local timezone if available, or using a consistent default like UTC
  • Defining clear rules for how to calculate the start and end of relative periods like "week" or "month"
      • e.g. does "past week" mean the last 7 days or the previous Sunday-Saturday range?
  • Allowing for flexibility in how users specify dates (exact datetimes, just dates, natural language phrases)
  • Validating and normalizing user input to fit the expected TimeFilter format
  • Considering the original publication timestamp of the content, not just the current date
      • e.g. "articles published in the last month" should look at the publish date, not the query date

By building this logic into the TimeFilter model, you can abstract away the complexity and provide a consistent interface for the rest of your RAG system to work with standardized absolute datetime ranges.
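
The ambiguity around "past week" is easy to make concrete. The sketch below uses only the standard library and UTC; the two helper functions are hypothetical names, and each encodes one possible rule rather than a recommended one.

```python
from datetime import datetime, timedelta, timezone


def last_7_days(now: datetime) -> tuple[datetime, datetime]:
    """Rolling window: the 7 * 24 hours ending at `now`."""
    return now - timedelta(days=7), now


def previous_calendar_week(now: datetime) -> tuple[datetime, datetime]:
    """Previous Sunday 00:00 up to (not including) the latest Sunday 00:00."""
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    # weekday(): Monday=0 ... Sunday=6, so this counts days since the latest Sunday.
    days_since_sunday = (midnight.weekday() + 1) % 7
    this_sunday = midnight - timedelta(days=days_since_sunday)
    return this_sunday - timedelta(days=7), this_sunday


now = datetime(2024, 2, 17, 15, 30, tzinfo=timezone.utc)  # a Saturday

print(last_7_days(now))
print(previous_calendar_week(now))
```

For the same "now", the rolling rule starts on 10 February at 15:30 while the calendar rule covers 4 February up to 11 February, so the choice of rule changes which documents are retrieved.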

Of course, there may be edge cases or ambiguities that are hard to resolve programmatically. In these situations, you may need to prompt the user for clarification or make a best guess based on the available information. The key is to strive for a balance of flexibility and consistency in how you handle time-based queries, factoring in publication dates when relevant.

By modeling time filters with Pydantic and leveraging Instructor, RAG systems can effectively handle time-based queries. Clear prompts, careful model design, and appropriate parsing strategies enable accurate retrieval of information within specific time frames, enhancing the system's overall relevance and accuracy.
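
One way to build some of these rules into the model is with Pydantic validators. The following is a sketch under stated assumptions, not a definitive implementation: naive datetimes are treated as already being in UTC, and a range whose start is after its end is rejected.

```python
from datetime import datetime, timezone
from typing import Optional

from pydantic import BaseModel, ValidationError, field_validator, model_validator


class TimeFilter(BaseModel):
    start_date: Optional[datetime] = None
    end_date: Optional[datetime] = None

    @field_validator("start_date", "end_date")
    @classmethod
    def normalize_to_utc(cls, value: Optional[datetime]) -> Optional[datetime]:
        if value is None:
            return None
        if value.tzinfo is None:
            # Assumption: naive datetimes are already expressed in UTC.
            return value.replace(tzinfo=timezone.utc)
        return value.astimezone(timezone.utc)

    @model_validator(mode="after")
    def check_order(self) -> "TimeFilter":
        if self.start_date and self.end_date and self.start_date > self.end_date:
            raise ValueError("start_date must not be after end_date")
        return self


# A naive start and an offset-aware end are both normalized to UTC.
tf = TimeFilter(
    start_date="2024-02-10T00:00:00",
    end_date="2024-02-17T02:00:00+02:00",
)
print(tf.start_date.isoformat(), tf.end_date.isoformat())

# A reversed range fails validation.
try:
    TimeFilter(start_date="2024-02-17T00:00:00", end_date="2024-02-10T00:00:00")
    reversed_rejected = False
except ValidationError:
    reversed_rejected = True

print(reversed_rejected)
```

Because the checks live in the model, a validation error raised here can be surfaced to whatever retry or clarification step the surrounding system uses.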

Why Logfire is a perfect fit for FastAPI + Instructor

Logfire is a new tool that provides key insight into your application with OpenTelemetry. Instead of using ad-hoc print statements, Logfire helps to profile every part of your application and is integrated directly into Pydantic and FastAPI, two popular libraries amongst Instructor users.

In short, this is the secret sauce to help you get your application to the finish line and beyond. We'll show you how to easily integrate Logfire into FastAPI, one of the most popular choices amongst users of Instructor, using three examples:

  1. Data Extraction from a single User Query
  2. Using asyncio to process multiple users in parallel
  3. Streaming multiple objects using an Iterable so that they're available on demand

Logfire

Introduction

Logfire is a new observability platform from the creators of Pydantic. It integrates almost seamlessly with many of your favourite libraries such as Pydantic, HTTPX and Instructor. In this article, we'll show you how to use Logfire with Instructor to gain visibility into the performance of your entire application.

We'll walk through the following examples:

  1. Classifying scam emails using Instructor
  2. Performing simple validation using the llm_validator
  3. Extracting data into a markdown table from an infographic with GPT-4V

Announcing instructor=1.0.0

Over the past 10 months, we've built up instructor with the principle of 'easy to try, and easy to delete'. We accomplished this by patching the openai client with the instructor package and adding new arguments like response_model, max_retries, and validation_context. As a result, I truly believe instructor is the best way to get structured data out of LLM APIs.

However, this approach left us a bit stuck on getting typing to work well while giving you more control at development time. I'm excited to launch version 1.0.0, which cleans up the API with respect to typing without compromising ease of use.

Matching Language in Multilingual Summarization Tasks

When asking language models to summarize text, there's a risk that the generated summary ends up in English, even if the source text is in another language. This is likely due to the instructions being provided in English, biasing the model towards English output.

In this post, we explore techniques to ensure the language of the generated summary matches the language of the source text. We leverage Pydantic for data validation and the langdetect library for language identification.
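
The core idea can be sketched as a Pydantic validator that compares the detected language of the summary with that of the source. To keep the sketch runnable without extra dependencies, detect_language below is a toy stopword counter standing in for langdetect's detect function. It is not a real language detector, and the Summary model and context key are hypothetical names.

```python
from pydantic import BaseModel, ValidationError, ValidationInfo, field_validator

# Toy stand-in for langdetect.detect so the sketch has no extra dependencies.
# It only counts a handful of stopwords and is NOT a real language detector.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "in"},
    "fr": {"le", "la", "et", "est", "les", "des"},
}


def detect_language(text: str) -> str:
    words = set(text.lower().split())
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))


class Summary(BaseModel):
    summary: str

    @field_validator("summary")
    @classmethod
    def language_matches_source(cls, value: str, info: ValidationInfo) -> str:
        # The expected language is passed in through the validation context.
        expected = (info.context or {}).get("source_language")
        if expected and detect_language(value) != expected:
            raise ValueError(f"summary should be written in {expected!r}")
        return value


source = "Le chat est sur la table et les enfants jouent"
context = {"source_language": detect_language(source)}

# A French summary of a French source passes.
ok = Summary.model_validate({"summary": "Le chat est sur la table"}, context=context)

# An English summary of the same source is rejected.
try:
    Summary.model_validate({"summary": "The cat is on the table"}, context=context)
    mismatch_rejected = False
except ValidationError:
    mismatch_rejected = True

print(context["source_language"], mismatch_rejected)
```

In a real pipeline the toy detector would be swapped for a proper one, and the validation error could drive a retry that asks the model to answer in the source language.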

Announcing Anthropic Support

A special shoutout to Shreya for her contributions to the Anthropic support. As of now, all features are operational with the exception of streaming support.

For those eager to experiment, simply patch the client with the ANTHROPIC_JSON mode, which will enable you to leverage the Anthropic client for making requests.

pip install instructor[anthropic]

Missing Features

We know that partial streaming and better re-asking support for XML are still missing. We are working on both and will have them soon.

from pydantic import BaseModel
from typing import List
import anthropic
import instructor

# Wrap the Anthropic client with Instructor so that it accepts response_model
anthropic_client = instructor.from_anthropic(
    anthropic.Anthropic(),
    mode=instructor.Mode.ANTHROPIC_JSON,
)

class Properties(BaseModel):
    name: str
    value: str

class User(BaseModel):
    name: str
    age: int
    properties: List[Properties]

user_response = anthropic_client.messages.create(
    model="claude-3-haiku-20240307",
    max_tokens=1024,
    max_retries=0,
    messages=[
        {
            "role": "user",
            "content": "Create a user for a model with a name, age, and properties.",
        }
    ],
    response_model=User,
)

print(user_response.model_dump_json(indent=2))
"""
{
  "name": "John",
  "age": 25,
  "properties": [
    {
      "name": "favorite_color",
      "value": "blue"
    }
  ]
}
"""

We're encountering challenges with deeply nested types and eagerly invite the community to test, provide feedback, and suggest necessary improvements as we enhance the anthropic client's support.

Simple Synthetic Data Generation

One thing people have been using Instructor for is generating synthetic data rather than extracting it. We can even use the JSON schema extra fields to give specific examples that control how the data is generated.

Consider the example below. We'll likely generate very simple names.

from typing import Iterable
from pydantic import BaseModel
import instructor
from openai import OpenAI


# Define the UserDetail model
class UserDetail(BaseModel):
    name: str
    age: int


# Patch the OpenAI client to enable the response_model functionality
client = instructor.from_openai(OpenAI())


def generate_fake_users(count: int) -> Iterable[UserDetail]:
    return client.chat.completions.create(
        model="gpt-3.5-turbo",
        response_model=Iterable[UserDetail],
        messages=[
            {"role": "user", "content": f"Generate {count} synthetic users"},
        ],
    )


for user in generate_fake_users(5):
    print(user)
    """
    name='Alice' age=25
    name='Bob' age=30
    name='Charlie' age=35
    name='David' age=40
    name='Eve' age=45
    """

Leveraging Simple Examples

We might want to set examples as part of the prompt by leveraging Pydantic's configuration. We can set examples directly in the JSON schema itself.

from typing import Iterable
from pydantic import BaseModel, Field
import instructor
from openai import OpenAI


# Define the UserDetail model
class UserDetail(BaseModel):
    name: str = Field(examples=["Timothee Chalamet", "Zendaya"])
    age: int


# Patch the OpenAI client to enable the response_model functionality
client = instructor.from_openai(OpenAI())


def generate_fake_users(count: int) -> Iterable[UserDetail]:
    return client.chat.completions.create(
        model="gpt-3.5-turbo",
        response_model=Iterable[UserDetail],
        messages=[
            {"role": "user", "content": f"Generate {count} synthetic users"},
        ],
    )


for user in generate_fake_users(5):
    print(user)
    """
    name='Timothee Chalamet' age=25
    name='Zendaya' age=24
    name='Keanu Reeves' age=56
    name='Scarlett Johansson' age=36
    name='Chris Hemsworth' age=37
    """

By incorporating names of celebrities as examples, we have shifted towards generating synthetic data featuring well-known personalities, moving away from the simplistic, single-word names previously used.
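
To check what the model actually sees, you can inspect the JSON schema that Pydantic generates for the class. This sketch needs no API call and only confirms that the field-level examples end up in the schema for the name property.

```python
from pydantic import BaseModel, Field


class UserDetail(BaseModel):
    name: str = Field(examples=["Timothee Chalamet", "Zendaya"])
    age: int


schema = UserDetail.model_json_schema()

# The examples are attached to the "name" property only.
print(schema["properties"]["name"])
print("examples" in schema["properties"]["age"])
```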

Leveraging Complex Examples

To generate synthetic examples with more nuance, let's upgrade to the "gpt-4-turbo-preview" model and use model-level examples rather than attribute-level examples:

import instructor

from typing import Iterable
from pydantic import BaseModel, ConfigDict
from openai import OpenAI


# Define the UserDetail model
class UserDetail(BaseModel):
    """Old Wizards"""
    name: str
    age: int

    model_config = ConfigDict(
        json_schema_extra={
            "examples": [
                {"name": "Gandalf the Grey", "age": 1000},
                {"name": "Albus Dumbledore", "age": 150},
            ]
        }
    )


# Patch the OpenAI client to enable the response_model functionality
client = instructor.from_openai(OpenAI())


def generate_fake_users(count: int) -> Iterable[UserDetail]:
    return client.chat.completions.create(
        model="gpt-4-turbo-preview",
        response_model=Iterable[UserDetail],
        messages=[
            {"role": "user", "content": f"Generate `{count}` synthetic examples"},
        ],
    )


for user in generate_fake_users(5):
    print(user)
    """
    name='Merlin' age=196
    name='Saruman the White' age=543
    name='Radagast the Brown' age=89
    name='Morgoth' age=901
    name='Filius Flitwick' age=105 
    """

Leveraging Descriptions

By adjusting the descriptions within our Pydantic models, we can subtly influence the nature of the synthetic data generated. This method allows for a more nuanced control over the output, ensuring that the generated data aligns more closely with our expectations or requirements.

For instance, specifying "Fancy French sounding names" as a description for the name field in our UserDetail model directs the generation process to produce names that fit this particular criterion, resulting in a dataset that is both diverse and tailored to specific linguistic characteristics.

import instructor

from typing import Iterable
from pydantic import BaseModel, Field
from openai import OpenAI


# Define the UserDetail model
class UserDetail(BaseModel):
    name: str = Field(description="Fancy French sounding names")
    age: int


# Patch the OpenAI client to enable the response_model functionality
client = instructor.from_openai(OpenAI())


def generate_fake_users(count: int) -> Iterable[UserDetail]:
    return client.chat.completions.create(
        model="gpt-3.5-turbo",
        response_model=Iterable[UserDetail],
        messages=[
            {"role": "user", "content": f"Generate `{count}` synthetic users"},
        ],
    )


for user in generate_fake_users(5):
    print(user)
    """
    name='Jean' age=25
    name='Claire' age=30
    name='Pierre' age=22
    name='Marie' age=27
    name='Luc' age=35
    """

Structured Output for Open Source and Local LLMs

Instructor has expanded its capabilities for language models. It started with API interactions via the OpenAI SDK, using Pydantic for structured data validation. Now, Instructor supports multiple models and platforms.

The integration of JSON mode improved adaptability to vision models and open source alternatives. This allows support for models from GPT and Mistral to models on Ollama and Hugging Face, using llama-cpp-python.

Instructor now works with cloud-based APIs and local models for structured data extraction. Developers can refer to our guide on Patching for information on using JSON mode with different models.

For learning about Instructor and Pydantic, we offer a course on Steering language models towards structured outputs.

The following sections show examples of Instructor's integration with platforms and local setups for structured outputs in AI projects.