Building an LLM-based Reranker for your RAG pipeline

Are you struggling with irrelevant search results in your Retrieval-Augmented Generation (RAG) pipeline?

Imagine having a powerful tool that can intelligently reassess and reorder your search results, significantly improving their relevance to user queries.

In this blog post, we'll show you how to create an LLM-based reranker using Instructor and Pydantic. This approach will:

  • Enhance the accuracy of your search results
  • Leverage the power of large language models (LLMs)
  • Utilize structured outputs for precise information retrieval

By the end of this tutorial, you'll be able to implement an LLM reranker to label your synthetic data for fine-tuning a traditional reranker, or to build out an evaluation pipeline for your RAG system. Let's dive in!

Setting Up the Environment

First, let's set up our environment with the necessary imports:

import instructor
from openai import OpenAI
from pydantic import BaseModel, Field, field_validator

client = instructor.from_openai(OpenAI())

We're using the instructor library, which integrates seamlessly with OpenAI's API and Pydantic for structured outputs.

Defining the Reranking Models

We'll use Pydantic to define the Label and RerankedResults models that structure the output of our LLM.

Notice that the Label class not only references the chunk_id but also asks the language model for a chain of thought before it assigns a score. This is very useful with models like GPT-4o mini or Claude, but not necessarily needed if we plan to use the o1-mini and o1-preview models:

class Label(BaseModel):
    chunk_id: int = Field(description="The unique identifier of the text chunk")
    chain_of_thought: str = Field(description="The reasoning process used to evaluate the relevance")
    relevancy: int = Field(
        description="Relevancy score from 0 to 10, where 10 is most relevant",
        ge=0,
        le=10,
    )


class RerankedResults(BaseModel):
    labels: list[Label] = Field(description="List of labeled and ranked chunks")

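    # Named sort_by_relevancy so it does not shadow Pydantic's built-in
    # BaseModel.model_validate classmethod.
    @field_validator("labels")
    @classmethod
    def sort_by_relevancy(cls, v: list[Label]) -> list[Label]:
        return sorted(v, key=lambda x: x.relevancy, reverse=True)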

These models ensure that our LLM's output is structured and includes a list of labeled chunks with their relevancy scores. The RerankedResults model includes a validator that automatically sorts the labels by relevancy in descending order.
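To see the sorting and range checks without calling an LLM, here is a minimal sketch that validates a hand-written payload against trimmed-down copies of the two models. The payload is an invented stand-in for model output, and the validator name sort_by_relevancy is our own choice for this sketch:

```python
from pydantic import BaseModel, Field, ValidationError, field_validator


class Label(BaseModel):
    chunk_id: int
    chain_of_thought: str
    relevancy: int = Field(ge=0, le=10)


class RerankedResults(BaseModel):
    labels: list[Label]

    @field_validator("labels")
    @classmethod
    def sort_by_relevancy(cls, v: list[Label]) -> list[Label]:
        # Runs after each Label has been validated, so relevancy is an int here.
        return sorted(v, key=lambda x: x.relevancy, reverse=True)


# Hand-written stand-in for what the LLM might return (not real model output).
raw = {
    "labels": [
        {"chunk_id": 0, "chain_of_thought": "partly related", "relevancy": 4},
        {"chunk_id": 1, "chain_of_thought": "directly answers", "relevancy": 9},
        {"chunk_id": 2, "chain_of_thought": "off topic", "relevancy": 1},
    ]
}

results = RerankedResults.model_validate(raw)
ranked_ids = [label.chunk_id for label in results.labels]
print(ranked_ids)

# A score outside 0-10 is rejected by the ge/le constraints on the field.
try:
    Label(chunk_id=3, chain_of_thought="too high", relevancy=11)
    out_of_range_rejected = False
except ValidationError:
    out_of_range_rejected = True
print(out_of_range_rejected)
```

The labels come back ordered by descending relevancy regardless of the order in the payload, and an out-of-range score raises a ValidationError instead of slipping through.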

Creating the Reranker Function

Next, we'll create a function that uses our LLM to rerank a list of text chunks based on their relevance to a query:

def rerank_results(query: str, chunks: list[dict]) -> RerankedResults:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=RerankedResults,
        messages=[
            {
                "role": "system",
                "content": """
                You are an expert search result ranker. Your task is to evaluate the relevance of each text chunk to the given query and assign a relevancy score.

                For each chunk:
                1. Analyze its content in relation to the query.
                2. Provide a chain of thought explaining your reasoning.
                3. Assign a relevancy score from 0 to 10, where 10 is most relevant.

                Be objective and consistent in your evaluations.
                """,
            },
            {
                "role": "user",
                "content": """
                <query>{{ query }}</query>

                <chunks_to_rank>
                {% for chunk in chunks %}
                <chunk id="{{ chunk.id }}">
                    {{ chunk.text }}
                </chunk>
                {% endfor %}
                </chunks_to_rank>

                Please provide a RerankedResults object with a Label for each chunk.
                """,
            },
        ],
        context={"query": query, "chunks": chunks},
    )

This function takes a query and a list of text chunks as input, sends them to the LLM with a predefined prompt, and returns a structured RerankedResults object. Thanks to Instructor, we can use Jinja templating to inject the query and chunks into the prompt: the keys of the context parameter become the variables available inside the message templates.
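If you want to preview the user message before spending tokens, one option is to render the same template locally. The sketch below uses the jinja2 package directly; we assume this approximates what Instructor does with context, but it is not Instructor's own rendering code, and render_user_prompt is a hypothetical helper:

```python
from jinja2 import Template

# Same user-message template as in rerank_results, without the indentation.
USER_TEMPLATE = """
<query>{{ query }}</query>

<chunks_to_rank>
{% for chunk in chunks %}
<chunk id="{{ chunk.id }}">
    {{ chunk.text }}
</chunk>
{% endfor %}
</chunks_to_rank>
"""


def render_user_prompt(query: str, chunks: list[dict]) -> str:
    # Jinja resolves chunk.id and chunk.text on plain dicts by falling back
    # from attribute access to key lookup.
    return Template(USER_TEMPLATE).render(query=query, chunks=chunks)


preview = render_user_prompt(
    "What are the health benefits of regular exercise?",
    [
        {"id": 0, "text": "Regular exercise can improve cardiovascular health."},
        {"id": 1, "text": "The price of gym memberships varies widely."},
    ],
)
print(preview)
```

Reading the rendered prompt is a quick way to catch a misspelled variable name or a chunk dict that is missing its id or text key.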

Testing the Reranker

To test our LLM-based reranker, we can create a sample query and a list of text chunks. Here's an example of how to use the reranker:

def main():
    query = "What are the health benefits of regular exercise?"
    chunks = [
        {
            "id": 0,
            "text": "Regular exercise can improve cardiovascular health and reduce the risk of heart disease.",
        },
        {
            "id": 1,
            "text": "The price of gym memberships varies widely depending on location and facilities.",
        },
        {
            "id": 2,
            "text": "Exercise has been shown to boost mood and reduce symptoms of depression and anxiety.",
        },
        {
            "id": 3,
            "text": "Proper nutrition is essential for maintaining a healthy lifestyle.",
        },
        {
            "id": 4,
            "text": "Strength training can increase muscle mass and improve bone density, especially important as we age.",
        },
    ]

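    results = rerank_results(query, chunks)

    # Look chunks up by id rather than by list position, so the output stays
    # correct even when ids are not consecutive indices.
    chunk_by_id = {chunk["id"]: chunk for chunk in chunks}

    print("Reranked results:")
    for label in results.labels:
        print(f"Chunk {label.chunk_id} (Relevancy: {label.relevancy}):")
        print(f"Text: {chunk_by_id[label.chunk_id]['text']}")
        print(f"Reasoning: {label.chain_of_thought}")
        print()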

if __name__ == "__main__":
    main()

This test demonstrates how the reranker evaluates and sorts the chunks based on their relevance to the query. The full implementation can be found in the examples/reranker/run.py file.

If you want to extend this example, you could use the rerank_results function to label synthetic data for fine-tuning a traditional reranker, or to build out an evaluation pipeline for your RAG system.
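As a rough illustration of the labeling use case, the sketch below turns scored chunks into (query, positive, negative) triplets, a common input shape for training a traditional reranker. Everything here is an assumption for illustration: the ScoredChunk class, the to_training_triplets helper, the thresholds of 7 and 3, and the hand-assigned scores are not part of the original example:

```python
from dataclasses import dataclass


@dataclass
class ScoredChunk:
    chunk_id: int
    text: str
    relevancy: int  # 0-10 score, as an LLM reranker would assign


def to_training_triplets(
    query: str,
    scored: list[ScoredChunk],
    positive_min: int = 7,  # hypothetical cutoff for "relevant"
    negative_max: int = 3,  # hypothetical cutoff for "irrelevant"
) -> list[dict[str, str]]:
    """Pair every clearly relevant chunk with every clearly irrelevant one."""
    positives = [c for c in scored if c.relevancy >= positive_min]
    negatives = [c for c in scored if c.relevancy <= negative_max]
    # Chunks with middling scores are skipped because they are ambiguous.
    return [
        {"query": query, "positive": p.text, "negative": n.text}
        for p in positives
        for n in negatives
    ]


# Scores below are invented for the example, not real model output.
scored_chunks = [
    ScoredChunk(0, "Regular exercise can improve cardiovascular health.", 9),
    ScoredChunk(1, "The price of gym memberships varies widely.", 1),
    ScoredChunk(2, "Exercise has been shown to boost mood.", 8),
    ScoredChunk(3, "Proper nutrition is essential for a healthy lifestyle.", 3),
    ScoredChunk(4, "Strength training can increase muscle mass.", 5),
]

triplets = to_training_triplets(
    "What are the health benefits of regular exercise?", scored_chunks
)
for triplet in triplets:
    print(triplet["positive"], "|", triplet["negative"])
```

In a real pipeline the ScoredChunk values would come from the labels returned by rerank_results, and the cutoffs would need tuning against a sample you have checked by hand.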

Moreover, we could also add a validator to the Label.chunk_id field to ensure that every chunk_id the model returns is actually present in the chunks list. This is especially useful if ids are UUIDs or complex strings, which are easy for a model to mistype.

Here's an example:

from pydantic import BaseModel, Field, ValidationInfo, field_validator


class Label(BaseModel):
    chunk_id: int = Field(description="The unique identifier of the text chunk")
    ...

    @field_validator("chunk_id")
    @classmethod
    def validate_chunk_id(cls, v: int, info: ValidationInfo) -> int:
        # info.context carries the dictionary passed as `context` to the create call
        context = info.context
        chunks = context["chunks"]
        valid_ids = [chunk["id"] for chunk in chunks]
        if v not in valid_ids:
            raise ValueError(f"Chunk with id {v} not found, must be one of {valid_ids}")
        return v

This will automatically check that the chunk_id is present in the chunks list and raise a ValueError if it is not. Here, info.context is the same context dictionary that rerank_results passes to client.chat.completions.create.
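You can exercise this validator without an LLM call by handing Pydantic the context yourself. The sketch below uses a trimmed-down Label with only the chunk_id field and made-up chunk data. Unlike the version above, it falls back to an empty dictionary when no context is supplied, which is our own defensive choice:

```python
from pydantic import BaseModel, ValidationError, ValidationInfo, field_validator


class Label(BaseModel):
    chunk_id: int

    @field_validator("chunk_id")
    @classmethod
    def validate_chunk_id(cls, v: int, info: ValidationInfo) -> int:
        # info.context is None when no context is supplied; in that case the
        # list of valid ids is empty and every chunk_id is rejected.
        chunks = (info.context or {}).get("chunks", [])
        valid_ids = [chunk["id"] for chunk in chunks]
        if v not in valid_ids:
            raise ValueError(f"Chunk with id {v} not found, must be one of {valid_ids}")
        return v


# Made-up chunks standing in for the list passed to rerank_results.
context = {"chunks": [{"id": 0, "text": "a"}, {"id": 1, "text": "b"}]}

# A known id passes validation.
ok = Label.model_validate({"chunk_id": 1}, context=context)
print(ok.chunk_id)

# An unknown id is rejected, and the message lists the valid ids.
try:
    Label.model_validate({"chunk_id": 7}, context=context)
    error_message = ""
except ValidationError as exc:
    error_message = str(exc)
print(error_message)
```

The error message names the valid ids, which gives the model something concrete to correct if the validation error is fed back to it.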

Building a Pairwise LLM Judge with Instructor and Pydantic

In this blog post, we'll explore how to create a pairwise LLM judge using Instructor and Pydantic. This judge will evaluate the relevance between a question and a piece of text, demonstrating a practical application of structured outputs in language model interactions.

Introduction

Evaluating text relevance is a common task in natural language processing and information retrieval. By leveraging large language models (LLMs) and structured outputs, we can create a system that judges the similarity or relevance between a question and a given text.

Setting Up the Environment

First, let's set up our environment with the necessary imports:

import instructor
import openai
from pydantic import BaseModel, Field

client = instructor.from_openai(openai.OpenAI())

Here, we're using the instructor library, which integrates seamlessly with OpenAI's API and Pydantic for structured outputs.

Defining the Judgment Model

We'll use Pydantic to define a Judgment model that structures the output of our LLM:

class Judgment(BaseModel):
    thought: str = Field(
        description="The step-by-step reasoning process used to analyze the question and text"
    )
    justification: str = Field(
        description="Explanation for the similarity judgment, detailing key factors that led to the conclusion"
    )
    similarity: bool = Field(
        description="Boolean judgment indicating whether the question and text are similar or relevant (True) or not (False)"
    )

This model ensures that our LLM's output is structured and includes a thought process, justification, and a boolean similarity judgment.

Creating the Judge Function

Next, we'll create a function that uses our LLM to judge the relevance between a question and a text:

def judge_relevance(question: str, text: str) -> Judgment:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "system",
                "content": """
                    You are tasked with comparing a question and a piece of text to determine if they are relevant to each other or similar in some way. Your goal is to analyze the content, context, and potential connections between the two.

                    To determine if the question and text are relevant or similar, please follow these steps:

                    1. Carefully read and understand both the question and the text.
                    2. Identify the main topic, keywords, and concepts in the question.
                    3. Analyze the text for any mention of these topics, keywords, or concepts.
                    4. Consider any potential indirect connections or implications that might link the question and text.
                    5. Evaluate the overall context and purpose of both the question and the text.

                    As you go through this process, please use a chain of thought approach. Write out your reasoning for each step inside <thought> tags.

                    After your analysis, provide a boolean judgment on whether the question and text are similar or relevant to each other. Use "true" if they are similar or relevant, and "false" if they are not.

                    Before giving your final judgment, provide a justification for your decision. Explain the key factors that led to your conclusion.

                    Please ensure your analysis is thorough, impartial, and based on the content provided.
                """
            },
            {
                "role": "user",
                "content": """
                    Here is the question:

                    <question>
                    {{question}}
                    </question>

                    Here is the text:
                    <text>
                    {{text}}
                    </text>
                """
            },
        ],
        response_model=Judgment,
        context={"question": question, "text": text},
    )

This function takes a question and a text as input, sends them to the LLM with a predefined prompt, and returns a structured Judgment object.

Testing the Judge

To test our pairwise LLM judge, we can create a set of test pairs and evaluate the judge's performance:

if __name__ == "__main__":
    test_pairs = [
        {
            "question": "What are the main causes of climate change?",
            "text": "Global warming is primarily caused by human activities, such as burning fossil fuels, deforestation, and industrial processes. These activities release greenhouse gases into the atmosphere, trapping heat and leading to a rise in global temperatures.",
            "is_similar": True,
        },
        # ... (other test pairs)
    ]

    score = 0
    for pair in test_pairs:
        result = judge_relevance(pair["question"], pair["text"])
        if result.similarity == pair["is_similar"]:
            score += 1

    print(f"Score: {score}/{len(test_pairs)}")
    #> Score: 9/10

This test loop runs the judge on each pair and compares the result to a predetermined similarity value, calculating an overall score.
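A single accuracy number hides whether the judge errs toward false positives or false negatives. As a small extension, the sketch below computes accuracy, precision, and recall from two lists of booleans. The score_judgments helper and the sample values are hypothetical; in practice predicted would hold result.similarity and expected would hold pair["is_similar"] for each test pair:

```python
def score_judgments(predicted: list[bool], expected: list[bool]) -> dict[str, float]:
    """Compare judge outputs against gold labels."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")

    pairs = list(zip(predicted, expected))
    true_pos = sum(1 for p, e in pairs if p and e)
    false_pos = sum(1 for p, e in pairs if p and not e)
    false_neg = sum(1 for p, e in pairs if not p and e)
    correct = sum(1 for p, e in pairs if p == e)

    return {
        "accuracy": correct / len(pairs) if pairs else 0.0,
        # Of the pairs judged relevant, how many really were?
        "precision": true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0,
        # Of the truly relevant pairs, how many did the judge find?
        "recall": true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0,
    }


# Invented values for illustration, not real judge output.
predicted = [True, True, False, True, False]
expected = [True, False, False, True, True]

metrics = score_judgments(predicted, expected)
print(metrics)
```

Low precision suggests the judge is too lenient and low recall suggests it is too strict, which tells you in which direction to adjust the system prompt.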

Conclusion

By combining Instructor, Pydantic, and OpenAI's language models, we've created a powerful tool for judging text relevance. This approach demonstrates the flexibility and power of structured outputs in LLM applications.

The pairwise LLM judge we've built can be used in various scenarios, such as:

  1. Improving search relevance in information retrieval systems
  2. Evaluating the quality of question-answering systems
  3. Assisting in content recommendation algorithms
  4. Automating parts of the content moderation process

As you explore this technique, consider how you might extend or adapt it for your specific use cases. The combination of structured outputs and large language models opens up a world of possibilities for creating intelligent, interpretable AI systems.