Prioritize Uncertain Examples
When we have a large pool of unlabeled examples that could be used in a prompt, how should we decide which examples to manually label?
Active prompting is a method for identifying the most effective examples for human annotation. The process involves four key steps (a minimal end-to-end sketch follows the list):
- Uncertainty Estimation: Assess the uncertainty of the LLM's predictions on each candidate example
- Selection: Choose the most uncertain examples for human annotation
- Annotation: Have humans label the selected examples
- Inference: Prompt the LLM with the newly annotated examples included as exemplars
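Before going through each step in detail, here is a minimal, self-contained sketch of the whole loop. The stubbed LLM, the example pool, and the placeholder label are illustrative assumptions, not part of the method itself; concrete versions of each step follow in the sections below.

```python
import random

# Illustrative sketch of the four steps using a stubbed LLM.
# The stub, the example pool, and n = 1 are assumptions for demonstration only.


def stub_llm(example: str) -> str:
    """Stand-in for a real LLM call; random labels simulate varying confidence."""
    return random.choice(["positive", "negative"])


def disagreement(example: str, k: int = 5) -> float:
    """Uncertainty as the ratio of unique responses to total responses over k queries."""
    responses = [stub_llm(example) for _ in range(k)]
    return len(set(responses)) / k


pool = [
    "I am very excited today.",
    "The movie was fine, I suppose.",
    "Worst purchase I have ever made.",
]

# 1. Uncertainty estimation: score every unlabeled example
scores = {example: disagreement(example) for example in pool}

# 2. Selection: keep the most uncertain example(s)
selected = sorted(scores, key=scores.get, reverse=True)[:1]

# 3. Annotation: a human supplies the gold label for each selected example
annotated = {example: "<human label>" for example in selected}

# 4. Inference: prepend the annotated examples to future prompts as exemplars
print(annotated)
```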
Uncertainty Estimation
In this step, we define an unsupervised method to measure the uncertainty of an LLM in answering a given example.
Uncertainty Estimation Example
Let's say we ask an LLM the following query:
query = "Classify the sentiment of this sentence as positive or negative: I am very excited today."
and the LLM returns:
response = "positive"
The goal of uncertainty estimation is to answer: How sure is the LLM in this response?
To do this, we query the LLM with the same example k times and then measure how dissimilar the k responses are. Three possible metrics1 are:
- Disagreement: Ratio of unique responses to total responses.
- Entropy: Measurement based on frequency of each response.
- Variance: Calculation of the spread of numerical responses.
Below is an example of uncertainty estimation for a single input example using the disagreement uncertainty metric.
import instructor
from pydantic import BaseModel
from openai import OpenAI


class Response(BaseModel):
    height: int


client = instructor.from_openai(OpenAI())


def query_llm():
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=Response,
        messages=[
            {
                "role": "user",
                "content": "How tall is the Empire State Building in meters?",
            }
        ],
    )


def calculate_disagreement(responses):
    unique_responses = set(responses)
    h = len(unique_responses)
    return h / len(responses)  # ratio of unique responses to total responses


if __name__ == "__main__":
    k = 5  # (1)!
    responses = [query_llm() for _ in range(k)]  # Query the LLM k times
    for response in responses:
        print(response)
        #> height=443
        #> height=443
        #> height=443
        #> height=443
        #> height=381
    print(
        calculate_disagreement([response.height for response in responses])
    )  # Calculate the uncertainty metric
    #> 0.4
- k is the number of times to query the LLM with a single unlabeled example
This process will then be repeated for all unlabeled examples.
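The example above uses the disagreement metric. For comparison, below is a minimal sketch of how the entropy and variance metrics could be computed over the same k responses; the helper names and the reuse of the height values from the example are illustrative, and the exact formulations in the paper may differ in detail.

```python
import math
from statistics import pvariance


def calculate_entropy(responses: list) -> float:
    """Entropy of the frequency distribution of the k responses.

    Responses spread evenly across many distinct answers give high entropy
    (more uncertainty); identical responses give zero entropy.
    """
    k = len(responses)
    frequencies = [responses.count(r) / k for r in set(responses)]
    return -sum(p * math.log(p) for p in frequencies)  # natural log; base only rescales


def calculate_variance(responses: list) -> float:
    """Population variance of numerical responses; a larger spread implies more uncertainty."""
    return pvariance(responses)


heights = [443, 443, 443, 443, 381]  # the responses from the example above
print(round(calculate_entropy(heights), 2))
#> 0.5
print(round(calculate_variance(heights), 2))
#> 615.04
```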
Selection & Annotation
Once we have a set of examples and their uncertainties, we can select n of them to be annotated by humans. Here, we choose the examples with the highest uncertainties.
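As a brief sketch (the uncertainties dictionary and the example sentences are made up for illustration), selecting the n most uncertain examples might look like this:

```python
def select_most_uncertain(uncertainties: dict[str, float], n: int) -> list[str]:
    """Return the n examples with the highest uncertainty scores.

    `uncertainties` maps each unlabeled example to the score computed in the
    uncertainty estimation step (e.g. disagreement).
    """
    ranked = sorted(uncertainties.items(), key=lambda item: item[1], reverse=True)
    return [example for example, _ in ranked[:n]]


uncertainties = {
    "I am very excited today.": 0.2,
    "The movie was fine, I suppose.": 0.8,
    "Not sure how I feel about this.": 1.0,
}
print(select_most_uncertain(uncertainties, n=2))
#> ['Not sure how I feel about this.', 'The movie was fine, I suppose.']
```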
Inference
Now, each time the LLM is prompted, we can include the newly annotated examples as exemplars in the prompt.
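Continuing the sentiment example, the sketch below shows one way to do this with instructor; the Sentiment model, the annotated_examples pairs, and the prompt layout are assumptions for illustration rather than part of the original method.

```python
from typing import Literal

import instructor
from openai import OpenAI
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())


class Sentiment(BaseModel):
    label: Literal["positive", "negative"]


# (sentence, human label) pairs produced by the annotation step
annotated_examples = [
    ("The movie was fine, I suppose.", "positive"),
    ("Not sure how I feel about this.", "negative"),
]


def classify(text: str) -> Sentiment:
    # Prepend the annotated examples to the prompt as exemplars
    exemplars = "\n".join(
        f"Sentence: {sentence}\nSentiment: {label}"
        for sentence, label in annotated_examples
    )
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=Sentiment,
        messages=[
            {
                "role": "user",
                "content": (
                    "Classify the sentiment of the sentence as positive or negative.\n\n"
                    f"{exemplars}\n\nSentence: {text}\nSentiment:"
                ),
            }
        ],
    )


print(classify("I am very excited today.").label)
#> positive
```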
References
1: Active Prompting with Chain-of-Thought for Large Language Models
*: The Prompt Report: A Systematic Survey of Prompting Techniques