Prioritize Uncertain Examples
When we have a large pool of unlabeled examples that could be used in a prompt, how should we decide which examples to manually label?
Active prompting is a method for identifying the most effective examples for human annotation. The process involves four key steps (a minimal end-to-end sketch follows the list):
- Uncertainty Estimation: Assess the uncertainty of the LLM's predictions on each candidate example
- Selection: Choose the most uncertain examples for human annotation
- Annotation: Have humans label the selected examples
- Inference: Prompt the LLM with the newly annotated examples included as exemplars
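Before going through each step in detail, here is a minimal, self-contained sketch of the whole loop. The stubbed LLM, the example pool, and the placeholder label are illustrative assumptions, not part of the method itself; concrete versions of each step follow in the sections below.

```python
import random

# Illustrative sketch of the four steps using a stubbed LLM.
# The stub, the example pool, and n = 1 are assumptions for demonstration only.


def stub_llm(example: str) -> str:
    """Stand-in for a real LLM call; random labels simulate varying confidence."""
    return random.choice(["positive", "negative"])


def disagreement(example: str, k: int = 5) -> float:
    """Uncertainty as the ratio of unique responses to total responses over k queries."""
    responses = [stub_llm(example) for _ in range(k)]
    return len(set(responses)) / k


pool = [
    "I am very excited today.",
    "The movie was fine, I suppose.",
    "Worst purchase I have ever made.",
]

# 1. Uncertainty estimation: score every unlabeled example
scores = {example: disagreement(example) for example in pool}

# 2. Selection: keep the most uncertain example(s)
selected = sorted(scores, key=scores.get, reverse=True)[:1]

# 3. Annotation: a human supplies the gold label for each selected example
annotated = {example: "<human label>" for example in selected}

# 4. Inference: prepend the annotated examples to future prompts as exemplars
print(annotated)
```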
Uncertainty Estimation
In this step, we define an unsupervised method to measure the uncertainty of an LLM in answering a given example.
Uncertainty Estimation Example
Let's say we ask an LLM the following query:
query = "Classify the sentiment of this sentence as positive or negative: I am very excited today."
and the LLM returns:
response = "positive"
The goal of uncertainty estimation is to answer: How sure is the LLM in this response?
To do this, we query the LLM with the same example k times and then measure how dissimilar the k responses are. Three possible metrics1 are:
- Disagreement: Ratio of unique responses to total responses.
- Entropy: Measurement based on frequency of each response.
- Variance: Calculation of the spread of numerical responses.
Below is an example of uncertainty estimation for a single input example using the disagreement uncertainty metric.
import instructor
from pydantic import BaseModel
from openai import OpenAI


class Response(BaseModel):
    height: int


client = instructor.from_openai(OpenAI())


def query_llm():
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=Response,
        messages=[
            {
                "role": "user",
                "content": "How tall is the Empire State Building in meters?",
            }
        ],
    )


def calculate_disagreement(responses):
    unique_responses = set(responses)
    h = len(unique_responses)
    return h / len(responses)  # ratio of unique responses to total responses


if __name__ == "__main__":
    k = 5  # (1)!
    responses = [query_llm() for _ in range(k)]  # Query the LLM k times
    for response in responses:
        print(response)
        #> height=443
        #> height=443
        #> height=443
        #> height=443
        #> height=381
    print(
        calculate_disagreement([response.height for response in responses])
    )  # Calculate the uncertainty metric
    #> 0.4
- k is the number of times to query the LLM with a single unlabeled example
This process will then be repeated for all unlabeled examples.
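The example above uses the disagreement metric. For comparison, below is a minimal sketch of how the entropy and variance metrics could be computed over the same k responses; the helper names and the reuse of the height values from the example are illustrative, and the exact formulations in the paper may differ in detail.

```python
import math
from statistics import pvariance


def calculate_entropy(responses: list) -> float:
    """Entropy of the frequency distribution of the k responses.

    Responses spread evenly across many distinct answers give high entropy
    (more uncertainty); identical responses give zero entropy.
    """
    k = len(responses)
    frequencies = [responses.count(r) / k for r in set(responses)]
    return -sum(p * math.log(p) for p in frequencies)  # natural log; base only rescales


def calculate_variance(responses: list) -> float:
    """Population variance of numerical responses; a larger spread implies more uncertainty."""
    return pvariance(responses)


heights = [443, 443, 443, 443, 381]  # the responses from the example above
print(round(calculate_entropy(heights), 2))
#> 0.5
print(round(calculate_variance(heights), 2))
#> 615.04
```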
Selection & Annotation
Once we have a set of examples and their uncertainties, we can select n of them to be annotated by humans. Here, we choose the examples with the highest uncertainties.
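As a brief sketch (the uncertainties dictionary and the example sentences are made up for illustration), selecting the n most uncertain examples might look like this:

```python
def select_most_uncertain(uncertainties: dict[str, float], n: int) -> list[str]:
    """Return the n examples with the highest uncertainty scores.

    `uncertainties` maps each unlabeled example to the score computed in the
    uncertainty estimation step (e.g. disagreement).
    """
    ranked = sorted(uncertainties.items(), key=lambda item: item[1], reverse=True)
    return [example for example, _ in ranked[:n]]


uncertainties = {
    "I am very excited today.": 0.2,
    "The movie was fine, I suppose.": 0.8,
    "Not sure how I feel about this.": 1.0,
}
print(select_most_uncertain(uncertainties, n=2))
#> ['Not sure how I feel about this.', 'The movie was fine, I suppose.']
```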
Inference
Now, each time the LLM is prompted, we can include the newly annotated examples as exemplars in the prompt.
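Continuing the sentiment example, the sketch below shows one way to do this with instructor; the Sentiment model, the annotated_examples pairs, and the prompt layout are assumptions for illustration rather than part of the original method.

```python
from typing import Literal

import instructor
from openai import OpenAI
from pydantic import BaseModel

client = instructor.from_openai(OpenAI())


class Sentiment(BaseModel):
    label: Literal["positive", "negative"]


# (sentence, human label) pairs produced by the annotation step
annotated_examples = [
    ("The movie was fine, I suppose.", "positive"),
    ("Not sure how I feel about this.", "negative"),
]


def classify(text: str) -> Sentiment:
    # Prepend the annotated examples to the prompt as exemplars
    exemplars = "\n".join(
        f"Sentence: {sentence}\nSentiment: {label}"
        for sentence, label in annotated_examples
    )
    return client.chat.completions.create(
        model="gpt-4o",
        response_model=Sentiment,
        messages=[
            {
                "role": "user",
                "content": (
                    "Classify the sentiment of the sentence as positive or negative.\n\n"
                    f"{exemplars}\n\nSentence: {text}\nSentiment:"
                ),
            }
        ],
    )


print(classify("I am very excited today.").label)
#> positive
```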
References
1: Active Prompting with Chain-of-Thought for Large Language Models
*: The Prompt Report: A Systematic Survey of Prompting Techniques