Skip to content

2024

Using Structured Outputs to convert messy tables into tidy data

Why is this a problem?

Messy data exports are a common problem. Whether it's multiple headers in the table, implicit relationships that make analysis a pain or even just merged cells, using instructor with structured outputs makes it easy to convert messy tables into tidy data, even if all you have is just an image of the table as we'll see below.

Let's look at the following table as an example. It makes analysis unnecessarily difficult because it hides data relationships through empty cells and implicit repetition. If we were using it for data analysis, cleaning it manually would be a huge nightmare.

Structured Outputs with Writer now supported

We're excited to announce that instructor now supports Writer's enterprise-grade LLMs, including their latest Palmyra X 004 model. This integration enables structured outputs and enterprise AI workflows with Writer's powerful language models.

Getting Started

First, make sure that you've signed up for an account on Writer and obtained an API key. Once you've done so, install instructor with Writer support by running pip install instructor[writer] in your terminal.

Make sure to set the WRITER_API_KEY environment variable with your Writer API key or pass it as an argument to the Writer constructor.

PDF Processing with Structured Outputs with Gemini

In this post, we'll explore how to use Google's Gemini model with Instructor to analyse the Gemini 1.5 Pro Paper and extract a structured summary.

The Problem

Processing PDFs programmatically has always been painful. The typical approaches all have significant drawbacks:

  • PDF parsing libraries require complex rules and break easily
  • OCR solutions are slow and error-prone
  • Specialized PDF APIs are expensive and require additional integration
  • LLM solutions often need complex document chunking and embedding pipelines

What if we could just hand a PDF to an LLM and get structured data back? With Gemini's multimodal capabilities and Instructor's structured output handling, we can do exactly that.

Quick Setup

First, install the required packages:

pip install "instructor[google-generativeai]"

Then, here's all the code you need:

import instructor
import google.generativeai as genai
from google.ai.generativelanguage_v1beta.types.file import File
from pydantic import BaseModel
import time

# Initialize the client
client = instructor.from_gemini(
    client=genai.GenerativeModel(
        model_name="models/gemini-1.5-flash-latest",
    )
)

# Define your output structure
class Summary(BaseModel):
    summary: str

# Upload the PDF
file = genai.upload_file("path/to/your.pdf")

# Wait for file to finish processing
while file.state != File.State.ACTIVE:
    time.sleep(1)
    file = genai.get_file(file.name)
    print(f"File is still uploading, state: {file.state}")

print(f"File is now active, state: {file.state}")
print(file)

resp = client.chat.completions.create(
    messages=[
        {"role": "user", "content": ["Summarize the following file", file]},
    ],
    response_model=Summary,
)

print(resp.summary)
Expand to see Raw Results
summary="Gemini 1.5 Pro is a highly compute-efficient multimodal mixture-of-experts model capable of recalling and reasoning over fine-grained information from millions of tokens of context, including multiple long documents and hours of video and audio. It achieves near-perfect recall on long-context retrieval tasks across modalities, improves the state-of-the-art in long-document QA, long-video QA and long-context ASR, and matches or surpasses Gemini 1.0 Ultra's state-of-the-art performance across a broad set of benchmarks. Gemini 1.5 Pro is built to handle extremely long contexts; it has the ability to recall and reason over fine-grained information from up to at least 10M tokens. This scale is unprecedented among contemporary large language models (LLMs), and enables the processing of long-form mixed-modality inputs including entire collections of documents, multiple hours of video, and almost five days long of audio. Gemini 1.5 Pro surpasses Gemini 1.0 Pro and performs at a similar level to 1.0 Ultra on a wide array of benchmarks while requiring significantly less compute to train. It can recall information amidst distractor context, and it can learn to translate a new language from a single set of linguistic documentation. With only instructional materials (a 500-page reference grammar, a dictionary, and ≈ 400 extra parallel sentences) all provided in context, Gemini 1.5 Pro is capable of learning to translate from English to Kalamang, a Papuan language with fewer than 200 speakers, and therefore almost no online presence."

Benefits

The combination of Gemini and Instructor offers several key advantages over traditional PDF processing approaches:

Simple Integration - Unlike traditional approaches that require complex document processing pipelines, chunking strategies, and embedding databases, you can directly process PDFs with just a few lines of code. This dramatically reduces development time and maintenance overhead.

Structured Output - Instructor's Pydantic integration ensures you get exactly the data structure you need. The model's outputs are automatically validated and typed, making it easier to build reliable applications. If the extraction fails, Instructor automatically handles the retries for you with support for custom retry logic using tenacity.

Multimodal Support - Gemini's multimodal capabilities mean this same approach works for various file types. You can process images, videos, and audio files all in the same api request. Check out our multimodal processing guide to see how we extract structured data from travel videos.

Conclusion

Working with PDFs doesn't have to be complicated.

By combining Gemini's multimodal capabilities with Instructor's structured output handling, we can transform complex document processing into simple, Pythonic code.

No more wrestling with parsing rules, managing embeddings, or building complex pipelines – just define your data model and let the LLM do the heavy lifting.

If you liked this, give instructor a try today and see how much easier structured outputs makes working with LLMs become. Get started with Instructor today!

Do I Still Need Instructor with Google's New OpenAI Integration?

Google recently launched OpenAI client compatibility for Gemini.

While this is a significant step forward for developers by simplifying Gemini model interactions, you absolutely still need instructor.

If you're unfamiliar with instructor, we provide a simple interface to get structured outputs from LLMs across different providers.

This makes it easy to switch between providers, get reliable outputs from language models and ultimately build production grade LLM applications.

Building an LLM-based Reranker for your RAG pipeline

Are you struggling with irrelevant search results in your Retrieval-Augmented Generation (RAG) pipeline?

Imagine having a powerful tool that can intelligently reassess and reorder your search results, significantly improving their relevance to user queries.

In this blog post, we'll show you how to create an LLM-based reranker using Instructor and Pydantic. This approach will:

  • Enhance the accuracy of your search results
  • Leverage the power of large language models (LLMs)
  • Utilize structured outputs for precise information retrieval

By the end of this tutorial, you'll be able to implement a llm reranker to label your synthetic data for fine-tuning a traditional reranker, or to build out an evaluation pipeline for your RAG system. Let's dive in!

Structured Outputs with Multimodal Gemini

In this post, we'll explore how to use Google's Gemini model with Instructor to analyze travel videos and extract structured recommendations. This powerful combination allows us to process multimodal inputs (video) and generate structured outputs using Pydantic models. This post was done in collaboration with Kino.ai, a company that uses instructor to do structured extraction from multimodal inputs to improve search for film makers.

Setting Up the Environment

First, let's set up our environment with the necessary libraries:

from pydantic import BaseModel
import instructor
import google.generativeai as genai

Structured Outputs and Prompt Caching with Anthropic

Anthropic's ecosystem now offers two powerful features for AI developers: structured outputs and prompt caching. These advancements enable more efficient use of large language models (LLMs). This guide demonstrates how to leverage these features with the Instructor library to enhance your AI applications.

Structured Outputs with Anthropic and Instructor

Instructor now offers seamless integration with Anthropic's powerful language models, allowing developers to easily create structured outputs using Pydantic models. This integration simplifies the process of extracting specific information from AI-generated responses.

Flashcard generator with Instructor + Burr

Flashcards help break down complex topics and learn anything from biology to a new language or lines for a play. This blog will show how to use LLMs to generate flashcards and kickstart your learning!

Instructor lets us get structured outputs from LLMs reliably, and Burr helps create an LLM application that's easy to understand and debug. It comes with Burr UI, a free, open-source, and local-first tool for observability, annotations, and more!

Audio Support in OpenAI's Chat Completions API

OpenAI has recently introduced audio support in their Chat Completions API, opening up exciting new possibilities for developers working with audio and text interactions. This feature is powered by the new gpt-4o-audio-preview model, which brings advanced voice capabilities to the familiar Chat Completions API interface.