Using Structured Outputs to convert messy tables into tidy data¶

Why is this a problem?¶

Messy data exports are a common problem. Whether it's multiple headers in the table, implicit relationships that make analysis a pain or even just merged cells, using instructor with structured outputs makes it easy to convert messy tables into tidy data, even if all you have is just an image of the table as we'll see below.

Let's look at the following table as an example. It makes analysis unnecessarily difficult because it hides data relationships through empty cells and implicit repetition. If we were using it for data analysis, cleaning it manually would be a huge nightmare.

For example, the subject ID (321) and GTT date only appear in the first row, with blank cells below implying these values apply to the following rows. This format breaks most pandas operations - you can't simply group by subject ID or merge with other datasets without complex preprocessing to fill in these missing values.

Instead, we have time series measurements spread across multiple rows, mixed data types in the insulin column (numbers and "lo off curve"), and repeated subject information hidden through empty cells. This means even simple operations like calculating mean glucose levels by time point or plotting glucose curves require data reshaping and careful handling of missing/special values.

Using Structured Outputs¶

Defining a custom type¶

Using tools like instructor to automatically convert untidy data into tidy format can save hours of preprocessing and reduce errors in your analysis pipeline.

Let's start by first defining a custom type that can parse the markdown table into a pandas dataframe.

from io import StringIO
from typing import Annotated, Any
from pydantic import BeforeValidator, PlainSerializer, InstanceOf, WithJsonSchema
import pandas as pd


def md_to_df(data: Any) -> Any:
    # Convert markdown to DataFrame
    if isinstance(data, str):
        return (
            pd.read_csv(
                StringIO(data),  # Process data
                sep="|",
                index_col=1,
            )
            .dropna(axis=1, how="all")
            .iloc[1:]
            .applymap(lambda x: x.strip())
        )
    return data


MarkdownDataFrame = Annotated[
    InstanceOf[pd.DataFrame],
    BeforeValidator(md_to_df),
    PlainSerializer(lambda df: df.to_markdown()),
    WithJsonSchema(
        {
            "type": "string",
            "description": "The markdown representation of the table, each one should be tidy, do not try to join tables that should be separate",
        }
    ),
]

Extracting the table¶

Then with this new custom data type, it becomes easy to just pass the image to the LLM and get a tidy dataframe in response.

import instructor
from pydantic import BaseModel
from openai import OpenAI


class Table(BaseModel):
    caption: str
    dataframe: MarkdownDataFrame  # Custom type for handling tables


class TidyTables(BaseModel):
    tables: list[Table]


# Patch the OpenAI client with instructor
client = instructor.from_openai(OpenAI())


def extract_table(image_path: str) -> TidyTables:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": [
                    "Convert this untidy table to tidy format",
                    instructor.Image.from_path(image_path),
                ],
            }
        ],
        response_model=TidyTables,
    )


extracted_tables = extract_table("./untidy_table.png")

This then returns the following output for us as a single pandas dataframe which we can easily plot and do any sort of data analysis on.

ID	GTT date	GTT weight	time	glucose mg/dl	insulin ng/ml	Comment
321	2/9/15	24.5	0	99.2		lo off curve
321	2/9/15	24.5	5	349.3	0.205
321	2/9/15	24.5	15	286.1	0.129
321	2/9/15	24.5	30	312	0.175
321	2/9/15	24.5	60	99.9	0.122
321	2/9/15	24.5	120	217.9		lo off curve
322	2/9/15	18.9	0	185.8	0.251
322	2/9/15	18.9	5	297.4	2.228
322	2/9/15	18.9	15	439	2.078
322	2/9/15	18.9	30	362.3	0.775
322	2/9/15	18.9	60	232.7	0.5
322	2/9/15	18.9	120	260.7	0.523
323	2/9/15	24.7	0	198.5	0.151
323	2/9/15	24.7	5	530.6		off curve lo

More importantly, we can also extract multiple tables from a single image. This would be useful in helping to segment and identify different sections of a messy report. With tidy data, we get the benefits of

Each variable being its own column
Each observation being its own row
Each value having its own cell
Seamlessly working with pandas/numpy operations
Visualization libraries "just working"

Conclusion¶

We can actually go one step further and make this even tidier by converting things like weight, glucose and insulin into a specific column called metric which would allow us to add arbitrary metrics to the table without having to change the schema or our plotting code. This is a huge productivity boost when doing complex data analysis.

No more wrestling with complex data cleaning pipelines. Let the model handle the heavy lifting while you focus on analysis. With instructor, getting to that step just became a whole lot easier.

Give instructor a try today and see how you can build reliable applications. Just run pip install instructor or check out our Getting Started Guide