Extracting Metadata from Images using Structured Extraction
Multimodal language models like gpt-4o excel at processing multimodal inputs, enabling us to extract rich, structured metadata from images.
This is particularly valuable in areas like fashion where we can use these capabilities to understand user style preferences from images and even videos. In this post, we'll see how to use instructor to map images to a given product taxonomy so we can recommend similar products for users.
Why Image Metadata is Useful
Most online e-commerce stores have a taxonomy of products that they sell. This is a way of categorizing products so that users can easily find what they're looking for.
A small example of a taxonomy is shown below. You can think of this as a way of mapping a product to a set of attributes, with some common attributes that are shared across all products.
tops:
  t-shirts:
    - crew_neck
    - v_neck
    - graphic_tees
  sweaters:
    - crewneck
    - cardigan
    - pullover
  jackets:
    - bomber_jackets
    - denim_jackets
    - leather_jackets
bottoms:
  pants:
    - chinos
    - dress_pants
    - cargo_pants
  shorts:
    - athletic_shorts
    - cargo_shorts
colors:
  - black
  - navy
  - white
  - beige
  - brown
By using this taxonomy, we can ensure that the metadata our model extracts is consistent with the products we sell. In this example, we'll analyze style photos from a fitness influencer to understand their fashion preferences and see which products from our own catalog we could recommend to them.
We're using some photos from a fitness influencer called Jpgeez, which you can see below.
While we're mapping these visual elements to a fashion taxonomy here, the same approach applies to any use case where you want to extract structured metadata from images.
Extracting metadata from images
Instructor's Image class
With instructor, working with multimodal data is easy. We can use the Image class to load images from a URL or a local file. We can see this in action below.
import os

import instructor

# Gather the image files to analyze (assumes they live in ./images)
image_files = os.listdir("./images")

# Load each image using instructor.Image.from_path
images = []
for image_file in image_files:
    image_path = os.path.join("./images", image_file)
    image = instructor.Image.from_path(image_path)
    images.append(image)
We provide a variety of methods for loading images, including from a URL, a local file, and even a base64-encoded string, which you can read about in the multimodal documentation.
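For instance, loading an image from a remote URL instead of a local path is a one-liner. This is a minimal sketch; the URL below is just a placeholder.

import instructor

# Load an image hosted at a URL (placeholder address) instead of a local file
image = instructor.Image.from_url("https://example.com/outfit.jpg")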
Defining a response model
Since our taxonomy is defined in a yaml file, we can't use Literal types to define the response model. Instead, we can read in the configuration from the yaml file and use it in a model_validator step to make sure that the metadata we extract is consistent with the taxonomy.
First, we read the taxonomy from the yaml file and build sets of valid categories, subcategories, and product types.
import yaml

with open("taxonomy.yml", "r") as file:
    taxonomy = yaml.safe_load(file)

colors = taxonomy["colors"]
categories = set(taxonomy.keys())
categories.remove("colors")

subcategories = set()
product_types = set()

for category in categories:
    for subcategory in taxonomy[category].keys():
        subcategories.add(subcategory)
        for product_type in taxonomy[category][subcategory]:
            product_types.add(product_type)
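For the sample taxonomy above, this produces the following values (set ordering is arbitrary):

categories = {"tops", "bottoms"}
subcategories = {"t-shirts", "sweaters", "jackets", "pants", "shorts"}
product_types = {
    "crew_neck", "v_neck", "graphic_tees", "crewneck", "cardigan",
    "pullover", "bomber_jackets", "denim_jackets", "leather_jackets",
    "chinos", "dress_pants", "cargo_pants", "athletic_shorts", "cargo_shorts",
}
colors = ["black", "navy", "white", "beige", "brown"]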
Then we can use these sets in our response_model to make sure that the metadata we extract is consistent with the taxonomy.
from pydantic import BaseModel, ValidationInfo, model_validator


class PersonalStyle(BaseModel):
    """
    Ideally you map this to a specific taxonomy
    """

    categories: list[str]
    subcategories: list[str]
    product_types: list[str]
    colors: list[str]

    @model_validator(mode="after")
    def validate_options(self, info: ValidationInfo):
        context = info.context
        colors = context["colors"]
        categories = context["categories"]
        subcategories = context["subcategories"]
        product_types = context["product_types"]

        # Validate colors
        for color in self.colors:
            if color not in colors:
                raise ValueError(
                    f"Color {color} is not in the taxonomy. Valid colors are {colors}"
                )

        for category in self.categories:
            if category not in categories:
                raise ValueError(
                    f"Category {category} is not in the taxonomy. Valid categories are {categories}"
                )

        for subcategory in self.subcategories:
            if subcategory not in subcategories:
                raise ValueError(
                    f"Subcategory {subcategory} is not in the taxonomy. Valid subcategories are {subcategories}"
                )

        for product_type in self.product_types:
            if product_type not in product_types:
                raise ValueError(
                    f"Product type {product_type} is not in the taxonomy. Valid product types are {product_types}"
                )

        return self
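You can exercise the validator outside of an LLM call with Pydantic's model_validate, which accepts the same context dictionary that instructor passes through. Here's a minimal sketch using the sets we built above; the input values are illustrative.

# Passing the taxonomy through `context` makes it available via info.context
style = PersonalStyle.model_validate(
    {
        "categories": ["tops"],
        "subcategories": ["sweaters"],
        "product_types": ["cardigan"],
        "colors": ["black"],
    },
    context={
        "colors": colors,
        "categories": list(categories),
        "subcategories": list(subcategories),
        "product_types": list(product_types),
    },
)

# An out-of-taxonomy value (e.g. colors=["magenta"]) raises a ValidationError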
Making the API call
Lastly, we can combine all of this into a single API call to gpt-4o, where we pass in all of the images and supply PersonalStyle as the response_model parameter.
With instructor's built-in support for Jinja formatting, the context keyword exposes the same data to both the prompt template and our validator, which makes this step incredibly easy to execute.
import openai
import instructor

client = instructor.from_openai(openai.OpenAI())

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": """
            You are a helpful assistant. You are given a list of images and you need to map the personal style of the person in the images to a given taxonomy.

            Here is the taxonomy that you should use

            Colors:
            {% for color in colors %}
            * {{ color }}
            {% endfor %}

            Categories:
            {% for category in categories %}
            * {{ category }}
            {% endfor %}

            Subcategories:
            {% for subcategory in subcategories %}
            * {{ subcategory }}
            {% endfor %}

            Product types:
            {% for product_type in product_types %}
            * {{ product_type }}
            {% endfor %}
            """,
        },
        {
            "role": "user",
            "content": [
                "Here are the images of the person. Describe the personal style of the person in the images from a first-person perspective (e.g. You are ...)",
                *images,
            ],
        },
    ],
    response_model=PersonalStyle,
    context={
        "colors": colors,
        "categories": list(categories),
        "subcategories": list(subcategories),
        "product_types": list(product_types),
    },
)
This then returns the following response.
PersonalStyle(
    categories=['tops', 'bottoms'],
    subcategories=['sweaters', 'jackets', 'pants'],
    product_types=['cardigan', 'crewneck', 'denim_jackets', 'chinos'],
    colors=['brown', 'beige', 'black', 'white', 'navy']
)
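Once you have this structured profile, matching it against your catalog becomes a simple filter. Here's a minimal, hypothetical sketch; the Product shape and catalog entries are illustrative and not part of instructor.

from dataclasses import dataclass

@dataclass
class Product:
    name: str
    product_type: str
    color: str

# A hypothetical slice of the catalog
catalog = [
    Product("Classic Cardigan", "cardigan", "beige"),
    Product("Slim Chinos", "chinos", "navy"),
    Product("Graphic Tee", "graphic_tees", "white"),
]

# Recommend products whose type and color both match the extracted style
recommendations = [
    p for p in catalog
    if p.product_type in resp.product_types and p.color in resp.colors
]
# -> Classic Cardigan and Slim Chinos match this profile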
Looking Ahead
The ability to extract structured metadata from images opens up exciting possibilities for personalization in e-commerce. The key is maintaining the bridge between unstructured visual inspiration and structured product data through well-defined taxonomies and robust validation.
instructor makes working with multimodal data easy, and we're excited to see what you build with it. Give it a try today with pip install instructor and see how easy it is to work with language models using structured extraction.