Structured Outputs with Multimodal Gemini
In this post, we'll explore how to use Google's Gemini model with Instructor to analyze travel videos and extract structured recommendations. This powerful combination allows us to process multimodal inputs (video) and generate structured outputs using Pydantic models. This post was done in collaboration with Kino.ai, a company that uses instructor to do structured extraction from multimodal inputs to improve search for film makers.
Setting Up the Environment
First, let's set up our environment with the necessary libraries: