Ensuring Consistent Timestamp Formats with Language Models¶

Gemini can Understand timestamps in language model outputs, but they can be inconsistent. Video content timestamps vary between HH:MM:SS and MM:SS formats, causing parsing errors and calculations. This post presents a technique to handle timestamps for clips and films without formatting issues.

We combine Pydantic's data validation with custom parsing for consistent timestamp handling. You'll learn to process timestamps in any format, reducing errors in video content workflows. Kinda like how we ensured matching language in multilingal summarization by adding a simple field.

The post provides a solution using Pydantic to improve timestamp handling in language model projects. This method addresses format inconsistencies and enables timestamp processing.

The Problem¶

Consider a scenario where we're using a language model to generate timestamps for video segments. For shorter videos, timestamps might be in MM:SS format, while longer videos require HH:MM:SS. This inconsistency can lead to parsing errors and incorrect time calculations.

Here's a simple example of how this problem might manifest:

class Segment(BaseModel):
    title: str = Field(..., description="The title of the segment")
    timestamp: str = Field(..., description="The timestamp of the event as HH:MM:SS")


# This might work for some cases, but fails for others:
# "2:00" could be interpreted as 2 minutes or 2 hours
# "1:30:00" doesn't fit the expected format

This approach doesn't account for the variability in timestamp formats and can lead to misinterpretations.

The Solution¶

To address this issue, we can use a combination of Pydantic for data validation and a custom parser to handle different timestamp formats. Here's how we can implement this:

Define the expected time formats
Use a custom validator to parse and normalize the timestamps
Ensure the output is always in a consistent format

Let's look at the improved implementation:

from pydantic import BaseModel, Field, model_validator
from typing import Literal


class SegmentWithTimestamp(BaseModel):
    title: str = Field(..., description="The title of the segment")
    time_format: Literal["HH:MM:SS", "MM:SS"] = Field(
        ..., description="The format of the timestamp"
    )
    timestamp: str = Field(
        ..., description="The timestamp of the event as either HH:MM:SS or MM:SS"
    )

    @model_validator(mode="after")
    def parse_timestamp(self):
        if self.time_format == "HH:MM:SS":
            hours, minutes, seconds = map(int, self.timestamp.split(":"))
        elif self.time_format == "MM:SS":
            hours, minutes, seconds = 0, *map(int, self.timestamp.split(":"))
        else:
            raise ValueError("Invalid time format, must be HH:MM:SS or MM:SS")

        # Normalize seconds and minutes
        total_seconds = hours * 3600 + minutes * 60 + seconds
        hours, remainder = divmod(total_seconds, 3600)
        minutes, seconds = divmod(remainder, 60)

        if hours > 0:
            self.timestamp = f"{hours:02d}:{minutes:02d}:{seconds:02d}"
        else:
            self.timestamp = f"00:{minutes:02d}:{seconds:02d}"

        return self

This implementation offers several advantages:

It explicitly defines the expected time format, reducing ambiguity.
The custom validator parses the input based on the specified format.
It normalizes all timestamps to a consistent HH:MM:SS format.
It handles edge cases, such as when minutes or seconds exceed 59.

Why This Works Better Than Alternatives¶

You might wonder why we can't solve this problem with constrained sampling methods or JSON schema alone. The reason is that timestamp parsing often requires context-aware processing that goes beyond simple pattern matching.

Constrained sampling might enforce a specific format, but it doesn't handle the conversion between different formats or normalization of times.
JSON schema can validate the structure of the data, but it can't perform the complex parsing and normalization required for timestamps.

Our approach combines the strengths of schema validation (using Pydantic) with custom logic to handle the intricacies of timestamp formatting.

Testing the Solution¶

To ensure our implementation works as expected, we can create some test cases:

if __name__ == "__main__":
    # Test cases for SegmentWithTimestamp
    test_cases = [
        (
            SegmentWithTimestamp(
                title="Introduction", time_format="MM:SS", timestamp="00:30"
            ),
            "00:00:30",
        ),
        (
            SegmentWithTimestamp(
                title="Main Topic", time_format="HH:MM:SS", timestamp="00:15:45"
            ),
            "00:15:45",
        ),
        (
            SegmentWithTimestamp(
                title="Conclusion", time_format="MM:SS", timestamp="65:00"
            ),
            "01:05:00",
        ),
    ]

    for input_data, expected_output in test_cases:
        try:
            assert input_data.timestamp == expected_output
            print(f"Test passed: {input_data.timestamp} == {expected_output}")
        except AssertionError:
            print(f"Test failed: {input_data.timestamp} != {expected_output}")

    # Output:
    # Test passed: 00:00:30 == 00:00:30
    # Test passed: 00:15:45 == 00:15:45
    # Test passed: 01:05:00 == 01:05:00

These test cases demonstrate that our solution correctly handles different input formats and normalizes them to a consistent output format.

Conclusion¶

Parsing and validation are needed when handling language model outputs. Its not about coercing language models, but building valid inputs into downstream systems. Combining Pydantic's validation with logic ensures handling across formats. This approach solves timestamp inconsistency and provides a framework for challenges in NLP tasks.

When dealing with time-based data in language models, account for format variability and implement validation and normalization to maintain consistency.