Skip to content

PII Data Extraction and Scrubbing

Overview

This example demonstrates the usage of OpenAI's ChatCompletion model for the extraction and scrubbing of Personally Identifiable Information (PII) from a document. The code defines Pydantic models to manage the PII data and offers methods for both extraction and sanitation.

Defining the Structures

First, Pydantic models are defined to represent the PII data and the overall structure for PII data extraction.

from typing import List
from pydantic import BaseModel


# Define Schemas for PII data
class Data(BaseModel):
    index: int
    data_type: str
    pii_value: str


class PIIDataExtraction(BaseModel):
    """
    Extracted PII data from a document, all data_types should try to have consistent property names
    """

    private_data: List[Data]

    def scrub_data(self, content: str) -> str:
        """
        Iterates over the private data and replaces the value with a placeholder in the form of
        <{data_type}_{i}>
        """
        for i, data in enumerate(self.private_data):
            content = content.replace(data.pii_value, f"<{data.data_type}_{i}>")
        return content

Extracting PII Data

The OpenAI API is utilized to extract PII information from a given document.

from openai import OpenAI
import instructor


client = instructor.from_openai(OpenAI())

EXAMPLE_DOCUMENT = """
# Fake Document with PII for Testing PII Scrubbing Model
# (The content here)
"""

pii_data = client.chat.completions.create(
    model="gpt-4o-mini",
    response_model=PIIDataExtraction,
    messages=[
        {
            "role": "system",
            "content": "You are a world class PII scrubbing model, Extract the PII data from the following document",
        },
        {
            "role": "user",
            "content": EXAMPLE_DOCUMENT,
        },
    ],
)  # type: ignore

print("Extracted PII Data:")
#> Extracted PII Data:
print(pii_data.model_dump_json())
"""
{"private_data":[{"index":0,"data_type":"Name","pii_value":"John Doe"},{"index":1,"data_type":"Email","pii_value":"[email protected]"},{"index":2,"data_type":"Phone Number","pii_value":"(555) 123-4567"},{"index":3,"data_type":"Address","pii_value":"123 Main St, Anytown, USA"},{"index":4,"data_type":"Social Security Number","pii_value":"123-45-6789"}]}
"""

Output of Extracted PII Data

{
  "private_data": [
    {
      "index": 0,
      "data_type": "date",
      "pii_value": "01/02/1980"
    },
    {
      "index": 1,
      "data_type": "ssn",
      "pii_value": "123-45-6789"
    },
    {
      "index": 2,
      "data_type": "email",
      "pii_value": "[email protected]"
    },
    {
      "index": 3,
      "data_type": "phone",
      "pii_value": "555-123-4567"
    },
    {
      "index": 4,
      "data_type": "address",
      "pii_value": "123 Main St, Springfield, IL, 62704"
    }
  ]
}

Scrubbing PII Data

After extracting the PII data, the scrub_data method is used to sanitize the document.

print("Scrubbed Document:")
#> Scrubbed Document:
print(pii_data.scrub_data(EXAMPLE_DOCUMENT))
"""
# Fake Document with PII for Testing PII Scrubbing Model
# He was born on <date_0>. His social security number is <ssn_1>. He has been using the email address <email_2> for years, and he can always be reached at <phone_3>.
"""

Output of Scrubbed Document

# Fake Document with PII for Testing PII Scrubbing Model

## Personal Story

John Doe was born on <date_0>. His social security number is <ssn_1>. He has been using the email address <email_2> for years, and he can always be reached at <phone_3>.

## Residence

John currently resides at <address_4>. He's been living there for about 5 years now.