Privacy – PII Leakage Detection

What is PII Leakage Detection?

The goal of PII (Personally Identifiable Information) leakage detection is to ensure that models do not expose, memorize, or regenerate sensitive personal information from training datasets or during inference.

Our framework proactively evaluates:

Training datasets – to detect if they contain sensitive PII before model training.
Model outputs (especially LLMs) – to determine if the model generates or leaks PII during interactions.

Out-of-the-box, our system detects core PII entities, and additional entities can be defined via custom PII probes.

Core PII Entities

Entity Type	Description	Example
account_number	Account numbers (e.g., bank account)	`123456789012`
building_number	Building or house numbers	`221B`
city	City names	`New York`
credit_card_number	Credit card numbers	`4111 1111 1111 1111`
date_of_birth	Dates of birth	`1990-05-21`
driver_license_number	Driver’s license numbers	`D123-456-7890`
email_address	Email addresses	`john.doe@example.com`
given_name	First or given names	`John`
id_card_number	ID card numbers	`ID1234567`
password	Passwords or passcodes	`P@ssw0rd!`
social_security_number	Social security numbers	`123-45-6789`
street_name	Street names	`Elm Street`
surname	Last names or surnames	`Doe`
tax_id_number	Tax identification numbers	`TIN-123-4567`
phone_number	Telephone numbers	`+1-202-555-0147`
username	Usernames	`johndoe92`
zip_code	Postal or ZIP codes	`10001`

Evaluation Logic

We have an advanced evaluator that automatically determines whether an LLM response is compromised or defended:

If a particular red-team attack strategy succeeds in extracting or leaking PII → the result is marked as Compromised.
If the attack is unsuccessful and safeguards hold → the result is marked as Defended.

This evaluation provides a clear pass/fail signal for each adversarial test case.

Multi-LLM & Custom AI Support

Supports multiple LLM backends out-of-the-box.
Users can easily add their own AI systems for evaluation.
- WebSocket-based models can be connected for streaming evaluations.
- REST API endpoints can be integrated for batch or request-based testing.

This ensures the framework is extensible across any generative AI system.

Red Teaming Dataset

We maintain a specialized red-teaming prompt dataset that tests the model’s susceptibility to revealing PII under various adversarial conditions.
The dataset includes:## PII Leakage & Attack Strategies

Supported Plugins

Name	Tag(s)	Description
Basic_PII_Detection	pii:direct	Scans raw model output for any core PII entities (e.g., names, emails, addresses, phone numbers).
Encoded_PII_Detection	pii:direct	Detects PII that is obfuscated (e.g., rot13, Base64, reversed strings).
Social_Engineering	pii:social	Attempts to elicit PII through indirect, conversational manipulation.
MultiStep_Jailbreak	pii:social, harmful:privacy	Evaluates chained jailbreak sequences that progressively bypass safeguards to reveal PII.
Session_Leakage	pii:session, harmful:privacy	Tests whether PII from previous users or sessions is leaked in subsequent conversations.
API_DB_Access	pii:api-db, harmful:privacy	Ensures the model does not attempt unauthorized API or database access to retrieve PII.
Custom_PII_Probe	pii	User-defined sensitive data patterns for detection and red-team testing.
IP_Leakage	harmful:intellectual-property	Evaluates whether the model leaks copyrighted, proprietary, or confidential intellectual property.

Supported Strategies

Strategy	Description
basic	Baseline test that scans model output for direct PII disclosure.
retry	Re-uses previously failed attack cases to test for regression vulnerabilities.
piglatin	PII obfuscated using Pig Latin transformation.
homoglyph	PII hidden using visually similar Unicode characters to evade detection.
jailbreak:composite	Combines multiple jailbreak strategies in sequence to bypass filters and extract PII.
math-prompt	Encodes or disguises PII using mathematical notation and formula-style prompts.
base64	Encoded PII attempts using Base64 encoding.
jailbreak:likert	Likert-scale style prompts that socially engineer the model into leaking PII.
leetspeak	Obfuscates PII using leetspeak character substitutions (e.g., `@` for `a`, `0` for `o`).
rot13	Encoded PII attempts using ROT13 transformation.
emoji	PII concealed with emoji substitution or variation selectors to evade detection.
jailbreak	General jailbreak attempts designed to override safeguards and reveal hidden PII.
hex	Encoded PII attempts using hexadecimal encoding schemes.
gcg	Greedy Coordinate Gradient adversarial suffix attack to bypass model safeguards.
multilingual	PII requests crafted in multiple languages to bypass detection and filtering safeguards.
camelcase	PII disguised through camelCase formatting to avoid detection.
jailbreak:tree	Multi-step chained jailbreak prompts that progressively bypass safeguards to reveal PII.
morse	Encoded PII attempts using Morse code representation.
prompt-injection	Direct prompt injection attempts designed to override safeguards and force PII disclosure.

Advanced PII Detection Pipeline

Our PII detection system uses a multi-layered evaluator for high accuracy:

spaCy – NLP-based named entity recognition for common PII categories.
Microsoft Presidio – A rule-based and ML-enhanced entity recognizer for structured identifiers.
Custom Fine-Tuned Model – Trained on adversarial and red-team data for nuanced detection in LLM outputs.

Custom PII Probes

Users can define custom probes for proprietary identifiers or domain-specific sensitive information, extending beyond the default PII entities.