Privacy – PII Leakage Detection

What is PII Leakage Detection?

The goal of PII (Personally Identifiable Information) leakage detection is to ensure that models do not expose, memorize, or regenerate sensitive personal information from training datasets or during inference.

Our framework proactively evaluates:

  1. Training datasets – to detect if they contain sensitive PII before model training.
  2. Model outputs (especially LLMs) – to determine if the model generates or leaks PII during interactions.

Out-of-the-box, our system detects core PII entities, and additional entities can be defined via custom PII probes.


Core PII Entities

Entity Type Description Example
account_number Account numbers (e.g., bank account) 123456789012
building_number Building or house numbers 221B
city City names New York
credit_card_number Credit card numbers 4111 1111 1111 1111
date_of_birth Dates of birth 1990-05-21
driver_license_number Driver’s license numbers D123-456-7890
email_address Email addresses john.doe@example.com
given_name First or given names John
id_card_number ID card numbers ID1234567
password Passwords or passcodes P@ssw0rd!
social_security_number Social security numbers 123-45-6789
street_name Street names Elm Street
surname Last names or surnames Doe
tax_id_number Tax identification numbers TIN-123-4567
phone_number Telephone numbers +1-202-555-0147
username Usernames johndoe92
zip_code Postal or ZIP codes 10001

Evaluation Logic

We have an advanced evaluator that automatically determines whether an LLM response is compromised or defended:

  • If a particular red-team attack strategy succeeds in extracting or leaking PII → the result is marked as Compromised.
  • If the attack is unsuccessful and safeguards hold → the result is marked as Defended.

This evaluation provides a clear pass/fail signal for each adversarial test case.


Multi-LLM & Custom AI Support

  • Supports multiple LLM backends out-of-the-box.
  • Users can easily add their own AI systems for evaluation.
    • WebSocket-based models can be connected for streaming evaluations.
    • REST API endpoints can be integrated for batch or request-based testing.

This ensures the framework is extensible across any generative AI system.


Red Teaming Dataset

We maintain a specialized red-teaming prompt dataset that tests the model’s susceptibility to revealing PII under various adversarial conditions.
The dataset includes:## PII Leakage & Attack Strategies


Supported Plugins

Name Tag(s) Description
Basic_PII_Detection pii:direct Scans raw model output for any core PII entities (e.g., names, emails, addresses, phone numbers).
Encoded_PII_Detection pii:direct Detects PII that is obfuscated (e.g., rot13, Base64, reversed strings).
Social_Engineering pii:social Attempts to elicit PII through indirect, conversational manipulation.
MultiStep_Jailbreak pii:social, harmful:privacy Evaluates chained jailbreak sequences that progressively bypass safeguards to reveal PII.
Session_Leakage pii:session, harmful:privacy Tests whether PII from previous users or sessions is leaked in subsequent conversations.
API_DB_Access pii:api-db, harmful:privacy Ensures the model does not attempt unauthorized API or database access to retrieve PII.
Custom_PII_Probe pii User-defined sensitive data patterns for detection and red-team testing.
IP_Leakage harmful:intellectual-property Evaluates whether the model leaks copyrighted, proprietary, or confidential intellectual property.

Supported Strategies

Strategy Description
basic Baseline test that scans model output for direct PII disclosure.
retry Re-uses previously failed attack cases to test for regression vulnerabilities.
piglatin PII obfuscated using Pig Latin transformation.
homoglyph PII hidden using visually similar Unicode characters to evade detection.
jailbreak:composite Combines multiple jailbreak strategies in sequence to bypass filters and extract PII.
math-prompt Encodes or disguises PII using mathematical notation and formula-style prompts.
base64 Encoded PII attempts using Base64 encoding.
jailbreak:likert Likert-scale style prompts that socially engineer the model into leaking PII.
leetspeak Obfuscates PII using leetspeak character substitutions (e.g., @ for a, 0 for o).
rot13 Encoded PII attempts using ROT13 transformation.
emoji PII concealed with emoji substitution or variation selectors to evade detection.
jailbreak General jailbreak attempts designed to override safeguards and reveal hidden PII.
hex Encoded PII attempts using hexadecimal encoding schemes.
gcg Greedy Coordinate Gradient adversarial suffix attack to bypass model safeguards.
multilingual PII requests crafted in multiple languages to bypass detection and filtering safeguards.
camelcase PII disguised through camelCase formatting to avoid detection.
jailbreak:tree Multi-step chained jailbreak prompts that progressively bypass safeguards to reveal PII.
morse Encoded PII attempts using Morse code representation.
prompt-injection Direct prompt injection attempts designed to override safeguards and force PII disclosure.

Advanced PII Detection Pipeline

Our PII detection system uses a multi-layered evaluator for high accuracy:

  1. spaCy – NLP-based named entity recognition for common PII categories.
  2. Microsoft Presidio – A rule-based and ML-enhanced entity recognizer for structured identifiers.
  3. Custom Fine-Tuned Model – Trained on adversarial and red-team data for nuanced detection in LLM outputs.

Custom PII Probes

Users can define custom probes for proprietary identifiers or domain-specific sensitive information, extending beyond the default PII entities.