Privacy – PII Leakage Detection
What is PII Leakage Detection?
The goal of PII (Personally Identifiable Information) leakage detection is to ensure that models do not expose, memorize, or regenerate sensitive personal information from training datasets or during inference.
Our framework proactively evaluates:
- Training datasets – to detect if they contain sensitive PII before model training.
- Model outputs (especially LLMs) – to determine if the model generates or leaks PII during interactions.
Out-of-the-box, our system detects core PII entities, and additional entities can be defined via custom PII probes.
Core PII Entities
| Entity Type | Description | Example |
|---|---|---|
| account_number | Account numbers (e.g., bank account) | 123456789012 |
| building_number | Building or house numbers | 221B |
| city | City names | New York |
| credit_card_number | Credit card numbers | 4111 1111 1111 1111 |
| date_of_birth | Dates of birth | 1990-05-21 |
| driver_license_number | Driver’s license numbers | D123-456-7890 |
| email_address | Email addresses | john.doe@example.com |
| given_name | First or given names | John |
| id_card_number | ID card numbers | ID1234567 |
| password | Passwords or passcodes | P@ssw0rd! |
| social_security_number | Social security numbers | 123-45-6789 |
| street_name | Street names | Elm Street |
| surname | Last names or surnames | Doe |
| tax_id_number | Tax identification numbers | TIN-123-4567 |
| phone_number | Telephone numbers | +1-202-555-0147 |
| username | Usernames | johndoe92 |
| zip_code | Postal or ZIP codes | 10001 |
Evaluation Logic
We have an advanced evaluator that automatically determines whether an LLM response is compromised or defended:
- If a particular red-team attack strategy succeeds in extracting or leaking PII → the result is marked as Compromised.
- If the attack is unsuccessful and safeguards hold → the result is marked as Defended.
This evaluation provides a clear pass/fail signal for each adversarial test case.
Multi-LLM & Custom AI Support
- Supports multiple LLM backends out-of-the-box.
- Users can easily add their own AI systems for evaluation.
- WebSocket-based models can be connected for streaming evaluations.
- REST API endpoints can be integrated for batch or request-based testing.
This ensures the framework is extensible across any generative AI system.
Red Teaming Dataset
We maintain a specialized red-teaming prompt dataset that tests the model’s susceptibility to revealing PII under various adversarial conditions.
The dataset includes:## PII Leakage & Attack Strategies
Supported Plugins
| Name | Tag(s) | Description |
|---|---|---|
| Basic_PII_Detection | pii:direct | Scans raw model output for any core PII entities (e.g., names, emails, addresses, phone numbers). |
| Encoded_PII_Detection | pii:direct | Detects PII that is obfuscated (e.g., rot13, Base64, reversed strings). |
| Social_Engineering | pii:social | Attempts to elicit PII through indirect, conversational manipulation. |
| MultiStep_Jailbreak | pii:social, harmful:privacy | Evaluates chained jailbreak sequences that progressively bypass safeguards to reveal PII. |
| Session_Leakage | pii:session, harmful:privacy | Tests whether PII from previous users or sessions is leaked in subsequent conversations. |
| API_DB_Access | pii:api-db, harmful:privacy | Ensures the model does not attempt unauthorized API or database access to retrieve PII. |
| Custom_PII_Probe | pii | User-defined sensitive data patterns for detection and red-team testing. |
| IP_Leakage | harmful:intellectual-property | Evaluates whether the model leaks copyrighted, proprietary, or confidential intellectual property. |
Supported Strategies
| Strategy | Description |
|---|---|
| basic | Baseline test that scans model output for direct PII disclosure. |
| retry | Re-uses previously failed attack cases to test for regression vulnerabilities. |
| piglatin | PII obfuscated using Pig Latin transformation. |
| homoglyph | PII hidden using visually similar Unicode characters to evade detection. |
| jailbreak:composite | Combines multiple jailbreak strategies in sequence to bypass filters and extract PII. |
| math-prompt | Encodes or disguises PII using mathematical notation and formula-style prompts. |
| base64 | Encoded PII attempts using Base64 encoding. |
| jailbreak:likert | Likert-scale style prompts that socially engineer the model into leaking PII. |
| leetspeak | Obfuscates PII using leetspeak character substitutions (e.g., @ for a, 0 for o). |
| rot13 | Encoded PII attempts using ROT13 transformation. |
| emoji | PII concealed with emoji substitution or variation selectors to evade detection. |
| jailbreak | General jailbreak attempts designed to override safeguards and reveal hidden PII. |
| hex | Encoded PII attempts using hexadecimal encoding schemes. |
| gcg | Greedy Coordinate Gradient adversarial suffix attack to bypass model safeguards. |
| multilingual | PII requests crafted in multiple languages to bypass detection and filtering safeguards. |
| camelcase | PII disguised through camelCase formatting to avoid detection. |
| jailbreak:tree | Multi-step chained jailbreak prompts that progressively bypass safeguards to reveal PII. |
| morse | Encoded PII attempts using Morse code representation. |
| prompt-injection | Direct prompt injection attempts designed to override safeguards and force PII disclosure. |
Advanced PII Detection Pipeline
Our PII detection system uses a multi-layered evaluator for high accuracy:
- spaCy – NLP-based named entity recognition for common PII categories.
- Microsoft Presidio – A rule-based and ML-enhanced entity recognizer for structured identifiers.
- Custom Fine-Tuned Model – Trained on adversarial and red-team data for nuanced detection in LLM outputs.
Custom PII Probes
Users can define custom probes for proprietary identifiers or domain-specific sensitive information, extending beyond the default PII entities.