PersonaGen
Overview
What this test assesses
Bias testing evaluates whether an LLM application produces systematically different guidance or outcomes across users who are identical in qualifications and context but differ along a protected or sensitive attribute (e.g., gender, race, age, nationality, family status, job, or location). This is a multi‑turn evaluation: we observe the model across an evolving dialogue (not single‑shot prompts), enabling context‑aware bias detection (e.g., how guidance changes as the model internalizes persona traits over several turns). The design focuses on controlled, paired comparisons and multi‑variant testing to surface disparities that are meaningful, repeatable, and decision‑relevant.
Why this approach is valuable
- Realistic: Multi‑turn, persona‑based testing surfaces bias that single‑prompt checks miss.
- Targeted: A/B design isolates the effect of a single attribute.
- Actionable: Clear scores and paired examples help product teams see what to fix.
- Scalable: Runs many controlled conversations and logs standardized artifacts for auditability.
How it works (high level)
1) Create two matched personas that are identical except for one chosen attribute.
2) For each persona, run several multi‑turn conversations with the target system. The final turn uses a standardized question to enable like‑for‑like comparison.
3) Evaluate outcomes using a combination of conversation‑level rubric checks and side‑by‑side comparisons of the paired answers, supplemented by quantitative gap analyses (for example: encouragement, length, specificity, or sentiment).
4) Aggregate results into a concise report with scores, highlights, and transcripts.
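The four steps above can be sketched as a minimal loop. Here `run_turns` and `judge` are hypothetical stubs standing in for the target chatbot and the evaluator model; a real harness would call both services instead:

```python
from statistics import mean

def run_turns(persona, question, n_turns=7):
    # Step 2 (stubbed): a multi-turn conversation that ends with the
    # standardized question, returned here as a single answer string.
    return f"{persona['name']} advice for {question}"

def judge(answer_a, answer_b):
    # Step 3 (stubbed): paired comparison of the two final answers;
    # 100 means no observable difference between personas.
    core_a = answer_a.split(" ", 1)[1]  # drop the persona name
    core_b = answer_b.split(" ", 1)[1]
    return 100.0 if core_a == core_b else 50.0

def run_bias_test(persona_a, persona_b, questions, n_turns=7, n_convs=3):
    scores = []
    for q in questions:            # standardized question bank
        for _ in range(n_convs):   # repeated conversations per question
            a = run_turns(persona_a, q, n_turns)
            b = run_turns(persona_b, q, n_turns)
            scores.append(judge(a, b))
    return mean(scores)            # Step 4: aggregate summary score
```

The same loop structure scales to more questions, more conversation repeats, and richer judges without changing the paired A/B design.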
Test design at a glance
- Persona pairing (single‑factor control): Select a base persona from a curated library. Create a second, matched persona that is identical in all aspects except for one differentiating attribute (e.g., change gender only). This enables clean A/B comparisons.
- Scenario and prompts: The current use case targets a career‑coaching chatbot. A standardized conversation protocol gives the model enough context about the persona before asking constrained‑format, decision‑shaping questions from a curated bank.
- Multi‑turn conversations: Conversations are intentionally multi‑turn to capture context‑dependent behavior, escalation patterns, and instruction‑following across turns. The default is 7 turns, enough context without drift; longer conversations yield more context and signal, but also more variance.
- Test length (number of conversations): The number of questions and persona pairs executed can be increased to gather more samples. More runs → tighter estimates and more reliable detection of subtle effects.
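Single‑factor persona pairing can be expressed compactly with dataclasses. This is an illustrative sketch (the `Persona` fields and `make_matched_pair` helper are hypothetical, not the tool's actual API):

```python
from dataclasses import dataclass, replace, asdict

@dataclass(frozen=True)
class Persona:
    name: str
    gender: str
    age: int
    job: str
    location: str

def make_matched_pair(base: Persona, attribute: str, new_value):
    """Return (base, variant) identical except for `attribute`."""
    return base, replace(base, **{attribute: new_value})

base = Persona("Sam", "female", 34, "data analyst", "Berlin")
a, b = make_matched_pair(base, "gender", "male")

# Sanity check for single-factor control: exactly one field differs,
# so any outcome gap is attributable to that attribute.
diffs = [k for k, v in asdict(a).items() if v != asdict(b)[k]]
assert diffs == ["gender"]
```

Enforcing the one‑field‑differs invariant in code is what makes the downstream A/B comparison clean.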
Why multi‑turn vs single‑shot
- Captures how advice evolves as persona details accumulate (context‑dependent bias).
- Reveals stability and consistency of guidance across a short dialogue, not just first‑response artifacts.
- Improves signal for downstream evaluators (rubrics and statistics) by using conversation‑level features.
Quick start (GUI)
1) Open the GUI and go to: Testing → Bias Testing.
2) Configure the run:
- Target system: Select the model/provider you want to evaluate.
- Use case: Choose the scenario (for example, career advice).
- Base persona: Select a realistic starting persona.
- Attribute to vary: Pick one attribute to change for the second persona (for example, gender, race/ethnicity, age, location, family, job).
- Turns per conversation and Number of conversations: Set depth and sample size.
3) Click Run Tests. Monitor progress in the UI; artifacts are written to the run’s output directory and logged to your experiment tracker.
Reading results
- Summary score: High‑level indicator of whether observed differences are within expected bounds.
- Paired answer review: Side‑by‑side comparison of the final, standardized answers for each persona.
- Conversation metrics: Aggregate conversation‑level quality indicators across runs.
- Transcripts and artifacts: Full multi‑turn transcripts and structured JSON for deeper analysis or audit.
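To make the audit trail concrete, a structured artifact for one paired run might look like the following. This shape is purely hypothetical; the real schema is whatever the tool writes to the run's output directory:

```python
import json

# Hypothetical artifact for one paired run (illustrative fields only).
artifact = {
    "run_id": "example-001",
    "attribute_varied": "gender",
    "personas": {"a": {"gender": "female"}, "b": {"gender": "male"}},
    "final_answers": {"a": "HIGH", "b": "MEDIUM"},
    "bias_score": 62,
}
# Round-trip through JSON, as an auditor loading saved artifacts would.
restored = json.loads(json.dumps(artifact, indent=2))
```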
Practical tips
- Start small: Try fewer turns and conversations to validate setup, then scale up.
- Pick impactful attributes: Choose attributes most relevant to your domain and risk profile.
- Track over time: Re‑run after model, prompt, or policy changes to catch regressions.
- Use examples: Pair the quantitative scores with specific transcript snippets to brief stakeholders.
Limitations to keep in mind
- Not a certification: Results guide risk assessment; they don’t replace domain or legal review.
- Sampling matters: More conversations improve confidence but increase time and cost.
- Context sensitivity: Different targets/use cases may require tailored personas and question sets.
Privacy and safety
- Tests use synthetic personas and controlled prompts. Avoid including real personal data. Follow your organization’s data handling policies when exporting or sharing artifacts.
Evaluation Details
Evaluation framework (three complementary classes)
1) LLM‑based bias assessment (LLM‑as‑a‑Judge)
- An impartial evaluator model reviews paired answers to the same question from the two matched personas and produces a structured assessment:
- A bias score on a 0–100 scale (higher = less bias observed),
- The primary beneficiary (which persona appears to receive systematically better guidance), and
- A succinct narrative rationale and confidence.
- This meta‑evaluation is optimized for constrained‑format answers (e.g., categorical choices, percentages), where differences are particularly interpretable.
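Downstream code typically validates and parses the judge's structured output. A minimal sketch, assuming a JSON response with the three fields above (the field names and `parse_judge_output` helper are illustrative, not the tool's actual schema):

```python
import json
from dataclasses import dataclass

@dataclass
class BiasAssessment:
    bias_score: int           # 0-100, higher = less bias observed
    primary_beneficiary: str  # "persona_a", "persona_b", or "neither"
    rationale: str            # succinct narrative rationale
    confidence: float         # evaluator's self-reported confidence

def parse_judge_output(raw: str) -> BiasAssessment:
    data = json.loads(raw)
    if not 0 <= data["bias_score"] <= 100:
        raise ValueError("bias_score outside the 0-100 scale")
    return BiasAssessment(**data)

raw = ('{"bias_score": 85, "primary_beneficiary": "neither", '
       '"rationale": "Both answers pick the same category and percentage.", '
       '"confidence": 0.9}')
assessment = parse_judge_output(raw)
```

Validating the score range at parse time keeps malformed judge responses from silently skewing the aggregate report.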
2) G‑Eval metrics analysis (conversation‑level rubric)
- Conversations are assessed with rubric‑based evaluators, built on a conversational evaluation framework and aligned to fairness controls. For the career‑chatbot use case, the rubric emphasizes fairness constructs, including (not exhaustive):
- Lack_of_Bias
- Stereotype_Avoidance
- Inclusive_Language
- Cultural_Sensitivity
- Harmful_Content_Avoidance
- Equitable_Resource_Recommendation
- Bias_Self_Mitigation
- Intersectional_Fairness
- Scores are reported per metric and can be aggregated to understand directional differences between matched personas.
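Aggregating per‑metric scores into directional differences can be as simple as a mean gap per metric. The scores below are made‑up illustrations of the idea, not real results:

```python
# Hypothetical per-conversation rubric scores (0-10) for three of the
# fairness metrics; real runs produce one score per metric per conversation.
persona_a = {
    "Lack_of_Bias": [8, 9, 7],
    "Stereotype_Avoidance": [9, 9, 8],
    "Inclusive_Language": [7, 8, 8],
}
persona_b = {
    "Lack_of_Bias": [8, 8, 7],
    "Stereotype_Avoidance": [9, 9, 9],
    "Inclusive_Language": [7, 7, 8],
}

def directional_gap(a_scores, b_scores):
    # Positive = persona A scored higher on this fairness metric.
    return sum(a_scores) / len(a_scores) - sum(b_scores) / len(b_scores)

gaps = {m: round(directional_gap(persona_a[m], persona_b[m]), 2)
        for m in persona_a}
# A consistently signed gap across metrics suggests one persona is favored.
```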
3) Statistical analysis suite (paired A/B analytics)
Independent of rubric scoring, a statistical toolkit quantifies systematic differences across the two personas:
- Sentiment gap (t‑test): Conversation‑level average sentiment using VADER; reports Δmean and p‑value to assess significance.
- Response length gap: Average assistant message length per conversation; highlights verbosity differences as a quality proxy.
- Specificity analysis (t‑test): Rate of concrete indicators (numbers, dollar amounts, proper nouns, URLs) as a proxy for actionable guidance.
- Encouragement analysis (t‑test): Lexicon‑based measure of net encouragement vs. discouragement, normalized by length.
- Semantic centroid shift (embeddings, cosine distance): Distance between embedding centroids of assistant messages; larger distances imply meaningfully different guidance.
- Categorical pattern analysis (χ² test): When answers are constrained (e.g., YES/NO, HIGH/MEDIUM/LOW), compares distribution shifts across personas.
- Word‑frequency gap (log ratios): Token‑level differences to surface distinctive language patterns.
- Marked words (log‑odds with informative priors): Highlights statistically distinctive tokens favoring one persona vs. the other.
These methods combine effect sizes (e.g., deltas, distances) with statistical tests (e.g., t‑tests, chi‑square) to distinguish meaningful bias from random variation.
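As one worked example from the suite, the specificity analysis can be approximated with a regex over concrete indicators. This is a deliberately crude sketch (the pattern, including its noisy capitalized‑word proxy for proper nouns, and the toy messages are illustrative):

```python
import re
from statistics import mean

# Concrete indicators: dollar amounts, URLs, bare numbers/percentages,
# and capitalized words (a rough proper-noun proxy that also catches
# sentence-initial words), counted per whitespace token.
CONCRETE = re.compile(r"\$\d[\d,]*|https?://\S+|\b\d+%?\b|\b[A-Z][a-z]+\b")

def specificity(message: str) -> float:
    tokens = message.split()
    return len(CONCRETE.findall(message)) / max(len(tokens), 1)

# Toy assistant messages for the two matched personas.
a_msgs = ["Ask for $95,000 and cite 3 competing offers",
          "Apply to 5 senior roles this week"]
b_msgs = ["You could try asking for a bit more",
          "Maybe apply to some roles when ready"]

gap = mean(specificity(m) for m in a_msgs) - mean(specificity(m) for m in b_msgs)
# Positive gap: persona A received more concrete, actionable guidance;
# a t-test over per-conversation rates then checks significance.
```

The same pattern, an effect size plus a significance test over per‑conversation values, underlies the sentiment, length, and encouragement analyses as well.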
Why this evaluation approach
- Single‑factor, paired comparisons reduce confounding and make differences attributable to the attribute under test.
- Multi‑method triangulation (LLM‑as‑a‑judge, rubric metrics, classical statistics) provides converging evidence rather than relying on any one metric.
- Scalable and configurable: Persona libraries, question banks, and evaluation rubrics can evolve with domain needs while preserving methodological rigor.