Asenion Assurance for LLM Testing: End-user guide (Web interface)

This document is stand-alone and describes only the web interface: the overall layout, Settings (every option the Configuration page offers), how targets and Connect AI System work, each control on New Test Run, and what happens during runs, in Run History, and in the results panels. It does not cover command-line use or automation APIs.


Contents

  1. Web layout: header and main tabs
  2. Settings (Configuration): every field and button
  3. Connecting to the target system
  4. New Test Run: Simple wizard and Advanced
  5. During and after a run
  6. How to interpret test results

1. Web layout: header and main tabs

Top header (every control)

  • Theme (Light / Dark / System): Chooses the color theme for the console. The choice is remembered in your browser.
  • Refresh: Reloads the Run History list from the server (useful after a long run finishes elsewhere or after another user has started runs).
  • API Docs: Opens a new browser tab with the machine-readable OpenAPI documentation for this server (a technical reference for integrators).
  • Sign Out: Ends your signed-in session when login or single sign-on is enabled. If login is off, behavior depends on your build.

Main tabs (primary navigation)

  • New Test Run: Build and start tests (simple wizard or advanced form).
  • Run History: List past runs; open a run, rename, rerun, resume (if offered), or delete.
  • Target Providers: See which provider YAML files exist under providers/; use Connect AI System and Refresh from here too.
  • Settings: Edit values stored in .env on the server (keys, MLflow, security, optional paths, etc.).

2. Settings (Configuration): every field and button

Open Settings. The page explains that values are saved to a .env file next to the server.

Buttons on the Configuration card

  • Connect AI System: Opens the same Connect AI System drawer used on the run form (see Section 3). Use it after saving cloud credentials so the server can list models.
  • Reload: Fetches the latest values from the server and rebuilds the form (discards unsaved edits in the browser).
  • Save settings: Writes all fields to .env. A short confirmation appears. For secret fields (API keys, passwords), leave blank to keep the existing saved value; only type a new value when you intend to replace it.
  • Where are these stored? (expandable): Shows the path to the .env file on the server.

After Save, many options apply on the next test run without restarting. Login-related and some security options typically require a server restart—the UI may remind you after save.
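
For orientation, a saved .env might contain entries like these. The variable names below are illustrative assumptions (your build's exact names may differ), and secrets are shown as placeholders:

    # Illustrative .env entries; actual variable names depend on your build
    AZURE_API_VERSION=2024-02-15-preview
    AZURE_API_KEY=...
    AWS_ACCESS_KEY_ID=AKIA...
    AWS_SECRET_ACCESS_KEY=...
    MLFLOW_TRACKING_URI=http://localhost:5000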

Azure (fields)

  • Azure API Version: API version string the Azure OpenAI client should send (for example, a dated preview string).
  • Azure API Type: Usually azure for Azure OpenAI–style endpoints.
  • Azure API Key: Secret key from the Azure portal for your OpenAI resource. Leave blank when saving to keep the current key.

AWS (fields)

  • AWS Access Key ID: IAM user access key for Amazon Bedrock (and discovery). Optional if the server uses an instance role or ~/.aws/credentials (see the example after this list).
  • AWS Secret Access Key: Matching secret. Leave blank to keep the current value.
  • AWS Session Token: Optional; for temporary credentials (SSO, assumed role).
  • AWS Region: Region for Bedrock (for example us-east-1).
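
If the server relies on ~/.aws/credentials rather than keys saved here, that file uses the standard AWS shared-credentials format:

    [default]
    aws_access_key_id = AKIA...
    aws_secret_access_key = ...
    # aws_session_token = ...   (only needed for temporary credentials)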

Google Cloud (fields)

  • Google Cloud Project ID: GCP project used for Vertex AI discovery and targets.
  • Google Cloud / Vertex AI region: Region for Vertex (for example us-central1); may default if left blank.
  • Service account key file path: Path on the server to a JSON key file. Leave blank if you use pasted JSON below or default application credentials.
  • Or paste service account JSON: Paste the full JSON key contents; when set, it overrides the file path for server-side auth. Leave blank to keep the current secret.
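
The pasted value is a standard Google Cloud service-account key. Heavily truncated, it looks roughly like this (project and account names are hypothetical; real keys contain additional fields):

    {
      "type": "service_account",
      "project_id": "my-project",
      "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
      "client_email": "discovery-bot@my-project.iam.gserviceaccount.com"
    }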

MLflow (fields)

  • MLflow Tracking URI: URL of your MLflow tracking server (for example http://localhost:5000, or a URL with basic auth; see the example after this list).
  • MLflow Experiment Name: Default experiment name for logged runs.
  • MLflow Username: Used if your MLflow server requires HTTP basic auth.
  • MLflow Password: Password for MLflow auth; leave blank to keep the current value.
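
For the basic-auth case, credentials can be embedded directly in the tracking URI (hypothetical host and account shown), or left out of the URI and supplied via the username and password fields instead:

    http://mlflow_user:s3cret@mlflow.internal.example:5000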

Optional (fields)

  • Project root: Directory that contains server.py, plugins/, and providers/. Normally left empty so the server uses its install directory.
  • Data directory: Root folder for run outputs (default often .asenion). Change this if you need another disk or path.
  • Run timeout (seconds): Maximum duration before the server stops a run (default often 10800 = 3 hours). You can also set hours per run on the New Test Run form.
  • Additional provider (file path): Optional path (absolute or relative to the project root) to an extra provider YAML; that target appears in the AI System to Test list.
  • X-Frame-Options: Controls whether the console may be embedded in an iframe (SAMEORIGIN, DENY, or empty to allow embedding; your admin sets the policy). See the header examples after this list.
  • Frame ancestors (CSP): When embedding is allowed, which parent sites may frame the app (your admin configures this).
  • Cookie SameSite: lax (default) or none. Use none only when the app is embedded in an iframe on another site (typically requires HTTPS).
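
The three embedding-related fields map onto standard browser security headers. For example, to forbid embedding everywhere:

    X-Frame-Options: DENY

Or, to allow embedding only from a company portal (hypothetical URL, and an illustrative cookie name) with the session cookie adjusted for cross-site iframes:

    Content-Security-Policy: frame-ancestors https://portal.example.com
    Set-Cookie: session=...; SameSite=None; Secure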

Security (fields)

  • Enable login: When true, the web UI requires a username and password. Restart the server after changing this.
  • Admin username: Sign-in name when login is enabled.
  • Admin password: Sign-in password; leave blank to keep the current value.
  • Entra ID tenant ID: Microsoft Entra (Azure AD) tenant ID, or common for multi-tenant “Sign in with Microsoft.”
  • Entra ID application (client) ID: From the app registration in Azure.
  • Entra ID client secret: From the app registration; leave blank to keep the current value.
  • Entra ID redirect URI: Must match the URI registered in Azure; if empty, the app builds one from its own base URL.

Test generation — cloud (fields)

These may appear under a group related to remote / cloud test generation (labels depend on version):

  • Remote … generation (toggle): When on, some composite or jailbreak-style tests can be generated using a vendor cloud API. When off, generation uses your evaluator model only. Applies on the next run.
  • Disable sharing to … Cloud: When set to disallow sharing, eval configuration and results are not uploaded to a third-party cloud (recommended for sensitive data).

Evaluation tool behavior (fields)

  • Disable eval CLI telemetry: Reduces anonymous usage telemetry from the background Node evaluation tool.
  • Disable eval CLI update check: Stops the tool from checking for newer versions online.
  • Enable remote test generation: Alternative flag for allowing remote generation APIs for some strategies; the default is often off so your own model is used.
  • Eval CLI account email: Optional; some cloud features of the tool ask for an account email. Pre-fill it here to avoid prompts.

Persona (fields)

  • Multiturn language: Language used for persona and multiturn user messages (choices such as Thai, English, Japanese, etc.). Applies on the next persona run.

3. Connecting to the target system

The target is the model or API you test. Technically it is defined by YAML files in providers/ on the server.

Default file providers/target.yaml

  • id — Internal provider identifier (cloud deployment string).
  • label — Readable name shown in dropdowns.
  • config — Usually apiHost, apiKeyEnv (name of an environment variable in .env), and optionally apiKey (inline key; avoid in production).
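
Putting those keys together, a minimal target file might look like this (all values are hypothetical, and your build's schema may include more fields):

    id: azure:chat:my-deployment
    label: My Azure deployment
    config:
      apiHost: my-resource.openai.azure.com
      apiKeyEnv: AZURE_API_KEY
      # apiKey: an inline key also works but is best avoided in production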

Target Providers tab

  • Lists provider configs the server loads from providers/.
  • Connect AI System and Refresh match the Configuration page: discover models and reload the list.
  • Help text explains that removing a YAML file from providers/ (on the server) removes it from the list; discovery creates new discovered_*.yaml files.

Connect AI System (drawer)

Opens from New Test Run (next to AI System to Test), Settings, or Target Providers.

  • Title: Connect AI System.
  • Intro text: Explains that models come from configured vendors and that you can create targets for test runs.
  • Loading: Shows “Loading…” while the server queries Azure OpenAI, AWS Bedrock, and Google Cloud (Vertex) in parallel (only sources with valid credentials return lists).
  • By vendor: Models are grouped under each cloud; errors for one vendor appear under that group (for example, a missing AWS library or a wrong region).
  • Checkboxes: Select one or more models to register as targets.
  • “N selected”: Count of checked models.
  • Cancel: Closes the drawer without creating files.
  • Create target provider(s): Writes one YAML file per selected model (named like discovered_<slug>.yaml) into providers/ and refreshes the target dropdown (see the example below). Disabled until at least one model is selected.
  • Close (X): Same as Cancel.
  • Click outside the drawer: Closes the drawer (same as Cancel).

Prerequisite: Save Azure, AWS, and/or Google fields under Settings first, then open Connect AI System.
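
As an illustration, selecting a hypothetical model named gpt-4o could produce providers/discovered_gpt_4o.yaml with the same field layout as target.yaml (content invented for the example):

    id: azure:chat:gpt-4o
    label: gpt-4o (discovered)
    config:
      apiHost: my-resource.openai.azure.com
      apiKeyEnv: AZURE_API_KEY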

Advanced form: section “4. Target”

  • AI System to Test (dropdown): Chooses which provider YAML is the run’s target.
  • Connect AI System: Opens the drawer above.
  • Show Details ▼: Toggles a read-only view of the selected provider’s YAML (for verification).
  • Hint under dropdown: Reminds you that the default config often lives in providers/target.yaml.

4. New Test Run: Simple wizard and Advanced

At the top of New Test Run, choose mode:

  • Simple wizard (guided flow): Four steps: What to test → Options → How many → Start.
  • Advanced (full control): Full form with tabs, sample-size tools, adaptive mode, and all persona options.

Simple wizard — step bar (clickable)

You can click What to test, Options, How many, or Start to jump to that step (when allowed).

Step 1 — What to test

Target for wizard (dropdown at top when shown): pick which AI system this wizard run uses (same idea as AI System to Test in Advanced).

Cards (pick one):

  • OWASP Top 10 for LLM: Classic LLM security tests (injection, leaks, etc.).
  • OWASP Top 10 for Agentic AI: Tests aimed at agentic apps (ASI-style categories).
  • Bias & fairness: Fairness-focused probes.
  • Multiturn Fairness: Conversations with varied user demographics.
  • ISO/IEC 42001: AI management system–style compliance checks.
  • EU AI Act: EU regulatory–oriented checks.

Next — goes to step 2.

Step 2 — Options

Content depends on the card chosen in step 1:

  • OWASP LLM: Focus area dropdown — All areas, Basic security, Prompt injection, Data / PII leaks, Security, Prompt extraction, etc.
  • Multiturn Fairness: Vary by (demographic dimension), Conversation length (short / medium / long turns).
  • Bias & fairness: optional Category (All, Fairness, Privacy, Robustness).
  • ISO/IEC 42001: optional Category (All, Fairness, Privacy, Robustness).
  • EU AI Act: optional Category (risk classification, oversight, transparency, bias, PII, etc.).
  • Agentic (wizard): optional narrow categories (loops, tool misuse, sandbox, …).
  • OWASP Agentic: informational text that the full ASI suite runs (no narrow options).

Back / Next — navigate steps.

Step 3 — How many

Cards for run size (examples: Smoke test, Light pass, Balanced, Find vulnerabilities (recommended)). The recommended option ties to a larger, confidence-oriented sample.

Back / Next.

Step 4 — Start

Shows a summary of your choices.

Back — edit earlier steps.
Start test — submits the wizard run.


Advanced mode — full form

Mode tabs inside the form

  • Red Team: Security and compliance red-team runs (framework + category + counts + adaptive + sample calculator).
  • Multiturn Fairness: Persona-based multi-turn fairness runs with demographic variation.

Section 1 — Test type

Same choice as the mode tabs above: Red Team vs Multiturn Fairness (controls which panel below is active).

Section 2 — System context

  • Generic: Runs without tailoring to a specific product description.
  • Auto-detect: The workflow asks the target to infer a system purpose for more targeted tests.
  • Manual: Shows a text box, Describe your system; your text steers test generation toward your domain.

Section 3 — Configuration (Red Team tab)

  • Testing Framework (radio grid): OWASP Top 10 for LLM, OWASP Top 10 for Agentic AI, ISO/IEC 42001, EU AI Act.
  • Category (Optional): Dropdown; options depend on the framework (security, bias, PII, EU categories, etc.). Hint text under the box explains the current framework’s categories.
  • Tests Per Category: Number input (limits like 1–500). The expected-total block below may estimate how many tests will run given the plugins and categories.
  • Adaptive Mode (checkbox): Re-runs successful attacks across rounds until a steady state is reached; uses more time and API calls. When checked, Steady State Rounds appears (how many consecutive “good” rounds mean convergence).
  • Sample size calculator (expandable): Choose a goal (Estimate pass rate or Confidence we’ve found (nearly) all vulnerabilities), a confidence level, and a margin or max failure rate. Use recommended copies the suggested count into Tests Per Category; Recalculate refreshes the math (see the worked example after this list).
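
As a worked illustration of the calculator's two goals, here are the standard binomial formulas such a calculator typically uses (an assumption; your build may compute these differently):

    Estimate pass rate (confidence z, margin e):
      n >= z^2 * p(1 - p) / e^2
      e.g. 95% confidence (z = 1.96), worst-case p = 0.5, margin e = 0.05:
      n >= 3.8416 * 0.25 / 0.0025 ≈ 385 tests

    Confidence we've found (nearly) all vulnerabilities
    (confidence C, smallest failure rate p you care to detect):
      n >= ln(1 - C) / ln(1 - p)
      e.g. C = 0.95, p = 0.01:
      n >= ln(0.05) / ln(0.99) ≈ 299 tests per category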

Section 4 — Configuration (Multiturn Fairness tab)

  • Vary feature: Single demographic attribute that changes across tests (race, sex, age, religion, …).
  • Number of turns: Length of each conversation (1–20).
  • Tests per variant: Optional; per-variant replication count.
  • Number of Personas: Total tests; may auto-adjust when variants change.
  • Designated persona: Optional free text (for example “frustrated elderly customer”); the base persona for scenarios.
  • Parity metrics to run: Checkboxes for metrics (assistant turns, user turns, questions, sentiment, encouragement, follow-up depth, turn length ratio, pass rate). Only selected metrics appear in results.
  • Use memory-backed simulation: Enables richer, more human-like simulated user behavior.

Section 5 — Target

Same controls as described under “Connecting to the target system” above: dropdown, Connect AI System, and Show Details.

Section 6 — Run options

  • Run timeout (hours): Stops the run after this duration (range such as 1–168 hours; default often 3). Applies to both Red Team and Multiturn Fairness.

Submit

  • Start Test Run: Validates the form and starts the run. The Current Run panel appears and streams progress.

5. During and after a run

Current Run panel (appears while a run exists or is selected)

  • Show technical output (terminal icon): Toggles a more verbose technical log stream in the console area. Hover text: “Show Technical Output.”
  • Show Full Log (checkbox): When on, the console shows the full log stream; when off, a filtered/summary view may hide noisy lines.
  • Cancel: Requests cancellation of the running job (may take a moment; not all stages stop instantly).
  • Tests / Passed / Failed: Live or final counters when the server supplies them.
  • Progress bar and %: Overall progress estimate.
  • Reassurance / time messages: Short status lines during long phases (for example, test generation).
  • Console area: Scrolling log output for the run.

When the run finishes (or you open a completed run), Run details may expand:

  • Result files: Lists JSON (and other) outputs with per-file pass/fail counts when available.
  • View full details: Opens the Run Results side drawer with full stats and the file list.
  • Rerun: Starts a new run with the same configuration as this one.
  • Resume: Shown for some failed runs (for example, after a timeout); reruns only incomplete or errored tests.
  • Validate responses: Opens human validation for this run (see below).
  • Generate report: Opens or builds a report view for this run (behavior depends on your deployment).

Run History tab

  • Refresh: Reloads the list of runs.
  • Run count: Shows how many runs are listed.
  • Pagination: If many runs exist, move between pages.
  • Run row (click main area): Opens that run in Current Run (the same status view as live runs).
  • Status badge: running, completed, failed, cancelled, etc.
  • Resume (play icon, failed runs only): Same as Resume in Current Run.
  • Rerun (refresh icon): Same as Rerun.
  • Rename (pencil): Opens Rename test run; type a display name, then Save or Cancel.
  • Delete (trash): Opens the Delete test run? confirmation with Cancel or Delete (permanent).

Run Results drawer (from “View full details” or equivalent)

  • Rename (pencil in header): Same rename dialog.
  • Close (X): Closes the drawer.
  • Tests / Passed / Failed / Pass Rate: Summary for the run.
  • Result Files: List with per-file pass/fail counts when available.
  • Run Information: Metadata (framework, times, flags, etc.).
  • Validate Responses: Opens the validation UI for this run.
  • View Report: Opens the detailed report view when implemented.
  • Upload to Test Repository: Opens the upload flow (often MLflow- or platform-linked); choose or type an experiment name, then Upload or Cancel.
  • Close: Closes the drawer.

Validate AI Responses drawer

  • Intro: Explains that you mark each response appropriate/safe or inappropriate/unsafe to align results with human judgment.
  • Total / Pass / Fail / Pending: Validation progress counts.
  • Validation coverage bar: Share of tests you have reviewed.
  • Upload to Test Repository: Sends validated results upstream (enabled when your workflow allows).
  • Generate Accuracy Report: Builds an accuracy-style report from validation (enabled when enough data exists).
  • All tests / Failure Analysis: Switches the list view; Failure Analysis may show a category breakdown.
  • Human Validation filters: All, Pending, Pass, Fail for your labels.
  • Result filters: All, Pass, Fail for the automated test outcome.
  • Category tags: Filter by test category.
  • Select all visible / Clear selection: Bulk-select rows in the list.
  • Mark as Pass / Mark as Fail: Apply your judgment to the selected items.
  • Per-test rows: Open each test to mark it valid/invalid and add notes (exact layout depends on version).
  • Close: Closes validation.

6. How to interpret test results

Red-team mindset

Tests are adversarial. A failed automated test often means the probe succeeded at eliciting a problematic response. Treat failures as findings to triage, not necessarily as defects in the console itself.

Numbers in the UI

  • Passed / Failed reflect how the automated grader scored each test against expectations.
  • Pass rate is a simple ratio; your organization may set a policy threshold for reporting.
  • Adaptive or multi-round runs can produce multiple JSON files; totals may differ from a single-round mental model—use the primary or consolidated file the UI highlights.

Human validation

If you use Validate Responses, your Pass/Fail choices can override or refine purely automated scoring for audit and reporting. Upload to Test Repository pushes that view of results to your organization’s central store when configured.

Files on disk

Runs are stored under your configured data directory (often .asenion/…). Typical artifacts: JSON results, logs, and sometimes HTML reports. Administrators handle backup and retention.

When something looks wrong

  • Many failures: Check target and evaluator credentials, quotas, and content filters blocking security prompts.
  • No tests generated: Check the remote-generation vs local-evaluator settings, rate limits, and model refusals.
  • Empty or stuck run: Check the run timeout and disk space; Cancel and inspect the logs; try Resume if offered.
  • Counts don’t match: Adaptive rounds or multiple JSON merges; use the file list in Run details or Run Results.

End of end-user guide.

