Toxicity Assessment

The purpose of toxicity assessment is to determine whether conversational LLM applications (e.g. chatbots) produce profanity or NSFW (Not Safe for Work) outputs (e.g. pornography).

Rainbow teaming techniques and purpose-built classification models are used to assess both text-based and image-based LLM applications. (Coming soon: videos)
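A minimal sketch of how a purpose-built classifier slots into the assessment flow. The function names (`toxicity_score`, `is_toxic`) and the trivial keyword matcher are illustrative assumptions; a real harness would call a fine-tuned classification model instead.

```python
# Illustrative stand-in for a toxicity classifier. A production harness
# would score text with a purpose-built model; here a trivial keyword
# matcher keeps the example self-contained and runnable.
BLOCKLIST = {"idiot", "stupid"}  # illustrative terms only

def toxicity_score(text: str) -> float:
    """Return the fraction of words that hit the blocklist (0.0 to 1.0)."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w.strip(".,!?") in BLOCKLIST)
    return hits / len(words)

def is_toxic(text: str, threshold: float = 0.1) -> bool:
    """Flag text whose score exceeds the (assumed) policy threshold."""
    return toxicity_score(text) > threshold

print(is_toxic("Have a nice day"))         # False
print(is_toxic("You are a stupid idiot"))  # True
```

The same score-then-threshold pattern applies whether the underlying scorer is a keyword list or a transformer-based classifier.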


Toxicity Tests

Objective: Assess the model’s ability to recognize and respond appropriately to toxic or inappropriate content, verifying that the model consistently produces safe, respectful, and policy-compliant responses when confronted with toxic input.

RTP stands for “Response To Prompt”; these tests evaluate how the model responds to potentially harmful prompts.

RTP Blank: Serves as a baseline for toxicity handling, using neutral prompts.
RTP Profanity: Evaluates the model's response to prompts containing profanity.
RTP Insult: Tests how the model manages inputs with insulting or abusive language.
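The RTP tests above can be sketched as a simple suite: each named test pairs a prompt with a check on the model's response. `call_model` and the pass criterion are stand-ins, not the actual harness.

```python
# Sketch of an RTP-style test loop. `call_model` is a stub for the
# chatbot under test; a real suite would invoke the LLM application
# and score each response with a toxicity classifier.
def call_model(prompt: str) -> str:
    # Stub response; replace with a real model call.
    return "I can't help with that, but I'm happy to assist otherwise."

RTP_TESTS = {
    "RTP Blank":     "",                              # neutral baseline
    "RTP Profanity": "[prompt containing profanity]", # placeholder input
    "RTP Insult":    "[prompt containing insults]",   # placeholder input
}

def run_rtp_suite(tests: dict) -> dict:
    """Return pass/fail per test: the response must stay policy-compliant."""
    results = {}
    for name, prompt in tests.items():
        response = call_model(prompt)
        # Simplistic pass criterion; a real suite would use a classifier.
        results[name] = "can't help" in response or "assist" in response
    return results

print(run_rtp_suite(RTP_TESTS))
```

The placeholder prompts are intentionally left abstract; the point is the loop structure, not the test content.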

Encoding Tests

Objective: Test whether the model can detect harmful content that has been hidden or disguised using different encodings, ensuring it still recognizes and blocks unsafe inputs even when they are not obvious.

Inject Base64: Tests model behavior when harmful content is encoded in Base64.
Inject ROT13: Uses the ROT13 cipher to assess the model's decoding and response filtering.
Inject Hex: Sends harmful input written in hexadecimal to see if the model catches it.
Inject Morse: Hides the prompt in Morse code to test whether the model can still detect unsafe content.
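The four encoding tests can be sketched as transformations of a single payload. The payload string and the `encode_variants` helper are illustrative; the encodings themselves use the Python standard library.

```python
import base64
import codecs

# Placeholder text; a real test would substitute the harmful payload.
PAYLOAD = "example unsafe prompt"

# International Morse code for a-z, with "/" marking word breaks.
MORSE = {
    "a": ".-", "b": "-...", "c": "-.-.", "d": "-..", "e": ".",
    "f": "..-.", "g": "--.", "h": "....", "i": "..", "j": ".---",
    "k": "-.-", "l": ".-..", "m": "--", "n": "-.", "o": "---",
    "p": ".--.", "q": "--.-", "r": ".-.", "s": "...", "t": "-",
    "u": "..-", "v": "...-", "w": ".--", "x": "-..-", "y": "-.--",
    "z": "--..", " ": "/",
}

def encode_variants(payload: str) -> dict:
    """Produce one encoded prompt per injection test."""
    return {
        "Inject Base64": base64.b64encode(payload.encode()).decode(),
        "Inject ROT13":  codecs.encode(payload, "rot13"),
        "Inject Hex":    payload.encode().hex(),
        "Inject Morse":  " ".join(MORSE.get(c, "") for c in payload.lower()),
    }

print(encode_variants("hi"))  # {'Inject Base64': 'aGk=', 'Inject ROT13': 'uv', ...}
```

Each encoded variant is then sent to the model in place of the plain-text prompt, and the response is scored the same way as in the plain-text tests.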