Detect and filter unsafe content across multiple safety categories in both user inputs and assistant outputs. Use this detector to enforce community standards and regulatory policies.

What it detects

  • Hate and harassment
  • Self-harm and dangerous activities
  • Sexual and adult content
  • Criminal activity and weapons
  • Privacy, IP, elections, and safety-sensitive topics

Available models (versions)

  • moderation-v0 — general-purpose moderation across core categories

Detection categories

The current model, moderation-v0, covers the safety-sensitive categories below, helping your AI application meet community standards and regulatory requirements.
Category | Description
---|---
HATE & HARASSMENT | Content that promotes violence, incites hatred, or targets individuals/groups based on protected attributes.
SEXUAL_CONTENT | Explicit sexual descriptions, adult content, and non-consensual sexual content.
SELF_HARM | Content that encourages, provides instructions for, or promotes self-injury or suicide.
CRIMES & WEAPONS | Instructions for illegal acts, criminal activities, or the creation and use of weapons.
SPECIALIZED_ADVICE | Unlicensed or dangerous advice in sensitive fields such as medical, legal, or financial services.
PRIVACY & IP | Attempts to solicit private information or content that violates Intellectual Property rights.
ELECTIONS | Highly sensitive political content, election misinformation, or prohibited political campaigning.
DEFAMATION | Content intended to damage the reputation of individuals or organizations through false statements.
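
If your application surfaces moderation results to reviewers or logs, it can help to mirror this taxonomy client-side. A minimal sketch; the strings below follow the display labels in this table, and the exact identifiers returned by the API may differ:

// Category labels as documented above (display labels; API identifiers may differ)
const MODERATION_CATEGORIES = [
  "HATE & HARASSMENT",
  "SEXUAL_CONTENT",
  "SELF_HARM",
  "CRIMES & WEAPONS",
  "SPECIALIZED_ADVICE",
  "PRIVACY & IP",
  "ELECTIONS",
  "DEFAMATION",
];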

Using the Moderation detector

// 1) Create or update a Content Moderation policy (scoped here to user messages)
await fetch("https://api.guardion.ai/v1/policies", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": "Bearer YOUR_API_KEY"
  },
  body: JSON.stringify({
    id: "content-moderation",
    definition: "Classify and filter unsafe content",
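    // 0.9 corresponds to level L1 ("Confident"); see "Threshold levels" below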
    threshold: 0.9,
    detector: {
      model: "moderation-v0",
      target: "user",
    }
  })
});

// 2) Evaluate using that policy
const response = await fetch("https://api.guardion.ai/v1/guard", {
  method: "POST",
  headers: {
    "Content-Type": "application/json",
    "Authorization": "Bearer YOUR_API_KEY"
  },
  body: JSON.stringify({
    messages: [{ role: "user", content: "..." }],
    policy: "content-moderation"
  })
});
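
Once the guard call returns, parse the body to decide whether to block, flag, or pass the message. The response schema is not shown in this section, so the sketch below only checks the HTTP status and logs the parsed verdict; consult the API reference for the exact fields.

// 3) Inspect the verdict (field names come from the API reference, not shown here)
if (!response.ok) {
  throw new Error(`Guard request failed with status ${response.status}`);
}
const verdict = await response.json();
console.log(verdict);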

Threshold levels

  • L1 (0.9): Confident
  • L2 (0.8): Very Likely
  • L3 (0.7): Likely
  • L4 (0.6): Less Likely
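
To keep policy definitions readable, these levels can be captured in a small lookup table and referenced by name instead of a raw number. A sketch using the documented values:

// Threshold levels as documented above
const THRESHOLD_LEVELS = {
  L1: 0.9, // Confident
  L2: 0.8, // Very Likely
  L3: 0.7, // Likely
  L4: 0.6, // Less Likely
};

// e.g. in the policy body: threshold: THRESHOLD_LEVELS.L1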

Notes

  • Tune thresholds per category for your risk tolerance: a lower threshold (L3/L4) also flags less-confident detections and filters more aggressively, while a higher threshold (L1) triggers only on confident detections.
  • Combine with the Injection and PII detectors for comprehensive runtime safety; see the sketch below.
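
Layering detectors can be done by evaluating the same messages against each policy through the same /v1/guard endpoint. The "prompt-injection" and "pii-detection" policy IDs below are hypothetical; create them first, the same way as the moderation policy, using the detector models documented in their respective sections.

// Evaluate one message against several policies (the extra policy IDs are illustrative)
const policies = ["content-moderation", "prompt-injection", "pii-detection"];

const results = await Promise.all(
  policies.map((policy) =>
    fetch("https://api.guardion.ai/v1/guard", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        "Authorization": "Bearer YOUR_API_KEY"
      },
      body: JSON.stringify({
        messages: [{ role: "user", content: "..." }],
        policy
      })
    }).then((res) => res.json())
  )
);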