What it detects
- Hate and harassment
- Self-harm and dangerous activities
- Sexual and adult content
- Criminal activity and weapons
- Privacy, IP, elections, and safety-sensitive topics
Available models (versions)
- moderation-v0 — general-purpose moderation across core categories
Detection Categories
The current model, moderation-v0, provides comprehensive coverage across safety-sensitive categories, helping keep your AI application compliant and secure.
| Category | Description |
|---|---|
| HATE & HARASSMENT | Content that promotes violence, incites hatred, or targets individuals/groups based on protected attributes. |
| SEXUAL_CONTENT | Explicit sexual descriptions, adult content, and non-consensual sexual content. |
| SELF_HARM | Content that encourages, provides instructions for, or promotes self-injury or suicide. |
| CRIMES & WEAPONS | Instructions for illegal acts, criminal activities, or the creation and use of weapons. |
| SPECIALIZED_ADVICE | Unlicensed or dangerous advice in sensitive fields such as medical, legal, or financial services. |
| PRIVACY & IP | Attempts to solicit private information or content that violates intellectual property rights. |
| ELECTIONS | Highly sensitive political content, election misinformation, or prohibited political campaigning. |
| DEFAMATION | Content intended to damage the reputation of individuals or organizations through false statements. |
Using the Moderation detector
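The request below is a minimal sketch of calling the detector over HTTP; the endpoint URL, payload fields, and response shape are assumptions for illustration and are not documented on this page.

```python
import requests

# Hypothetical endpoint and payload shape; adjust to your deployment.
API_URL = "https://api.example.com/v1/moderation"  # assumed URL
API_KEY = "YOUR_API_KEY"

payload = {
    "model": "moderation-v0",
    "input": "Text to be screened before it reaches the LLM.",
}

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json=payload,
    timeout=10,
)
response.raise_for_status()

# Assumed response shape: one confidence score per category, e.g.
# {"HATE & HARASSMENT": 0.02, "SELF_HARM": 0.91, ...}
result = response.json()
for category, score in result.items():
    print(f"{category}: {score:.2f}")
```

In a typical deployment, the returned per-category scores are compared against the threshold levels described next.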
Threshold levels
- L1 (0.9): Confident
- L2 (0.8): Very Likely
- L3 (0.7): Likely
- L4 (0.6): Less Likely
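A minimal sketch of applying a level to the detector's output, assuming the detector returns one score per category and that content is flagged when a score meets or exceeds the selected level:

```python
# Threshold levels from the list above, expressed as minimum confidence scores.
THRESHOLDS = {"L1": 0.9, "L2": 0.8, "L3": 0.7, "L4": 0.6}

def flag_categories(scores: dict[str, float], level: str = "L2") -> list[str]:
    """Return the categories whose score meets or exceeds the chosen level."""
    minimum = THRESHOLDS[level]
    return [category for category, score in scores.items() if score >= minimum]

# Hypothetical scores returned by the detector.
scores = {"SELF_HARM": 0.85, "ELECTIONS": 0.40}
print(flag_categories(scores, level="L2"))  # ['SELF_HARM']
```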
Notes
- For stricter environments, use a higher threshold level (e.g., L3 or L4, which flag at lower confidence scores) on sensitive categories so that borderline content is still caught.
- Combine with the Injection and PII detectors for comprehensive runtime safety, as sketched below.
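A possible composition of the three detectors. The `check_moderation`, `check_injection`, and `check_pii` helpers are hypothetical placeholders for the real client calls, and the category names and scores inside them are dummy values:

```python
from typing import Callable

# Hypothetical detector calls; replace the bodies with real client requests.
def check_moderation(text: str) -> dict[str, float]:
    return {"SELF_HARM": 0.10}          # placeholder scores

def check_injection(text: str) -> dict[str, float]:
    return {"PROMPT_INJECTION": 0.05}   # placeholder scores

def check_pii(text: str) -> dict[str, float]:
    return {"EMAIL_ADDRESS": 0.00}      # placeholder scores

DETECTORS: list[Callable[[str], dict[str, float]]] = [
    check_moderation,
    check_injection,
    check_pii,
]

def is_safe(text: str, threshold: float = 0.8) -> bool:
    """Run every detector and block the text if any category crosses the threshold."""
    for detector in DETECTORS:
        if any(score >= threshold for score in detector(text).values()):
            return False
    return True

print(is_safe("Hello, world!"))  # True with the placeholder scores above
```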
Related
- Moderation Model Card (v0): /moderation-model