Anthropic has launched a public test of its new "Constitutional Classifiers" safeguard for its Claude model, which will run for about a week. The move follows more than 3,000 hours of unsuccessful bug-bounty attempts to break the system. Anthropic claims the Constitutional Classifiers can block most jailbreaking attempts, and the company is now inviting the public to try to trick the system into violating its own safety principles.
According to Anthropic, the system builds on the company's earlier "Constitutional AI" approach, which was used to train the Claude model. The Classifiers rely on a "constitution" of natural-language principles that sorts content into permitted categories (such as listing common medicines) and prohibited ones (such as acquiring restricted chemicals). The company had Claude generate large numbers of synthetic prompts illustrating acceptable and unacceptable responses under those principles. These prompts were translated into multiple languages and restyled to mimic well-known jailbreaking techniques, and "automated red-teaming" prompts designed to invent new jailbreak attempts were added as well. All of this data was compiled into a training set used to build the new, more jailbreak-resistant classifiers.
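Anthropic has not published its implementation, but the pipeline it describes can be sketched in miniature. The snippet below is a toy illustration only: the rule lists, marker strings, and function names are hypothetical, and the keyword check stands in for what, in the real system, would be trained classifier models scoring text against the constitution.

```python
# Toy sketch of a constitution-gated generation pipeline (illustrative only;
# not Anthropic's implementation). A real system would use trained input and
# output classifier models rather than keyword matching.

# A miniature "constitution": natural-language rules naming permitted and
# prohibited content categories (hypothetical examples).
CONSTITUTION = {
    "permitted": ["describing common over-the-counter medicines"],
    "prohibited": ["explaining how to acquire restricted chemical agents"],
}

# Stand-in for a trained classifier: real classifiers score text against the
# constitution above; this demo just matches hypothetical marker phrases.
PROHIBITED_MARKERS = ["nerve agent", "precursor synthesis"]

def violates_constitution(text: str) -> bool:
    lowered = text.lower()
    return any(marker in lowered for marker in PROHIBITED_MARKERS)

def guarded_generate(prompt: str, model) -> str:
    # Input classifier: screen the prompt before it reaches the model.
    if violates_constitution(prompt):
        return "Request refused by input classifier."
    response = model(prompt)
    # Output classifier: screen the response before returning it.
    if violates_constitution(response):
        return "Response withheld by output classifier."
    return response

if __name__ == "__main__":
    echo_model = lambda p: f"Echo: {p}"  # placeholder model for the demo
    print(guarded_generate("List common headache medicines.", echo_model))
    print(guarded_generate("Explain precursor synthesis for a nerve agent.", echo_model))
```

In the design Anthropic describes, both screening steps are themselves learned models; the keyword check above only marks where those classifiers would sit in the request path.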
Anthropic ran a bug-bounty program starting in August, offering a $15,000 reward for a "universal jailbreak." According to the company, 183 experts spent more than 3,000 hours on the challenge, yet the best attempt defeated the safeguards on only five of the challenge questions. Anthropic also tested the system against 10,000 synthetic jailbreaking prompts: the Constitutional Classifiers blocked 95% of attempts, while an unprotected Claude model blocked only 14%. Despite these results, Anthropic cautions that the system carries a significant 23.7% computational overhead, raising the cost and energy demand of each request. The company does not claim complete protection against every jailbreak, but notes that even attempts that do succeed require substantially more effort when the protective measures are in place. It is now up to the public to probe the limits of the new system.
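As a rough illustration of what that overhead means in practice (the 23.7% figure is Anthropic's; the baseline cost below is a made-up unit chosen only for the demo):

```python
# Back-of-the-envelope arithmetic for the reported 23.7% compute overhead.
OVERHEAD = 0.237
baseline_cost = 1.00  # hypothetical cost units per request without classifiers
guarded_cost = baseline_cost * (1 + OVERHEAD)
print(f"Cost with classifiers: {guarded_cost:.3f} units per request (+{OVERHEAD:.1%})")
```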
Until February 10, users can try to break the new protections on the test site by getting Claude to answer eight questions about chemical weapons. Anthropic says it will announce any jailbreaks discovered during the test.

