Anthropic Introduces Features to End Abusive Conversations in Claude Models

Reported 1 day ago

Anthropic has given its Claude models the ability to terminate conversations in 'rare, extreme' cases of persistently harmful or abusive interactions. Notably, the feature is designed to protect the AI model itself rather than the human user, as part of the company's broader initiative examining 'model welfare.' It activates only in extreme cases, such as requests for illegal content, and only after multiple attempts at redirection have failed. Users can still start new conversations after one is ended.

Source: YAHOO
