www.lesswrong.com
TurnTrout's shortform feed - LessWrong
Comment by TurnTrout - Call for alpha testers for an AI control/security tool. A ton of alignment researchers YOLO their Claude usage right now. We run Claude on our computers without real protection (beyond, perhaps, auto mode), and there isn't an easy way to comply with known best practices. I wrote claude-guard, a wrapper to make best practices easy: just install it, and your future Claude sessions are protected.
Smart misaligned AI will target alignment researchers in particular for research sabotage, for example by:
1. Sabotaging their machines to delete key work later (claude-guard protects via sandboxing that goes beyond Claude Code's Bash-tool-only sandbox),
2. Poisoning training data using invisible Unicode characters (the tool sanitizes inputs and outputs),
3. Removing traces of its bad behavior (I provide a tamper-evident audit log outside of the sandbox).
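To make items 2 and 3 concrete, here is a minimal Python sketch of the two underlying ideas: stripping invisible Unicode characters and keeping a hash-chained log. This is not claude-guard's actual code. The function names (`strip_invisible`, `append_entry`, `verify`) and the log layout are hypothetical, chosen only for illustration, and the real tool may work quite differently.

```python
import hashlib
import unicodedata

# Hypothetical helpers for illustration; not claude-guard's real API.

def strip_invisible(text: str) -> str:
    """Drop Unicode format characters (general category Cf), such as
    zero-width spaces, bidi overrides, and tag characters."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

def _digest(prev_hash: str, event: str) -> str:
    # prev_hash is always 64 hex characters, so the concatenation is unambiguous.
    return hashlib.sha256((prev_hash + event).encode("utf-8")).hexdigest()

def append_entry(log: list, event: str) -> None:
    """Append an event whose hash commits to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)})

def verify(log: list) -> bool:
    """Recompute the chain; any edited or removed interior entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True

# Usage: a zero-width space and a right-to-left override are removed.
clean = strip_invisible("a\u200bb\u202ec")

audit_log = []
append_entry(audit_log, "session started")
append_entry(audit_log, "ran: ls")
intact = verify(audit_log)

audit_log[0]["event"] = "nothing happened"  # simulate tampering
tampered_detected = not verify(audit_log)
```

Two caveats on this sketch. Filtering on category Cf also removes zero-width joiners, which some emoji sequences and scripts legitimately need, and it does not cover every hiding technique (variation selectors, for example, fall in a different category). And a hash chain is only tamper-evident if the latest hash is stored somewhere the sandboxed process cannot write, since anyone able to rewrite the whole log could recompute the chain; that is presumably why the post keeps the log outside the sandbox.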
My goal is a low-latency, low-fuss product with easy-to-understand restrictions which can be minimally eased to enable e.g. a RunPod workflow. The ambitious vision is for claude-guard to become a staple open-source tool for AI alignment work, with people contributing improvements using the latest in AI control protocols. [1]
The tool is ready to use but has a bunch of rough edges. Please give it a try and open issues on GitHub: https://github.com/alexander-turner/claude-guard, or email me at alex@turntrout.com (I generally don't read comments). If you run into a blocking issue, please come back a day or two later and hopefully it will be fixed.
[1] This work should complement Apollo's Watcher, currently a monitoring product.