news
May 22, 2025 | Posted Unexploitable search: blocking malicious use of free parameters to LessWrong. |
---|---|
May 19, 2025 | Our safety case sketch for debate is up on arXiv. It examines the details of a hypothetical AGI deployment context and works out what we’d need to know about the training data, dynamics, and objective to build a safety case around debate. |
Apr 26, 2024 | Posted Let’s Think Dot By Dot: Hidden Computation in Transformer Language Models to arXiv. |
Feb 20, 2024 | Posted Auditing LMs with counterfactual search: a tool for control and ELK to LessWrong. |
Apr 26, 2023 | Posted LM Situational Awareness, Evaluation Proposal: Violating Imitation to LessWrong. |