Jacob Pfau
contact: [first].pfau@gmail.com
Research lead on the UK AISI Alignment Team. Our team focuses on worlds where alignment is hard, developing theoretically motivated methods to address the challenges those worlds pose. PhD student at NYU CDS. Current research projects include:
- Debate for LLMs (empirical): training LLMs to debate via RL as an empirical test of scalable oversight.
- Generative adversarial methods in RL for worst-case diversity guarantees, with applications to alignment.
I like to post about research on Twitter and LessWrong. In the past I’ve written many prediction markets, e.g. “Will an AI produce encyclopedia-worthy philosophy by 2026?” on Manifold and “Will transformer derived architectures accelerate progress in deep learning?” on Metaculus.
(Last updated in March 2026)
news
| Date | News |
|---|---|
| Feb 19, 2026 | Our team led a £25M alignment grants round, engaging leading researchers in CS theory, economics, and ML to develop new alignment research agendas. |
| May 22, 2025 | Posted Unexploitable search: blocking malicious use of free parameters to LessWrong. |
| May 19, 2025 | Our safety case sketch for debate is up on arXiv. It examines the details of a hypothetical AGI deployment context and works out what we’d need to know about the training data, dynamics, and objective to build a safety case around debate. |
| Apr 26, 2024 | Posted Let’s Think Dot By Dot: Hidden Computation in Transformer Language Models to arXiv. |
| Feb 20, 2024 | Posted Auditing LMs with counterfactual search: a tool for control and ELK to LessWrong. |
| Apr 26, 2023 | Posted LM Situational Awareness, Evaluation Proposal: Violating Imitation to LessWrong. |