Jacob Pfau
contact: [first].pfau@gmail.com
Cofounder and Scalable Oversight lead at Sequent, a nonprofit research organization pursuing scale and automation for higher confidence in AI alignment.
Previously: PhD at NYU CDS, and research lead on the UK AISI Alignment Team. Ongoing research interests include:
- Debate for LLMs (empirical): training LLMs to debate via RL as an empirical test of scalable oversight.
- Generative adversarial methods in RL for worst-case diversity guarantees, with applications to alignment.
I like to post about research on Twitter and LessWrong. In the past I’ve written many prediction markets on Manifold.
news
| Date | News |
|---|---|
| Feb 19, 2026 | Our team led a £25M alignment grants round, engaging leading researchers in CS theory, economics, and ML to develop new alignment research agendas. |
| May 22, 2025 | Posted Unexploitable search: blocking malicious use of free parameters to LessWrong. |
| May 19, 2025 | Our safety case sketch for debate is up on arXiv. It examines the details of a hypothetical AGI deployment context and works out what we’d need to know about the training data, dynamics, and objective to build a safety case around debate. |
| Apr 26, 2024 | Posted Let’s Think Dot By Dot: Hidden Computation in Transformer Language Models to arXiv. |
| Feb 20, 2024 | Posted Auditing LMs with counterfactual search: a tool for control and ELK to LessWrong. |
| Apr 26, 2023 | Posted LM Situational Awareness, Evaluation Proposal: Violating Imitation to LessWrong. |