Jacob Pfau


contact: [first].pfau@gmail.com

Research lead on the UK AISI Alignment Team. Our team focuses on worlds where alignment is hard, developing theoretically motivated methods to address these challenges. PhD student at NYU CDS. Current research projects include:

  • Debate for LLMs (empirical): training LLMs to debate via RL as an empirical test of scalable oversight.
  • Generative adversarial methods in RL for worst-case diversity guarantees, with applications to alignment.

I like to post about research on Twitter and LessWrong. In the past I’ve created many prediction markets, e.g. “Will an AI produce encyclopedia-worthy philosophy by 2026?” on Manifold, and “Will transformer derived architectures accelerate progress in deep learning?” on Metaculus.

(Last updated in March 2026)

news

Feb 19, 2026 Our team led a £25M alignment grants round, engaging leading researchers in CS theory, economics, and ML to develop new alignment research agendas.
May 22, 2025 Posted Unexploitable search: blocking malicious use of free parameters to LessWrong.
May 19, 2025 Our safety case sketch for debate is up on arXiv. It examines the details of a hypothetical AGI deployment context and works out what we’d need to know about the training data, dynamics, and objective to build a safety case around debate.
Apr 26, 2024 Posted Let’s Think Dot By Dot: Hidden Computation in Transformer Language Models to arXiv.
Feb 20, 2024 Posted Auditing LMs with counterfactual search: a tool for control and ELK to LessWrong.
Apr 26, 2023 Posted LM Situational Awareness, Evaluation Proposal: Violating Imitation to LessWrong.

latest posts

selected publications