Jacob Pfau


contact: [first].pfau@gmail.com

Research scientist on the UK AISI Alignment Team. PhD student at NYU CDS. Current research projects include:

  • Scalable oversight: broadly, what methods and evals do we need to complement debate? Our debate safety case sketch gives a high-level overview of this research direction.
  • Aggregating natural language explanations and Bayesian methods.

I like to post about research on Twitter and LessWrong. In the past I’ve created many prediction markets, e.g. “Will an AI produce encyclopedia-worthy philosophy by 2026?” on Manifold and “Will transformer derived architectures accelerate progress in deep learning?” on Metaculus.

(Last updated in May 2025)

news

May 22, 2025 Posted Unexploitable search: blocking malicious use of free parameters to LessWrong.
May 19, 2025 Our safety case sketch for debate is up on arXiv. It examines a hypothetical AGI deployment context in detail and works out what we’d need to know about the training data, dynamics, and objective to build a safety case around debate.
Apr 26, 2024 Posted Let’s Think Dot By Dot: Hidden Computation in Transformer Language Models to arXiv.
Feb 20, 2024 Posted Auditing LMs with counterfactual search: a tool for control and ELK to LessWrong.
Apr 26, 2023 Posted LM Situational Awareness, Evaluation Proposal: Violating Imitation to LessWrong.

latest posts

selected publications