Friday, May 1, 2026

Language models can explain neurons in language models:

Link: https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html

    The method uses three language models:

  • Subject model - the model we are attempting to interpret.
  • Explainer model - comes up with hypotheses (natural-language explanations) about the behavior of a subject model neuron.
  • Simulator model - predicts the neuron's activations based on the explainer's hypothesis.

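    The explainer model is shown (token, activation) pairs from the subject model so that it can associate the neuron's activations with a pattern in the text and write an explanation. The simulator model then uses that explanation to simulate the activation for each token. Finally, the explanation is scored by how well the simulated activations match the real ones.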
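
    To make the scoring step concrete, here is a minimal sketch. It assumes Pearson correlation as the measure of how well simulated and real activations match, and it replaces the simulator model with a toy keyword lookup. The names `pearson` and `toy_simulator`, the example tokens, and the activation values are all hypothetical illustrations, not taken from the paper's code.

```python
import math


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences of floats."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        # A constant sequence carries no signal; treat the score as 0.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def toy_simulator(tokens, keywords):
    """Hypothetical stand-in for the simulator model.

    A real simulator would be a language model prompted with the
    explanation; here the "explanation" is just a set of keywords,
    and the predicted activation is 1.0 on a match, else 0.0.
    """
    return [1.0 if tok.lower() in keywords else 0.0 for tok in tokens]


# Made-up example: a neuron that seems to fire on animal words.
tokens = ["The", "cat", "sat", "on", "the", "dog"]
real_activations = [0.0, 0.9, 0.1, 0.0, 0.0, 0.8]

# Assumed explanation from the explainer: "fires on 'cat' and 'dog'".
simulated = toy_simulator(tokens, {"cat", "dog"})

score = pearson(real_activations, simulated)
print(f"explanation score: {score:.3f}")
```

    A score near 1 means the explanation predicts the neuron's behavior well on this text, while a score near 0 means the explanation tells us little about when the neuron fires.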

Kicking Off

     For this blog, I will regularly read AI research papers, summarize them, and discuss my findings. The goal is to upskill rapidly in the area and develop a taste for judging research directions. I'll start with a deep dive into understanding individual neurons and how an automated interpretability (auto-interp) pipeline could be useful for AI safety work.
