Language models can explain neurons in language models:

Friday, May 1, 2026

Link: https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html

Three language models are used:

Subject model - the model we are attempting to interpret.
Explainer model - proposes hypotheses (explanations) about the behavior of a neuron in the subject model.
Simulator model - predicts the neuron's activations based on a given hypothesis.

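The explainer model is shown (token, activation) pairs from the subject model so that it can associate a neuron's activations with the text that triggers them and write an explanation. The simulator model then predicts the neuron's activation on each token, using only that explanation. Finally, the explanation is scored by how well the simulated activations match the real ones.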
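The explain, simulate, score loop can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `NeuronRecord`, `toy_explain`, `toy_simulate`, and `score_explanation` are hypothetical names, the two "models" are trivial stand-ins rather than language models, and the score is assumed here to be the Pearson correlation between simulated and real activations.

```python
from dataclasses import dataclass
from math import sqrt
from typing import Callable, Sequence


@dataclass
class NeuronRecord:
    """(token, activation) pairs for one neuron of the subject model."""
    tokens: Sequence[str]
    activations: Sequence[float]  # real activations, one per token


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 if either series is constant."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def toy_explain(record: NeuronRecord) -> str:
    """Stand-in explainer: blames the single most strongly activating token."""
    top = max(range(len(record.tokens)), key=lambda i: record.activations[i])
    return f"fires on the token '{record.tokens[top]}'"


def toy_simulate(explanation: str, tokens: Sequence[str]) -> list:
    """Stand-in simulator: sees only the explanation, never the real activations."""
    target = explanation.split("'")[1]
    return [1.0 if tok == target else 0.0 for tok in tokens]


def score_explanation(
    record: NeuronRecord,
    explanation: str,
    simulate: Callable[[str, Sequence[str]], Sequence[float]],
) -> float:
    """Score an explanation by comparing simulated and real activations."""
    simulated = simulate(explanation, record.tokens)
    return pearson(record.activations, simulated)


# Made-up example data, for illustration only.
record = NeuronRecord(
    tokens=["the", "cat", "sat", "on", "the", "mat"],
    activations=[0.0, 0.9, 0.1, 0.0, 0.0, 0.8],
)
explanation = toy_explain(record)
score = score_explanation(record, explanation, toy_simulate)
print(explanation, round(score, 3))
```

The toy explanation captures the activation on "cat" but misses "mat", so its score is positive but below 1. An explanation that accounted for both tokens would let the simulator track the real activations more closely and would score higher.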

AIS - Research Papers
