Language models can explain neurons in language models:
Link: https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html
There are three language models used:
- Subject model - the model we are attempting to interpret.
- Explainer model - proposes hypotheses (natural-language explanations) about the subject model's behavior.
- Simulator model - predicts activations based on a hypothesis.
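The explainer model is shown (token, activation) pairs from a neuron in the subject model, so that it can associate tokens with high activations and write an explanation of what the neuron responds to. The simulator model then uses only that explanation to predict the neuron's activation on each token. Finally, the explanation is scored by how well the simulated activations match the real ones.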
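The scoring step can be illustrated with a small sketch. It assumes the score is the Pearson correlation between real and simulated activations, which is one reasonable way to measure the match. The names `correlation_score` and `toy_simulator` are hypothetical, the activation values are made up for illustration, and the keyword-based simulator stands in for a real language model.

```python
import math


def correlation_score(real, simulated):
    """Pearson correlation between real and simulated activations."""
    if len(real) != len(simulated) or len(real) < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_r = sum(real) / len(real)
    mean_s = sum(simulated) / len(simulated)
    cov = sum((r - mean_r) * (s - mean_s) for r, s in zip(real, simulated))
    var_r = sum((r - mean_r) ** 2 for r in real)
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    # A constant sequence carries no signal, so treat the score as 0.
    if var_r == 0 or var_s == 0:
        return 0.0
    return cov / math.sqrt(var_r * var_s)


def toy_simulator(tokens, keywords):
    """Hypothetical stand-in for the simulator model.

    Pretends the explanation is "fires on these keywords" and predicts
    activation 1.0 on matching tokens, 0.0 elsewhere.
    """
    return [1.0 if t.lower() in keywords else 0.0 for t in tokens]


# Made-up example: a neuron that seems to respond to animal words.
tokens = ["The", "cat", "sat", "near", "a", "dog"]
real_activations = [0.0, 0.9, 0.1, 0.0, 0.0, 0.8]

simulated = toy_simulator(tokens, {"cat", "dog"})
score = correlation_score(real_activations, simulated)
print(f"explanation score: {score:.3f}")
```

A score near 1 means the explanation predicts the neuron's behavior well on these tokens. A score near 0 means the explanation tells the simulator little about when the neuron fires.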