Mechanistic Interpretability
Introduction
“Mechanistic Interpretability is the study of reverse-engineering neural networks - analogous to how we might try to reverse-engineer a program’s source code from its compiled binary, our goal is to reverse engineer the parameters of a trained neural network, and to try to reverse engineer what algorithms and internal cognition the model is actually doing. Going from knowing that it works, to understanding how it works.”
— Neel Nanda, https://www.neelnanda.io/mechanistic-interpretability/quickstart
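As a concrete starting point (not from the quoted post): before any reverse engineering can happen, you need access to the network's intermediate activations. A minimal sketch using PyTorch forward hooks on a toy model; the model, layer sizes, and names here are illustrative, not from any of the linked sources:

```python
import torch
import torch.nn as nn

# Toy model standing in for the network we want to look inside.
model = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 2),
)

activations = {}

def save_activation(name):
    # Returns a hook that stores a module's output under a readable name.
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Attach a hook to every layer so a forward pass records its internals.
for name, module in model.named_modules():
    if isinstance(module, (nn.Linear, nn.ReLU)):
        module.register_forward_hook(save_activation(name))

x = torch.randn(1, 4)
_ = model(x)
for name, act in activations.items():
    print(name, tuple(act.shape))
```

Interpreting what those recorded activations compute (the circuits, features, and algorithms) is the actual work of the field; the hook machinery above is just the instrumentation step.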
An Introduction to Circuits
Opening the Black Box of Large Language Models: Two Views on Holistic Interpretability
Anthropic Release
https://www.anthropic.com/research/mapping-mind-language-model
Threads:
https://x.com/austinc3301/status/1793043799020609794
https://x.com/mlpowered/status/1792948212728524917
Tweet:
https://x.com/youraimarketer/status/1792951206769303913?s=12
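The Anthropic release linked above reports using sparse autoencoders (SAEs) to decompose a model's activations into more interpretable features. A minimal sketch of that general idea, assuming a plain reconstruction loss with an L1 sparsity penalty; the dimensions and class names are hypothetical and this is not Anthropic's implementation:

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    # Maps d_model-dim activations into a larger set of sparse features,
    # then reconstructs the original activation from those features.
    def __init__(self, d_model, n_features):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))  # sparse, non-negative feature activations
        x_hat = self.decoder(feats)          # reconstruction of the input activation
        return x_hat, feats

sae = SparseAutoencoder(d_model=64, n_features=512)
acts = torch.randn(32, 64)                   # stand-in for activations captured from a model
x_hat, feats = sae(acts)

# Reconstruction error plus an L1 penalty that pushes most features to zero,
# so each input is explained by a small number of active features.
loss = ((x_hat - acts) ** 2).mean() + 1e-3 * feats.abs().mean()
```

The overcomplete feature basis (n_features much larger than d_model) combined with the sparsity penalty is what lets individual learned features line up with human-interpretable concepts, which is the "mapping the mind" result the release describes.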