Mechanistic Interpretability

Introduction

“Mechanistic Interpretability is the study of reverse-engineering neural networks - analogous to how we might try to reverse-engineer a program’s source code from its compiled binary, our goal is to reverse engineer the parameters of a trained neural network, and to try to reverse engineer what algorithms and internal cognition the model is actually doing. Going from knowing that it works, to understanding how it works.”

https://www.neelnanda.io/mechanistic-interpretability/quickstart
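As a concrete starting point, here is a minimal sketch of the loop the quickstart teaches, using Nanda's TransformerLens library (assumes `transformer_lens` is installed; the model, prompt, and inspected hook are arbitrary illustrative choices):

```python
# Load a small, fully hooked model and cache every intermediate activation,
# so internals can be inspected rather than treated as a black box.
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")

prompt = "Mechanistic interpretability studies how models compute."
logits, cache = model.run_with_cache(prompt)

# The cache exposes activations by hook name, e.g. layer 0's attention
# pattern: shape (batch, n_heads, seq_len, seq_len).
attn = cache["blocks.0.attn.hook_pattern"]
print(attn.shape)
```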

Zoom In: An Introduction to Circuits

https://distill.pub/2020/circuits/zoom-in/
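The article "zooms in" on individual neurons and weights of InceptionV1. A rough sketch of that first step, pulling the first-layer convolution filters out of torchvision's GoogLeNet (the InceptionV1 architecture studied in the paper) and saving them as an image grid; the output filename is a placeholder:

```python
# Inspect the lowest-level learned parameters directly: extract the first
# conv layer's 64 RGB filters and write them out as a single image grid.
import torch
import torch.nn as nn
import torchvision

model = torchvision.models.googlenet(weights="DEFAULT")
first_conv = next(m for m in model.modules() if isinstance(m, nn.Conv2d))

w = first_conv.weight.detach()            # shape (64, 3, 7, 7)
w = (w - w.min()) / (w.max() - w.min())   # normalize to [0, 1] for display
grid = torchvision.utils.make_grid(w, nrow=8, padding=1)
torchvision.utils.save_image(grid, "conv1_filters.png")
```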

Opening the Black Box of Large Language Models: Two Views on Holistic Interpretability

https://arxiv.org/pdf/2402.10688.pdf

Anthropic Release

https://www.anthropic.com/research/mapping-mind-language-model

Threads: https://x.com/austinc3301/status/1793043799020609794 https://x.com/mlpowered/status/1792948212728524917

Tweet: https://x.com/youraimarketer/status/1792951206769303913?s=12
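The release is built on dictionary learning: sparse autoencoders trained to decompose a model's internal activations into interpretable features. Below is a minimal sketch of that objective in PyTorch; all dimensions, coefficients, and the random stand-in activations are purely illustrative:

```python
# Sparse-autoencoder sketch: reconstruct activations through an
# overcomplete ReLU bottleneck with an L1 sparsity penalty, so each
# learned "feature" fires on only a few inputs.
import torch
import torch.nn as nn

d_model, d_features, l1_coeff = 512, 4096, 1e-3  # illustrative sizes

class SparseAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        feats = torch.relu(self.encoder(x))  # sparse feature activations
        return self.decoder(feats), feats

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

acts = torch.randn(64, d_model)  # stand-in for cached LLM activations
opt.zero_grad()
recon, feats = sae(acts)
loss = ((recon - acts) ** 2).mean() + l1_coeff * feats.abs().sum(-1).mean()
loss.backward()
opt.step()
```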