circuits tag archive | 土法炼钢兴趣小组的算法知识备份

[Transformer and Attention Mechanisms] 53 | Mechanistic Interpretability: Circuits, Features, Attribution

2026-04-15 | transformer | #transformer #mechanistic-interpretability #circuits #sparse-autoencoder #activation-patching

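Mechanistic interpretability is not satisfied with asking "where does the model appear to attend?" Instead, it tries to identify which heads, MLP features, and residual-stream paths inside a Transformer jointly implement a given behavior. This article explains the basic ideas behind induction heads, activation patching, superposition, Sparse Autoencoders, and circuit analysis, and why these methods come close to causal explanation yet remain far from explaining a large model as a whole.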