by xi
Softmax attention in Transformers is an instance of Nadaraya-Watson kernel regression with an exponential kernel. Both compute a weighted average of responses, where the weights come from a normalized similarity function.
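In symbols, with the kernel chosen as $K(q, k_i) = \exp(q \cdot k_i / \sqrt{d})$ (the scaled dot-product score exponentiated), the Nadaraya-Watson estimator and attention coincide:

```latex
\hat f(q)
= \frac{\sum_i K(q, k_i)\, v_i}{\sum_j K(q, k_j)}
= \sum_i \operatorname{softmax}_i\!\left(\frac{q \cdot k_i}{\sqrt{d}}\right) v_i
= \mathrm{Attention}(q, K, V)
```

The softmax's denominator plays exactly the role of the kernel normalization in Nadaraya-Watson.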
Both are, at heart, the same similarity-to-weight mapping:
| Transformer Attention | Kernel Regression |
|---|---|
| query | evaluation point |
| keys | sample locations |
| values | sample responses |
| attention score (scaled dot product) | similarity / negative distance |
| softmax weights | normalized kernel weights |
| output | locally weighted average |
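The correspondence can be checked numerically. A minimal sketch in NumPy (the names `q`, `K`, `V` and the exponential kernel are illustrative assumptions, not from any particular library): compute one query's attention output via softmax, then the Nadaraya-Watson estimate with kernel $\exp(q \cdot k_i / \sqrt{d})$, and confirm they agree.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
q = rng.normal(size=d)        # query = evaluation point
K = rng.normal(size=(5, d))   # keys = sample locations
V = rng.normal(size=(5, 3))   # values = sample responses

# Transformer attention: softmax over scaled dot-product scores
# (max-subtraction is the usual numerical-stability trick; it cancels
# in the normalization and does not change the result).
scores = K @ q / np.sqrt(d)
w = np.exp(scores - scores.max())
w /= w.sum()
out_attention = w @ V

# Nadaraya-Watson with the exponential kernel k(q, k_i) = exp(q·k_i/√d):
# f(q) = Σ_i k(q, k_i) v_i / Σ_j k(q, k_j)
kern = np.exp(K @ q / np.sqrt(d))
out_nw = (kern @ V) / kern.sum()

assert np.allclose(out_attention, out_nw)
```

The only difference between the two computations is bookkeeping: softmax folds the kernel evaluation and the normalization into one step.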