Attention is a reinvention of kernel smoothing.

[Interactive figure: data points and the prediction under a Gaussian kernel, with the normalized weights $\alpha_i$ shown at each $x_i$.]

Attention as Kernel Smoothing

The attention mechanism in Transformers has the same mathematical form as Nadaraya-Watson kernel regression: both compute a weighted average of values, where the weights come from a normalized similarity function between a query and the sample locations.

Kernel Smoothing

$K(x_q, x_i)$ = similarity-to-weight mapping

$\alpha_i = \frac{K(x_q, x_i)}{\sum_j K(x_q, x_j)}$

$\hat{y}(x_q) = \sum_i \alpha_i y_i$
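The two formulas above can be sketched directly in NumPy. This is a minimal illustration, not a library implementation; the function name `nadaraya_watson` and the sample data are assumptions for the example.

```python
import numpy as np

def nadaraya_watson(x_q, x, y, h=0.5):
    """Predict y at x_q as a kernel-weighted average of samples (x_i, y_i)."""
    # Gaussian kernel: similarity decays with squared distance, scaled by bandwidth h
    k = np.exp(-((x_q - x) ** 2) / (2 * h ** 2))
    alpha = k / k.sum()      # normalized weights alpha_i
    return np.dot(alpha, y)  # weighted average of the responses y_i

x = np.array([-1.0, 0.0, 1.0, 2.0])
y = np.array([0.5, 1.0, 2.0, 1.5])
print(nadaraya_watson(0.5, x, y))
```

Because the weights are normalized and non-negative, the prediction is always a convex combination of the observed $y_i$.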

Attention (scaled dot-product)

$\text{score}_i = \frac{q \cdot k_i}{\sqrt{d_k}}$

$\alpha_i = \text{softmax}(\text{score})_i$

$\text{out} = \sum_i \alpha_i v_i$
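The attention side has the same shape. Below is a single-query sketch of scaled dot-product attention in NumPy (names `softmax` and `attention` are illustrative, and the toy `q`, `K`, `V` are assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention(q, K, V):
    """Scaled dot-product attention for a single query vector q."""
    d_k = q.shape[-1]
    scores = K @ q / np.sqrt(d_k)  # score_i = q . k_i / sqrt(d_k)
    alpha = softmax(scores)        # attention weights alpha_i
    return alpha @ V               # out = sum_i alpha_i v_i

q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
V = np.array([[1.0], [2.0], [3.0]])
print(attention(q, K, V))
```

Compare line by line with the kernel-smoothing version: scores play the role of kernel evaluations, softmax plays the role of normalization, and the output is again a convex combination of the values.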

Component Mapping

| Transformer Attention | Kernel Regression |
|---|---|
| query $q$ | evaluation point $x_q$ |
| keys $\{k_i\}$ | sample locations $\{x_i\}$ |
| values $\{v_i\}$ | sample responses $\{y_i\}$ |
| score $q \cdot k_i / \sqrt{d_k}$ | similarity / negative distance |
| softmax weights $\alpha_i$ | normalized kernel weights |
| output $\sum_i \alpha_i v_i$ | locally weighted average |

Key Insights

  • Learned kernel: In Transformers, the projections $q = xW_Q$, $k = xW_K$, $v = xW_V$ are learned, so the model learns the similarity geometry.
  • Multiple heads: Multi-head attention uses multiple learned kernels in parallel.
  • Bandwidth vs temperature: The Gaussian bandwidth $h$ controls locality; the softmax temperature $\tau$ (or the $\sqrt{d_k}$ scaling) controls how peaky the weights become.
  • Efficient attention: Because softmax attention is a kernel operation, methods like Performer approximate it with random feature maps for linear complexity.
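The bandwidth/temperature analogy is easy to see numerically: dividing the scores by a small $\tau$ before the softmax concentrates the weights, just as a small $h$ localizes the Gaussian kernel. A quick sketch (the helper `softmax_temp` and the scores are illustrative):

```python
import numpy as np

def softmax_temp(scores, tau=1.0):
    """Softmax with temperature: small tau -> peaky weights, large tau -> uniform."""
    z = scores / tau
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5, -1.0])
for tau in (5.0, 1.0, 0.1):
    w = softmax_temp(scores, tau)
    print(f"tau={tau}: weights={w.round(3)}, max weight={w.max():.3f}")
```

As $\tau$ shrinks, the maximum weight approaches 1 and attention collapses onto the single best-scoring key.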

What to Try

  • Set $h$ small (e.g., 0.2) and move $x_q$: weights become very local.
  • Set $\tau$ small in Softmax-dot: weights become peaky (attention focuses on a few keys).
  • Change the $x_i$ positions to cluster points: observe how $\alpha$ redistributes.
  • Compare Gaussian vs Softmax-dot: note how softmax-dot prefers points with the same sign as $x_q$.
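The last comparison can be reproduced in a few lines. In 1-D, the Gaussian kernel weights points by closeness to $x_q$, while the dot-product kernel weights them by the product $x_q \cdot x_i$, so it favors points with the same sign as $x_q$ (the data below is a toy example):

```python
import numpy as np

x = np.array([-2.0, -1.0, 1.0, 2.0])
x_q = 1.5

# Gaussian kernel (h = 1): weight by closeness to x_q
g = np.exp(-((x_q - x) ** 2) / 2)
g /= g.sum()

# Softmax-dot kernel: weight by dot product x_q * x_i
s = np.exp(x_q * x - (x_q * x).max())
s /= s.sum()

print("gaussian   ", g.round(3))
print("softmax-dot", s.round(3))
```

Note that the Gaussian weights x = 1 and x = 2 equally (both are 0.5 away from $x_q$), while softmax-dot puts its largest weight on x = 2, the point most aligned with $x_q$.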