Attention is a reinvention of kernel smoothing.

[Interactive figure: data points and the prediction under a Gaussian kernel, with the normalized weights $\alpha_i$ shown at each $x_i$.]

Attention as Kernel Smoothing

The attention mechanism in Transformers has the same mathematical form as Nadaraya-Watson kernel regression: both compute a weighted average of values, where the weights come from a normalized similarity function between a query and the sample locations.

Kernel Smoothing

$K(x_q, x_i)$ = similarity-to-weight mapping

$\alpha_i = \frac{K(x_q, x_i)}{\sum_j K(x_q, x_j)}$

$\hat{y}(x_q) = \sum_i \alpha_i y_i$
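The two formulas above can be sketched directly in NumPy. This is a minimal illustration, not a library implementation; the function name `nadaraya_watson` and the sample data are assumptions for the example.

```python
import numpy as np

def nadaraya_watson(x_q, x, y, h=0.5):
    """Predict y at x_q as a kernel-weighted average of samples (x_i, y_i)."""
    # Gaussian kernel: similarity decays with squared distance, scaled by bandwidth h
    k = np.exp(-((x_q - x) ** 2) / (2 * h ** 2))
    alpha = k / k.sum()      # normalized weights alpha_i
    return np.dot(alpha, y)  # weighted average of the responses y_i

x = np.array([-1.0, 0.0, 1.0, 2.0])
y = np.array([0.5, 1.0, 2.0, 1.5])
print(nadaraya_watson(0.5, x, y))
```

Because the weights are normalized and non-negative, the prediction is always a convex combination of the observed $y_i$.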

Attention (scaled dot-product)

$\text{score}_i = \frac{q \cdot k_i}{\sqrt{d_k}}$

$\alpha_i = \text{softmax}(\text{score})_i$

$\text{out} = \sum_i \alpha_i v_i$
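The attention side has the same shape. Below is a single-query sketch of scaled dot-product attention in NumPy (names `softmax` and `attention` are illustrative, and the toy `q`, `K`, `V` are assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def attention(q, K, V):
    """Scaled dot-product attention for a single query vector q."""
    d_k = q.shape[-1]
    scores = K @ q / np.sqrt(d_k)  # score_i = q . k_i / sqrt(d_k)
    alpha = softmax(scores)        # attention weights alpha_i
    return alpha @ V               # out = sum_i alpha_i v_i

q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
V = np.array([[1.0], [2.0], [3.0]])
print(attention(q, K, V))
```

Compare line by line with the kernel-smoothing version: scores play the role of kernel evaluations, softmax plays the role of normalization, and the output is again a convex combination of the values.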

Component Mapping

| Transformer Attention | Kernel Regression |
|---|---|
| query $q$ | evaluation point $x_q$ |
| keys $\{k_i\}$ | sample locations $\{x_i\}$ |
| values $\{v_i\}$ | sample responses $\{y_i\}$ |
| score $q \cdot k_i / \sqrt{d_k}$ | similarity / negative distance |
| softmax weights $\alpha_i$ | normalized kernel weights |
| output $\sum_i \alpha_i v_i$ | locally weighted average |

Key Insights

  • Learned kernel: In Transformers, the projections $q = xW_Q$, $k = xW_K$, $v = xW_V$ are learned, so the model learns the similarity geometry.
  • Multiple heads: Multi-head attention uses multiple learned kernels in parallel.
  • Bandwidth vs temperature: The Gaussian bandwidth $h$ controls locality; the softmax temperature $\tau$ (or the $\sqrt{d_k}$ scaling) controls how peaky the weights become.
  • Efficient attention: Because softmax attention is a kernel operation, methods like Performer approximate it with random feature maps for linear complexity.
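The bandwidth/temperature analogy is easy to see numerically: dividing the scores by a small $\tau$ before the softmax concentrates the weights, just as a small $h$ localizes the Gaussian kernel. A quick sketch (the helper `softmax_temp` and the scores are illustrative):

```python
import numpy as np

def softmax_temp(scores, tau=1.0):
    """Softmax with temperature: small tau -> peaky weights, large tau -> uniform."""
    z = scores / tau
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.5, -1.0])
for tau in (5.0, 1.0, 0.1):
    w = softmax_temp(scores, tau)
    print(f"tau={tau}: weights={w.round(3)}, max weight={w.max():.3f}")
```

As $\tau$ shrinks, the maximum weight approaches 1 and attention collapses onto the single best-scoring key.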

What to Try

  • Set $h$ small (e.g., 0.2) and move $x_q$: weights become very local.
  • Set $\tau$ small in Softmax-dot: weights become peaky (attention focuses on a few keys).
  • Change the $x_i$ positions to cluster points: observe how $\alpha$ redistributes.
  • Compare Gaussian vs Softmax-dot: note how softmax-dot prefers points with the same sign as $x_q$.
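The last comparison can be reproduced in a few lines. In 1-D, the Gaussian kernel weights points by closeness to $x_q$, while the dot-product kernel weights them by the product $x_q \cdot x_i$, so it favors points with the same sign as $x_q$ (the data below is a toy example):

```python
import numpy as np

x = np.array([-2.0, -1.0, 1.0, 2.0])
x_q = 1.5

# Gaussian kernel (h = 1): weight by closeness to x_q
g = np.exp(-((x_q - x) ** 2) / 2)
g /= g.sum()

# Softmax-dot kernel: weight by dot product x_q * x_i
s = np.exp(x_q * x - (x_q * x).max())
s /= s.sum()

print("gaussian   ", g.round(3))
print("softmax-dot", s.round(3))
```

Note that the Gaussian weights x = 1 and x = 2 equally (both are 0.5 away from $x_q$), while softmax-dot puts its largest weight on x = 2, the point most aligned with $x_q$.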