Quant GT
Section 20 · Lesson 20.2

Backpropagation

How gradients flow backward through composed functions.

Backpropagation computes gradients of a loss with respect to all network parameters by repeatedly applying the chain rule, starting from the output and moving backward through the layers.

For a network $L = f_n(\dots f_2(f_1(x))\dots)$, the gradient with respect to layer $l$ uses gradients computed in layer $l+1$:

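$$\frac{\partial L}{\partial \theta_l} = \frac{\partial L}{\partial h_l} \cdot \frac{\partial h_l}{\partial \theta_l}$$

Here $h_l = f_l(h_{l-1})$ is the output of layer $l$ and $\theta_l$ are its parameters. The factor $\partial L / \partial h_l$ is where layer $l+1$ enters, since it is obtained from the gradient already computed one layer up:

$$\frac{\partial L}{\partial h_l} = \frac{\partial L}{\partial h_{l+1}} \cdot \frac{\partial h_{l+1}}{\partial h_l}$$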
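To make the chain-rule formula above concrete, here is a minimal sketch of a hand-written backward pass. The setup is an assumption made for illustration, not something the lesson specifies: each layer is a scalar map `h_out = tanh(w * h_in)` and the loss is `0.5 * (h_n - target)^2`. The names `forward`, `backward`, and `loss` are hypothetical helpers.

```python
import math

def forward(x, weights):
    """Run h_out = tanh(w * h_in) layer by layer and cache every activation."""
    hs = [x]  # hs[0] is the input; hs[i + 1] is the output of the layer with weights[i]
    for w in weights:
        hs.append(math.tanh(w * hs[-1]))
    return hs

def backward(hs, weights, target):
    """Return dL/dw for every layer, for L = 0.5 * (h_n - target)^2."""
    grads = [0.0] * len(weights)
    dL_dh = hs[-1] - target  # gradient of the loss w.r.t. the final output
    for i in reversed(range(len(weights))):
        h_in, h_out = hs[i], hs[i + 1]
        local = 1.0 - h_out ** 2            # tanh'(z), written using the cached output
        grads[i] = dL_dh * local * h_in     # dL/dh_out * dh_out/dw
        dL_dh = dL_dh * local * weights[i]  # gradient handed to the layer below
    return grads

def loss(x, weights, target):
    return 0.5 * (forward(x, weights)[-1] - target) ** 2

weights = [0.5, -1.2, 0.8]  # made-up values
x, target = 0.7, 0.3
grads = backward(forward(x, weights), weights, target)
```

Each loop iteration uses `dL_dh` from the layer above exactly once and then overwrites it, which is the recursion the formula describes. A finite-difference check on `loss` is a simple way to confirm the gradients.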

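The backward pass reuses the intermediate values cached during the forward pass, along with the upstream gradient already computed for the layer above. Without that reuse, each layer's gradient would require re-walking the chain from the output, so the total number of chain-rule steps would grow as $O(\text{depth}^2)$. With it, the whole backward pass costs a small constant multiple of a single forward pass.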
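The effect of reuse can be seen by counting multiplications in a simplified setting. This sketch assumes a chain of scalar stages where `local_derivs[k]` is the derivative of stage `k+1` with respect to stage `k`; the values are made up, and both function names are hypothetical.

```python
def grads_without_reuse(local_derivs):
    """Gradient at every stage, recomputing the chain product from scratch each time."""
    n = len(local_derivs)
    grads, multiplies = [], 0
    for l in range(n):
        g = 1.0
        for k in range(l, n):  # walk from stage l all the way to the output again
            g *= local_derivs[k]
            multiplies += 1
        grads.append(g)
    return grads, multiplies

def grads_with_reuse(local_derivs):
    """Same gradients from one backward sweep that keeps a running product."""
    n = len(local_derivs)
    grads, multiplies = [0.0] * n, 0
    g = 1.0
    for l in reversed(range(n)):
        g *= local_derivs[l]  # extend the product already built for stage l + 1
        multiplies += 1
        grads[l] = g
    return grads, multiplies

local_derivs = [0.9, 1.1, 0.8, 1.2, 0.7, 1.05, 0.95, 1.0]  # made-up local derivatives
slow, slow_ops = grads_without_reuse(local_derivs)
fast, fast_ops = grads_with_reuse(local_derivs)
print(slow_ops, fast_ops)  # prints: 36 8
```

For 8 stages the naive version does 8 + 7 + ... + 1 = 36 multiplications and the backward sweep does 8, matching the quadratic versus linear growth in depth.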

Modern frameworks (PyTorch, JAX, TensorFlow) handle backprop automatically through automatic differentiation (called autograd in PyTorch). You only need to write the forward pass and a scalar loss; the framework computes the gradients by reverse-mode differentiation through the recorded computation graph.
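To show what "reverse-mode differentiation through the graph" means in miniature, here is a toy scalar autodiff sketch in plain Python. It is not how PyTorch, JAX, or TensorFlow are implemented; the `Value` class and its methods are hypothetical and support only addition, multiplication, and `tanh`.

```python
import math

class Value:
    """A scalar that records how it was computed so gradients can flow back."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent), one entry per parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def tanh(self):
        t = math.tanh(self.data)
        return Value(t, (self,), (1.0 - t * t,))

    def backward(self):
        # Post-order DFS lists inputs before the nodes that consume them.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        # Reversed order: a node's gradient is complete before it is passed on.
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += node.grad * local

# Only the forward pass is written by hand.
w, x, b = Value(0.5), Value(2.0), Value(-0.3)
y = (w * x + b).tanh()
y.backward()  # fills in w.grad, x.grad, and b.grad
```

Gradients are accumulated with `+=` so that a value used in several places receives the sum of the contributions from every path, which is what the multivariate chain rule requires.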

Practical issues: vanishing gradients in deep nets (largely mitigated by ReLU-style activations and residual connections), exploding gradients in RNNs (commonly handled with gradient clipping), and numerical instability with poorly scaled inputs.
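Gradient clipping, mentioned above as the usual remedy for exploding gradients, can be sketched as rescaling by the global L2 norm. This is one common variant and the choice is an assumption here, since the lesson does not say which form it has in mind. The function name and the flat-list representation of gradients are hypothetical simplifications.

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Rescale a flat list of gradients so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads), total_norm  # already small enough, leave unchanged
    scale = max_norm / total_norm
    return [g * scale for g in grads], total_norm

grads = [3.0, -4.0, 12.0]  # made-up gradient values with L2 norm 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
```

Scaling every component by the same factor keeps the update direction and only shortens the step, which is why norm-based clipping is often preferred over clamping each component independently.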