Quant GT
Section 20 · Lesson 20.4

Regularization in DL

Dropout, weight decay, and early stopping.

Deep nets have so many parameters that regularization isn't optional.

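Weight decay penalizes the squared parameter norm $\|\theta\|^2$. Under plain SGD it is equivalent to $L_2$ regularization; under adaptive optimizers such as Adam the two differ, which is why modern optimizers like AdamW decouple the decay from the gradient update for cleaner behavior. It's almost always worth using.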
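To make the relationship between an $L_2$ penalty and decoupled weight decay concrete, here is a minimal pure-Python sketch of single update steps. It assumes the penalty is written as $(\lambda/2)\|\theta\|^2$, so its gradient is $\lambda\theta$. The function names and the fixed per-parameter `scale` (a stand-in for the adaptive rescaling an Adam-style optimizer applies) are illustrative assumptions, not any library's API.

```python
def sgd_l2_step(theta, grad, lr, lam):
    # L2 regularization: the penalty gradient lam * t is added to the loss gradient.
    return [t - lr * (g + lam * t) for t, g in zip(theta, grad)]


def sgd_decoupled_step(theta, grad, lr, lam):
    # Decoupled weight decay: shrink each weight directly, apart from the gradient step.
    return [t - lr * g - lr * lam * t for t, g in zip(theta, grad)]


def adaptive_l2_step(theta, grad, scale, lr, lam):
    # With an adaptive optimizer, the penalty gradient is rescaled along with the loss gradient.
    return [t - lr * s * (g + lam * t) for t, g, s in zip(theta, grad, scale)]


def adaptive_decoupled_step(theta, grad, scale, lr, lam):
    # AdamW-style: only the loss gradient is rescaled; the decay term is applied as-is.
    return [t - lr * s * g - lr * lam * t for t, g, s in zip(theta, grad, scale)]


theta = [1.0, -2.0, 0.5]
grad = [0.1, 0.3, -0.2]
scale = [10.0, 0.1, 1.0]  # hypothetical per-parameter adaptive scales

sgd_a = sgd_l2_step(theta, grad, lr=0.1, lam=0.01)
sgd_b = sgd_decoupled_step(theta, grad, lr=0.1, lam=0.01)
ada_a = adaptive_l2_step(theta, grad, scale, lr=0.1, lam=0.01)
ada_b = adaptive_decoupled_step(theta, grad, scale, lr=0.1, lam=0.01)
```

Under plain SGD the two formulations produce the same weights. Once the gradient is rescaled per parameter, the $L_2$ version decays heavily scaled weights more strongly, while the decoupled version shrinks every weight at the same relative rate.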

Dropout randomly zeros out a fraction of activations during training, forcing the network not to rely on any single neuron. It's roughly equivalent to ensembling many subnetworks. Use lower dropout in convolutional layers, higher (e.g. $0.5$) in fully-connected ones.
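A minimal sketch of the mechanism, in pure Python, may help. It uses the common "inverted dropout" convention, in which surviving activations are divided by the keep probability so that the expected activation is unchanged and nothing needs rescaling at inference time. That rescaling, the function name `dropout`, and its parameters are assumptions made for illustration.

```python
import random


def dropout(activations, p, training, rng):
    """Zero each activation with probability p during training (inverted dropout)."""
    if not training or p == 0.0:
        # At inference time dropout is a no-op.
        return list(activations)
    keep = 1.0 - p
    # Survivors are scaled by 1 / keep so the expected value matches the input.
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


rng = random.Random(0)
acts = [1.0] * 10000

train_out = dropout(acts, p=0.5, training=True, rng=rng)
eval_out = dropout(acts, p=0.5, training=False, rng=rng)
mean_train = sum(train_out) / len(train_out)
```

With `p=0.5`, each training output is either `0.0` or `2.0`, and the mean stays close to the original activation of `1.0`.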

Early stopping monitors validation loss and stops training when it stops improving — a free regularizer that also saves compute.
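In practice "stops improving" is usually made precise with a patience parameter: training halts once validation loss has gone a fixed number of epochs without a new best. The sketch below shows that rule on a precomputed list of validation losses. The names `early_stopping_epoch`, `patience`, and `min_delta`, as well as the loss values, are hypothetical and chosen only for illustration.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses."""
    best = float("inf")
    best_epoch = -1
    for epoch, loss in enumerate(val_losses):
        if loss < best - min_delta:
            # New best: remember it (a real loop would also checkpoint the weights here).
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop.
            return epoch, best_epoch
    # Ran out of epochs without triggering the stop rule.
    return len(val_losses) - 1, best_epoch


losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.74]  # made-up validation curve
stop_epoch, best_epoch = early_stopping_epoch(losses, patience=3)
```

Here the best loss occurs at epoch 2 and training stops at epoch 5. A real training loop would then restore the weights saved at the best epoch.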

Data augmentation, batch normalization, and label smoothing are also common regularizers in production deep-learning pipelines.
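Of these, label smoothing is the simplest to show in a few lines. The sketch below uses one common formulation, assumed here: a one-hot target is mixed with a uniform distribution over the classes, weighted by a small `eps`. The function name and the default `eps=0.1` are illustrative choices.

```python
def smooth_labels(class_index, num_classes, eps=0.1):
    """Mix a one-hot target with a uniform distribution over all classes."""
    off = eps / num_classes  # probability mass given to every class
    return [
        (1.0 - eps) + off if k == class_index else off
        for k in range(num_classes)
    ]


target = smooth_labels(class_index=2, num_classes=4, eps=0.1)
```

The smoothed target still sums to one, but the true class no longer carries probability exactly `1.0`, which discourages the network from producing overconfident logits.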