Support Vector Machines

Maximum-margin classifiers and the kernel trick.

An SVM finds the linear decision boundary that maximizes the margin between two classes. For linearly separable data with labels $y_i \in \{-1, +1\}$, this is the hard-margin problem:

$$\min_{w, b} \frac{1}{2}\|w\|^2 \quad \text{s.t.} \quad y_i(w^\top x_i + b) \ge 1 \quad \forall i$$
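
To make the constraint concrete, the sketch below evaluates $y_i(w^\top x_i + b)$ on a small hand-made 2D dataset and reports the margin width $2/\|w\|$ implied by a feasible $(w, b)$. The data points, the candidate values of `w` and `b`, and the helper names are illustrative assumptions rather than part of any library. It only checks candidates and does not solve the optimization problem.

```python
import math

def functional_margins(w, b, X, y):
    """Return y_i * (w . x_i + b) for every training point."""
    return [
        yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        for xi, yi in zip(X, y)
    ]

def is_feasible(w, b, X, y, tol=1e-9):
    """True if every point satisfies the hard-margin constraint >= 1."""
    return all(m >= 1.0 - tol for m in functional_margins(w, b, X, y))

def margin_width(w):
    """Distance between the two margin hyperplanes, 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wj * wj for wj in w))

# Toy, linearly separable data (assumed for illustration).
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [1, 1, -1, -1]

# Candidate A: constraints are tight at (2, 2) and (0, 0).
w_a, b_a = (0.5, 0.5), -1.0
# Candidate B: same separating line, but a larger norm.
w_b, b_b = (1.0, 1.0), -2.0
# Candidate C: norm too small, so some constraints are violated.
w_c, b_c = (0.25, 0.25), -0.5

for name, w, b in [("A", w_a, b_a), ("B", w_b, b_b), ("C", w_c, b_c)]:
    print(name, is_feasible(w, b, X, y), round(margin_width(w), 3))
```

Candidates A and B describe the same separating line, but only A makes the constraints tight at the closest points, so it has the smaller norm and the larger value of $2/\|w\|$. That is why minimizing $\|w\|^2$ under these constraints corresponds to maximizing the margin.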

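The dual formulation depends on the training data only through inner products $\langle x_i, x_j \rangle$. The kernel trick replaces each inner product with a kernel function $K(x_i, x_j)$, which corresponds to an inner product in an implicit feature space and so lets the SVM fit non-linear boundaries in the original input space. Common kernels are the RBF, polynomial, and sigmoid kernels.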
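
A minimal sketch of why this works: for 2D inputs, the homogeneous degree-2 polynomial kernel $K(x, z) = (x^\top z)^2$ gives the same value as an ordinary inner product after the explicit feature map $\phi(x) = (x_1^2, \sqrt{2}\,x_1 x_2, x_2^2)$. The function names and sample points are assumptions made for illustration. The RBF kernel is included only to show its formula, with an arbitrarily chosen `gamma`.

```python
import math

def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: K(x, z) = (x . z)^2."""
    return dot(x, z) ** 2

def poly2_features(x):
    """Explicit feature map for 2D inputs matching poly2_kernel."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def rbf_kernel(x, z, gamma=0.5):
    """RBF kernel: K(x, z) = exp(-gamma * ||x - z||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-gamma * sq_dist)

# Sample points (assumed for illustration).
x, z = (1.0, 2.0), (3.0, -1.0)

k_implicit = poly2_kernel(x, z)                          # never builds phi
k_explicit = dot(poly2_features(x), poly2_features(z))   # builds phi first

print(k_implicit, k_explicit)
print(rbf_kernel(x, z))
```

The two polynomial values agree up to floating-point rounding, yet the kernel version never constructs the feature vectors. For the RBF kernel the implicit feature space is infinite-dimensional, so evaluating the kernel directly is the only practical route.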

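For non-separable data, slack variables allow some points to violate the margin or be misclassified, at a cost (soft-margin SVM). The trade-off is governed by a hyperparameter $C$: a larger $C$ penalizes violations more heavily and favors a narrower margin, while a smaller $C$ tolerates more violations in exchange for a wider margin.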
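
The role of $C$ can be sketched with the standard soft-margin objective $\frac{1}{2}\|w\|^2 + C \sum_i \xi_i$, where each slack is $\xi_i = \max(0,\, 1 - y_i(w^\top x_i + b))$. The source does not write this form out, so treat it as the usual textbook formulation. The dataset, the two candidate boundaries, and the function names below are assumptions for illustration. The code compares two fixed candidates and does not run an optimizer.

```python
def hinge_slack(w, b, x, y):
    """Slack for one point: max(0, 1 - y * (w . x + b))."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return max(0.0, 1.0 - y * score)

def soft_margin_objective(w, b, X, y, C):
    """0.5 * ||w||^2 + C * (sum of slacks)."""
    reg = 0.5 * sum(wj * wj for wj in w)
    total_slack = sum(hinge_slack(w, b, xi, yi) for xi, yi in zip(X, y))
    return reg + C * total_slack

# Toy data (assumed): the last positive point sits close to the negatives.
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0), (0.5, 0.5)]
y = [1, 1, -1, -1, 1]

# Wide margin that tolerates a violation at (0.5, 0.5).
wide = ((0.5, 0.5), -1.0)
# Narrow margin that satisfies every constraint, at the cost of a larger norm.
narrow = ((2.0, 2.0), -1.0)

for C in (1.0, 10.0):
    obj_wide = soft_margin_objective(*wide, X, y, C)
    obj_narrow = soft_margin_objective(*narrow, X, y, C)
    better = "wide" if obj_wide < obj_narrow else "narrow"
    print(f"C={C}: wide={obj_wide}, narrow={obj_narrow}, lower objective: {better}")
```

With the small $C$ the wide-margin candidate has the lower objective despite its violation. With the large $C$ the penalty dominates and the narrow, violation-free candidate wins.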

SVMs were dominant in the 2000s before being largely supplanted by tree ensembles and deep learning on most problems. They're still used in some structured settings: text classification, anomaly detection, and high-dimensional small-data problems where kernels can capture useful geometry.