Cross-Validation

Estimating out-of-sample performance honestly.

Cross-validation estimates how a model will perform on unseen data by splitting the available data into training and validation parts. The simplest version is $k$-fold: divide the data into $k$ groups (folds), train on $k-1$ of them and validate on the held-out fold, rotate so that each fold serves as the validation set exactly once, and average the $k$ validation scores.
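The rotation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the helper names `k_fold_indices` and `cross_val_mse` are hypothetical, the "model" is just the training mean, and the folds are contiguous blocks (for independent observations one would usually shuffle the indices first).

```python
from statistics import mean


def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds over range(n)."""
    if not 2 <= k <= n:
        raise ValueError("need 2 <= k <= n")
    base, extra = divmod(n, k)  # the first `extra` folds get one more item
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size


def cross_val_mse(ys, k):
    """Average validation MSE of a mean-predictor across k folds."""
    scores = []
    for train, val in k_fold_indices(len(ys), k):
        pred = mean(ys[i] for i in train)  # "fit": the training mean
        scores.append(mean((ys[i] - pred) ** 2 for i in val))
    return mean(scores)


if __name__ == "__main__":
    print(cross_val_mse([1, 2, 3, 4], k=2))
```

Each observation is used for validation exactly once and for training $k-1$ times, which is what makes the averaged score an estimate of out-of-sample error rather than in-sample fit.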

Why bother? In-sample error typically underestimates out-of-sample error because the model has been fitted to the very data it is evaluated on. Cross-validation reduces this optimism by scoring each fitted model only on data that was held out of its training.

Beware of look-ahead bias in time series: shuffling observations before splitting lets future information leak into the training set. For temporal data, use forward-chaining cross-validation (train on the past, validate on the future) rather than random splits.
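A forward-chaining split can be sketched as follows. This is one possible scheme (an expanding training window), written under the assumption that observations are already sorted by time; the function name `forward_chaining_splits` and the `min_train` parameter are hypothetical and chosen for illustration only.

```python
def forward_chaining_splits(n, n_splits, min_train=1):
    """Yield (train_idx, val_idx) where every training index precedes
    every validation index. Assumes indices 0..n-1 are in time order."""
    if n_splits < 1 or min_train < 1 or n - min_train < n_splits:
        raise ValueError("not enough observations for the requested splits")
    base, extra = divmod(n - min_train, n_splits)
    train_end = min_train
    for i in range(n_splits):
        size = base + (1 if i < extra else 0)
        train = list(range(train_end))                    # everything so far
        val = list(range(train_end, train_end + size))    # the next block
        yield train, val
        train_end += size                                 # expand the window


if __name__ == "__main__":
    for train, val in forward_chaining_splits(10, n_splits=3, min_train=4):
        print(train, val)
```

Unlike the rotating $k$-fold scheme, the earliest observations are never used for validation and the latest are never used for training, so no split can see its own future.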

In quant trading, cross-validation results are necessary but not sufficient: even rigorous CV can produce overconfident estimates due to dataset re-use across many model trials. Walk-forward backtests and out-of-sample lockboxes provide harder-to-fool reality checks.
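The effect of dataset re-use can be illustrated with a small simulation. It is a toy sketch under strong assumptions: every "strategy" is pure Gaussian noise with zero true edge, the mean return on the re-used data stands in for a validation score, and the function name `best_of_many_trials` and all parameter values are hypothetical.

```python
import random
from statistics import mean


def best_of_many_trials(n_trials=200, n_obs=250, seed=0):
    """Pick the best of many zero-edge strategies on re-used data, then
    report how that same strategy did on an untouched lockbox sample."""
    rng = random.Random(seed)
    best_score, best_lockbox = float("-inf"), None
    for _ in range(n_trials):
        # Daily returns with zero true mean: no strategy has any real edge.
        seen = [rng.gauss(0.0, 0.01) for _ in range(n_obs)]
        lockbox = [rng.gauss(0.0, 0.01) for _ in range(n_obs)]
        score = mean(seen)
        if score > best_score:
            best_score, best_lockbox = score, mean(lockbox)
    return best_score, best_lockbox


if __name__ == "__main__":
    selected, lockbox = best_of_many_trials()
    print(f"selected score on re-used data: {selected:.5f}")
    print(f"same strategy on lockbox data:  {lockbox:.5f}")
```

The selected score is the maximum over many noisy estimates, so it is biased upward even though no strategy has an edge. The lockbox figure is a single untouched draw and carries no such selection bias, which is why it serves as the harder-to-fool check.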