Power and Effect Size

Designing tests that can actually detect what you care about.

Power, sample size, effect size, and significance level are linked by a single relationship: pick any three and the fourth is determined.

For comparing two means with $n$ per group, the approximate power is

$$\text{power} \approx \Phi\!\left(\frac{\delta \sqrt{n/2}}{\sigma} - z_{1 - \alpha/2}\right)$$

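where $\delta$ is the true difference in means, $\sigma$ is the within-group standard deviation (taken to be the same in both groups), $\Phi$ is the standard normal CDF, and $z_{1 - \alpha/2}$ is the critical value of a two-sided test at significance level $\alpha$. Bigger effects and bigger samples both raise power; bigger noise lowers it.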
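The formula is easy to evaluate directly, and it can be inverted to get the sample size for a target power. The following is a minimal sketch using only the Python standard library; the function names `approx_power` and `n_per_group` are illustrative choices, not part of any library, and the code inherits the normal approximation from the formula above.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def approx_power(delta, sigma, n, alpha=0.05):
    """Approximate power of a two-sided two-sample test with n per group."""
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(delta * sqrt(n / 2) / sigma - z_crit)


def n_per_group(delta, sigma, power=0.80, alpha=0.05):
    """Smallest n per group for which approx_power reaches the target.

    Obtained by solving the power formula for n:
        n = 2 * ((z_{1-alpha/2} + z_{power}) * sigma / delta) ** 2
    """
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_crit + z_power) * sigma / delta) ** 2)


if __name__ == "__main__":
    # A half-standard-deviation effect at 80% power and alpha = 0.05.
    n = n_per_group(delta=0.5, sigma=1.0)
    print(n, round(approx_power(0.5, 1.0, n), 3))
```

Because the required $n$ scales with $(\sigma/\delta)^2$, halving the effect size quadruples the sample needed.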

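Underpowered studies are a notorious problem. They miss real effects and, when they do reach significance, tend to overstate the effect size: only estimates that happen to land far from zero clear the threshold, so the significant ones are biased upward. This is known as the winner's curse.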
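A small simulation makes the winner's curse concrete. This is a sketch under simplifying assumptions: $\sigma$ is treated as known, so each study's estimated difference is drawn directly from its sampling distribution $N(\delta, 2\sigma^2/n)$ rather than from simulated raw data. The function name `winners_curse` and the default parameter values are hypothetical, chosen only to give a clearly underpowered design.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def winners_curse(delta=0.2, sigma=1.0, n=20, alpha=0.05,
                  trials=20_000, seed=0):
    """Simulate many small two-group studies with known sigma.

    Returns (fraction of studies reaching significance,
             mean absolute estimated effect among significant studies).
    """
    rng = random.Random(seed)
    se = sigma * sqrt(2 / n)  # standard error of the difference in means
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)

    # Each study's estimated difference, drawn from its sampling distribution.
    estimates = [rng.gauss(delta, se) for _ in range(trials)]

    # Keep only the studies that would be declared significant (two-sided).
    significant = [d for d in estimates if abs(d) / se > z_crit]
    return len(significant) / trials, mean(abs(d) for d in significant)


if __name__ == "__main__":
    rate, avg_effect = winners_curse()
    print(f"significant: {rate:.1%}, mean |estimate| when significant: {avg_effect:.2f}")
```

With these defaults the significance threshold on the estimate is about $0.62$, more than three times the true effect of $0.2$, so every "discovery" necessarily overstates the effect, and a few even have the wrong sign.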

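In trading, this matters viscerally: if you're A/B testing two strategies that each have an annualized Sharpe of about $1$ and you want to detect a true Sharpe difference of $0.5$ at $80\%$ power, the formula above calls for decades of daily data, on the order of 60 years if the two return streams are independent. Strongly correlated strategies need less, but still years rather than months. Most "I just compared two backtests" exercises are dramatically underpowered.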
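The data requirement can be checked with a back-of-the-envelope calculation. The sketch below reads the Sharpe figures as annualized, assumes 252 trading days per year and i.i.d. daily returns with equal volatility, and converts the Sharpe gap to a daily standardized effect by dividing by $\sqrt{252}$. The `rho` parameter is an added assumption, not something from the text above: it is the correlation between the two strategies' daily returns in a paired comparison, with `rho=0` reproducing the two-independent-groups formula. The name `years_to_detect` is illustrative.

```python
from math import ceil, sqrt
from statistics import NormalDist

TRADING_DAYS = 252  # assumed number of trading days per year


def years_to_detect(sharpe_diff_annual, power=0.80, alpha=0.05, rho=0.0):
    """Approximate years of daily data needed to detect a Sharpe difference.

    The daily standardized effect is sharpe_diff_annual / sqrt(TRADING_DAYS).
    The difference of two standardized daily return series with correlation
    rho has variance 2 * (1 - rho), which gives
        days = 2 * (1 - rho) * ((z_{1-alpha/2} + z_{power}) / d_daily) ** 2
    """
    std_normal = NormalDist()
    z_sum = std_normal.inv_cdf(1 - alpha / 2) + std_normal.inv_cdf(power)
    d_daily = sharpe_diff_annual / sqrt(TRADING_DAYS)
    days = ceil(2 * (1 - rho) * (z_sum / d_daily) ** 2)
    return days / TRADING_DAYS


if __name__ == "__main__":
    print(f"independent streams: {years_to_detect(0.5):.1f} years")
    print(f"rho = 0.9:           {years_to_detect(0.5, rho=0.9):.1f} years")
```

Under these assumptions the independent case comes out at roughly 63 years, and even a correlation of $0.9$ between the strategies still leaves a requirement of a bit over 6 years.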