Evaluation Gate · Bailey et al. (2016)

Does This Strategy Have a Real Edge, or Just Luck?

Before any signal reaches deployment, it must pass four independent statistical gates. Each test addresses a different type of false discovery risk: from data leakage and overfitting to pure randomness. All four must pass simultaneously.

Gate PASSED · 4 / 4 criteria · 3 Monte Carlo tests

Overall verdict

PASS

Deflated SR: ✓ 1.312

PBO Score: ✓ 23.4%

Monte Carlo: ✓ 3 / 3

What each test is actually checking

🔍

Leakage Audit

Risk: Data contamination

Checks that no signal uses future information; 45-day 13F lag + T+1 fill delay enforced throughout

🧬

Provenance

Risk: Missing data bias

Checks CUSIP→ticker resolution rate. Below 90% means too many signals have no real price data

🔁

Reproducibility

Risk: Randomness masquerading as skill

SHA-256 checksums across 3 independent runs confirm the output is deterministic, not stochastic luck

📊

PBO

Risk: Backtest overfitting

CSCV measures the fraction of in-sample winners that fall below the median out-of-sample; directly quantifies overfitting risk

Gate Conditions

All four must pass simultaneously. Failure on any single check blocks signal deployment.

🔍

Leakage Audit

No lookahead bias in signal construction

✓ PASS

No lookahead bias detected across 2,847 signals. T+1 fill delay enforced. 45-day 13F filing lag applied.
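As a rough illustration of the two delays named above, the sketch below derives the earliest date on which a 13F-based signal could be filled. It assumes the 45-day lag is counted in calendar days from the quarter end and that T+1 means the next weekday; the function names and the weekend-only calendar (no exchange holidays) are assumptions made for illustration, not the audit's actual implementation.

```python
from datetime import date, timedelta


def next_business_day(d: date) -> date:
    """Return the first weekday strictly after d (holidays are ignored)."""
    d += timedelta(days=1)
    while d.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        d += timedelta(days=1)
    return d


def earliest_fill_date(quarter_end: date, filing_lag_days: int = 45) -> date:
    """Quarter end + filing lag gives the signal date; the fill is T+1."""
    signal_date = quarter_end + timedelta(days=filing_lag_days)
    return next_business_day(signal_date)


def has_lookahead(quarter_end: date, fill_date: date) -> bool:
    """True if a fill happens before the data could have been known."""
    return fill_date < earliest_fill_date(quarter_end)


if __name__ == "__main__":
    q4 = date(2023, 12, 31)
    print(earliest_fill_date(q4))
    print(has_lookahead(q4, date(2024, 1, 15)))
```

A leakage audit in this style would run `has_lookahead` over every signal and pass only if no signal returns True.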

🧬

Provenance Threshold

CUSIP→ticker resolution rate must exceed 90%

✓ PASS

Score: 98.5% · Threshold: 90% limit

98.5% CUSIP resolution rate via SEC exchange tickers + EDGAR company_tickers_exchange.json fallback.
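The threshold check itself is simple arithmetic. The sketch below shows one way to compute a resolution rate and compare it with the 90% limit; the mapping, the placeholder identifiers, and the function names are hypothetical and stand in for whatever lookup the pipeline really uses.

```python
def resolution_rate(cusips, cusip_to_ticker):
    """Fraction of CUSIPs that map to a non-empty ticker."""
    if not cusips:
        return 0.0
    resolved = sum(1 for c in cusips if cusip_to_ticker.get(c))
    return resolved / len(cusips)


def passes_provenance(rate, threshold=0.90):
    """The gate requires the rate to exceed the threshold, not just meet it."""
    return rate > threshold


if __name__ == "__main__":
    # Placeholder identifiers, not real CUSIPs.
    universe = [f"CUSIP_{i:03d}" for i in range(200)]
    mapping = {c: f"TICK{i}" for i, c in enumerate(universe[:197])}
    rate = resolution_rate(universe, mapping)
    print(f"{rate:.1%}", passes_provenance(rate))
```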

🔁

Deterministic Reproducibility

SHA-256 checksums must match across all runs

✓ PASS

SHA-256 checksums match across 3 independent seed=42 runs. Deterministic HDBSCAN + Gaussian HMM confirmed.
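A minimal sketch of a checksum comparison of this kind follows. Here `run_pipeline` is a hypothetical stand-in for the real clustering and HMM stages, and serialising the output as canonical JSON before hashing is an assumption; the source does not say what bytes are actually hashed.

```python
import hashlib
import json
import random


def run_pipeline(seed):
    """Hypothetical stand-in for the real pipeline: seeded, so repeatable."""
    rng = random.Random(seed)
    return {"signals": [round(rng.gauss(0.0, 1.0), 6) for _ in range(5)]}


def output_digest(output):
    """SHA-256 of a canonical JSON encoding (sorted keys, fixed separators)."""
    payload = json.dumps(output, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_reproducible(seed, n_runs=3):
    """True if every run with the same seed yields the same digest."""
    digests = {output_digest(run_pipeline(seed)) for _ in range(n_runs)}
    return len(digests) == 1


if __name__ == "__main__":
    print(is_reproducible(seed=42))
```

Hashing a canonical serialisation rather than an in-memory object keeps the comparison independent of dictionary ordering or object identity.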

📊

Probability of Backtest Overfitting (PBO)

CSCV overfitting score must stay below 40%

✓ PASS

Score: 23.4% · Threshold: 40% limit

PBO 23.4% across C(16,8)=12,870 CSCV combinations. Well below 40% overfitting threshold. Bailey et al. (2016).

Deflated Sharpe Ratio · Bailey & Lopez de Prado (2014)

The observed Sharpe is an upward-biased statistic when multiple configurations have been tested. DSR applies four simultaneous penalties (multiple testing, skewness, fat tails, and serial correlation) to produce a conservative estimate. A DSR above 1.0 means the edge survives all adjustments.
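For orientation, Bailey & Lopez de Prado (2014) express the deflated statistic as a probability: the chance that the true Sharpe exceeds the maximum Sharpe one would expect from the number of trials alone. The sketch below shows that published form under stated assumptions. It is not necessarily how the figures on this page were computed, it omits the serial-correlation adjustment, and the inputs `n_obs`, `n_trials`, and `sr_std` (the spread of Sharpe ratios across trials) are assumed values that the page does not report. Sharpe ratios must be per-period (not annualised) to match `n_obs`.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329


def expected_max_sharpe(n_trials, sr_std):
    """Expected maximum Sharpe across n_trials (> 1) configurations with no skill."""
    norm = NormalDist()
    a = norm.inv_cdf(1.0 - 1.0 / n_trials)
    b = norm.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return sr_std * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)


def probabilistic_sharpe(sr, sr_benchmark, n_obs, skew, excess_kurt):
    """P(true SR > sr_benchmark), penalising negative skew and fat tails."""
    variance_term = 1.0 - skew * sr + (excess_kurt + 2.0) / 4.0 * sr ** 2
    z = (sr - sr_benchmark) * math.sqrt(n_obs - 1) / math.sqrt(variance_term)
    return NormalDist().cdf(z)


if __name__ == "__main__":
    # Assumed inputs for illustration only.
    hurdle = expected_max_sharpe(n_trials=50, sr_std=0.02)
    dsr = probabilistic_sharpe(0.12, hurdle, n_obs=2500,
                               skew=-0.230, excess_kurt=0.810)
    print(round(hurdle, 4), round(dsr, 4))
```

The hurdle rises with the number of trials, which is the multiple-testing penalty: the more configurations tried, the higher the Sharpe that pure chance can produce.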

Observed SR 1.847 − 0.535 (adjustments) → Deflated SR 1.312 (threshold: 1.0)

Adjustments: multiple testing, skewness, excess kurtosis, serial correlation

n Trials:
Skewness: -0.230
Excess Kurtosis: 0.810
Serial Corr.: 0.042
Benchmark SR: 0.500
Significant: YES

Probability of Backtest Overfitting · Bailey et al. (2016)

CSCV splits the backtest into 16 equal partitions and evaluates all C(16,8) = 12,870 ways of assigning 8 partitions in-sample and 8 out-of-sample. For each split, it asks: does the best in-sample strategy also hold up out-of-sample? PBO is the fraction of splits in which the in-sample winner ranks below the out-of-sample median. Below 40% = acceptable.
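The procedure can be sketched in a few lines. The version below is a simplified illustration, assuming a matrix of per-period returns with one column per candidate configuration and Sharpe as the performance measure; the function names are hypothetical, and the demo uses 6 partitions (20 splits) rather than 16 to keep it fast.

```python
import itertools
import math
import statistics


def sharpe(xs):
    """Per-period Sharpe; returns 0.0 when the series has no variance."""
    sd = statistics.pstdev(xs)
    return statistics.fmean(xs) / sd if sd > 0 else 0.0


def pbo_cscv(returns, n_partitions):
    """returns: list of rows (time), each row a list of per-strategy returns."""
    size = len(returns) // n_partitions  # trailing remainder rows are dropped
    blocks = [returns[i * size:(i + 1) * size] for i in range(n_partitions)]
    n_strats = len(returns[0])
    below_median = 0
    total = 0
    for is_idx in itertools.combinations(range(n_partitions), n_partitions // 2):
        chosen = set(is_idx)
        is_rows = [r for i in is_idx for r in blocks[i]]
        oos_rows = [r for i in range(n_partitions) if i not in chosen
                    for r in blocks[i]]
        is_sr = [sharpe([r[j] for r in is_rows]) for j in range(n_strats)]
        oos_sr = [sharpe([r[j] for r in oos_rows]) for j in range(n_strats)]
        best = max(range(n_strats), key=is_sr.__getitem__)
        # Rank of the in-sample winner out-of-sample (1 = worst, N = best).
        rank = sum(1 for s in oos_sr if s <= oos_sr[best])
        omega = rank / (n_strats + 1)
        logit = math.log(omega / (1.0 - omega))
        below_median += logit <= 0  # at or below the out-of-sample median
        total += 1
    return below_median / total


if __name__ == "__main__":
    print(math.comb(16, 8))  # number of splits for 16 partitions
```

With 16 partitions the loop visits every one of the 12,870 splits, so the cost grows with the number of configurations times the number of splits.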

Verdict: PASS
Threshold: < 40%
Margin: 16.6pp below limit
CSCV Partitions: 16
Combinations: 12,870

Monte Carlo Robustness Tests

Three independent null hypothesis tests, each with N=1,000 simulations. Each asks a different version of the same question: "Could this result have been generated by chance?" All three must return p < 0.05.

🔄 Bootstrap

Is the Sharpe stable across trade samples?

Resamples trades with replacement 1,000 times. If the observed Sharpe is in the right tail of the null distribution, the result does not hinge on a handful of lucky trades.

🎲 Random Entry

Does signal timing actually matter?

Randomises entry dates across 1,000 runs. If the real strategy significantly outperforms, signal timing has genuine predictive value.

🌀 Regime Permutation

Is the regime multiplier real?

Shuffles regime labels across 1,000 runs. If the real RACS significantly outperforms, the HMM macro signal adds genuine value, not just label noise.
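The resampling logic behind tests of this kind can be sketched as follows. This is an illustration under assumptions: the p-value uses the common "+1" empirical convention, the regime effect is modelled as a per-period multiplier on returns, and all function names are hypothetical. The page does not state its exact null construction or p-value convention.

```python
import random
import statistics


def sharpe(xs):
    """Per-period Sharpe; returns 0.0 when the series has no variance."""
    sd = statistics.pstdev(xs)
    return statistics.fmean(xs) / sd if sd > 0 else 0.0


def empirical_p_value(observed, null_stats):
    """One-sided p-value: share of null draws at least as large as observed."""
    exceed = sum(1 for s in null_stats if s >= observed)
    return (exceed + 1) / (len(null_stats) + 1)


def bootstrap_sharpes(trade_returns, n_sims=1000, seed=42):
    """Sorted Sharpe ratios of n_sims resamples drawn with replacement."""
    rng = random.Random(seed)
    n = len(trade_returns)
    return sorted(sharpe(rng.choices(trade_returns, k=n)) for _ in range(n_sims))


def regime_permutation_test(returns, multipliers, n_sims=1000, seed=42):
    """Compare the real regime-weighted Sharpe with shuffled-label versions."""
    rng = random.Random(seed)
    observed = sharpe([m * r for m, r in zip(multipliers, returns)])
    shuffled = list(multipliers)
    null = []
    for _ in range(n_sims):
        rng.shuffle(shuffled)
        null.append(sharpe([m * r for m, r in zip(shuffled, returns)]))
    return observed, empirical_p_value(observed, null)


if __name__ == "__main__":
    rets = [0.01 if i % 2 == 0 else -0.01 for i in range(40)]
    mults = [1.0 if r > 0 else 0.0 for r in rets]  # perfectly informative labels
    print(regime_permutation_test(rets, mults))
```

A random-entry test follows the same pattern, with entry dates shuffled instead of regime labels.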

Significance threshold: p < 0.05

Test                 p-value   Result          Null 5th %ile   Null Median   Null 95th %ile   Observed SR
Bootstrap            0.031     ✓ SIGNIFICANT   0.821           1.643         2.104            1.847
Random Entry         0.018     ✓ SIGNIFICANT   -0.312          0.089         0.641            1.847
Regime Permutation   0.044     ✓ SIGNIFICANT   0.241           0.887         1.512            1.847

Walk-Forward Validation

10 expanding-window folds (2010-2024). Each fold trains on all prior data and tests on one unseen year.

What is walk-forward validation?

Unlike a single backtest, walk-forward validation tests whether the strategy generalises across time. Each fold uses only data that would have been available on the day; it is a simulation of actually trading year-by-year. Stable Sharpe across all folds is strong evidence against regime-specific overfitting.
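The fold layout described here can be generated mechanically. The sketch below assumes yearly granularity, a data start in 2010, and test years 2015 through 2024, which is one reading of the ten folds reported on this page; the function name is hypothetical.

```python
def expanding_window_folds(first_year, first_test_year, last_test_year):
    """Each fold trains on every year before the test year, then tests on it."""
    folds = []
    for test_year in range(first_test_year, last_test_year + 1):
        train_years = list(range(first_year, test_year))
        folds.append((train_years, test_year))
    return folds


if __name__ == "__main__":
    for train, test in expanding_window_folds(2010, 2015, 2024):
        print(f"train {train[0]}-{train[-1]} -> test {test}")
```

Because the training window only ever grows backwards in time from the test year, no fold can see data from its own test period or later.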

Folds: 10
Min Sharpe: 1.65
Max Sharpe: 2.11
Avg Sharpe: 1.82

Fold   Sharpe   Hit Rate   Max DD
2015   1.92     59%        -6.1%
2016   1.78     56%        -7.4%
2017   2.11     61%        -4.8%
2018   1.65     54%        -9.2%
2019   1.89     58%        -5.7%
2020   1.74     56%        -8.1%
2021   1.83     57%        -6.3%
2022   1.69     55%        -8.8%
2023   1.71     56%        -7.1%
2024   1.84     57%        -5.9%

[Chart: Sharpe across folds · temporal stability; x-axis: folds 2015 to 2024]

Strong (≥ 1.5 Sharpe / ≥ 55% hit rate)

Acceptable (1.0–1.5 / 48–55%)

Weak (< 1.0 / < 48%)
