Evaluation Gate · Bailey et al. (2016)

Does This Strategy Have a Real Edge, or Just Luck?

Before any signal reaches deployment, it must pass four independent gates. Each one addresses a different false-discovery risk, from data leakage and overfitting to pure randomness. All four must pass simultaneously.

Gate PASSED · 4 / 4 criteria · 3 Monte Carlo tests
Overall verdict PASS
Deflated SR 1.312
PBO Score 23.4%
Monte Carlo 3 / 3
What each test is actually checking
🔍
Leakage Audit
Risk: Data contamination

Checks that no signal uses future information; 45-day 13F lag + T+1 fill delay enforced throughout

🧬
Provenance
Risk: Missing data bias

Checks CUSIP→ticker resolution rate. Below 90% means too many signals have no real price data

🔁
Reproducibility
Risk: Randomness masquerading as skill

SHA-256 checksums across 3 independent runs confirm the output is deterministic, not stochastic luck

📊
PBO
Risk: Backtest overfitting

CSCV measures how often the in-sample winner ranks below the median out-of-sample; this directly quantifies overfitting risk

Gate Conditions

All four must pass simultaneously. Failure on any single check blocks signal deployment.

🔍
Leakage Audit
No lookahead bias in signal construction
✓ PASS

No lookahead bias detected across 2,847 signals. T+1 fill delay enforced. 45-day 13F filing lag applied.
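The two timing rules above can be sketched as a per-signal check. This is a minimal sketch, not the audit the pipeline actually runs: the `Signal` fields and function names are hypothetical, the 45-day lag is assumed to be measured from the quarter end the holdings refer to, and the business-day logic ignores exchange holidays.

```python
from dataclasses import dataclass
from datetime import date, timedelta

FILING_LAG_DAYS = 45  # 13F lag stated in the audit


@dataclass(frozen=True)
class Signal:
    quarter_end: date  # period the 13F holdings refer to (assumed lag anchor)
    signal_date: date  # date the signal was generated
    fill_date: date    # date the backtest assumes the trade fills


def next_business_day(d: date) -> date:
    """Next weekday after d. Assumption: exchange holidays are ignored."""
    d += timedelta(days=1)
    while d.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        d += timedelta(days=1)
    return d


def has_lookahead(s: Signal) -> bool:
    """True if the signal uses 13F data too early or fills before T+1."""
    earliest_signal = s.quarter_end + timedelta(days=FILING_LAG_DAYS)
    earliest_fill = next_business_day(s.signal_date)
    return s.signal_date < earliest_signal or s.fill_date < earliest_fill
```

Running `has_lookahead` over every signal and requiring zero violations would correspond to the PASS condition described here.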

🧬
Provenance Threshold
CUSIP→ticker resolution rate must exceed 90%
✓ PASS
Score 98.5% · Threshold 90% limit

98.5% CUSIP resolution rate via SEC exchange tickers + EDGAR company_tickers_exchange.json fallback.
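The resolution rate can be sketched as a primary lookup followed by a fallback lookup. The two dictionaries below are hypothetical stand-ins for the SEC exchange-ticker source and the EDGAR fallback; how those sources are actually parsed into CUSIP-keyed maps is not shown on this page and is not assumed here.

```python
from typing import Dict, Iterable, Optional

PROVENANCE_THRESHOLD = 0.90  # gate limit stated above


def resolve(cusip: str,
            primary: Dict[str, str],
            fallback: Dict[str, str]) -> Optional[str]:
    """Return a ticker from the primary map, else the fallback map, else None."""
    return primary.get(cusip) or fallback.get(cusip)


def resolution_rate(cusips: Iterable[str],
                    primary: Dict[str, str],
                    fallback: Dict[str, str]) -> float:
    """Fraction of CUSIPs that resolve to a ticker (0.0 for an empty input)."""
    cusips = list(cusips)
    if not cusips:
        return 0.0
    resolved = sum(1 for c in cusips if resolve(c, primary, fallback))
    return resolved / len(cusips)


def provenance_passes(rate: float,
                      threshold: float = PROVENANCE_THRESHOLD) -> bool:
    """The gate requires the rate to exceed the threshold."""
    return rate > threshold
```

With the reported figure, `provenance_passes(0.985)` returns `True`.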

🔁
Deterministic Reproducibility
SHA-256 checksums must match across all runs
✓ PASS

SHA-256 checksums match across 3 independent seed=42 runs. Deterministic HDBSCAN + Gaussian HMM confirmed.
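A checksum comparison of this kind can be sketched with the standard library. `run_pipeline` below is a hypothetical stand-in for the real HDBSCAN + HMM pipeline, and serialising the output as canonical JSON before hashing is an assumption about how the checksum is taken.

```python
import hashlib
import json
import random
from typing import Callable, List


def output_checksum(records: List[dict]) -> str:
    """SHA-256 of a canonical JSON serialisation of the output records."""
    payload = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_pipeline(seed: int) -> List[dict]:
    """Hypothetical pipeline: any seeded, deterministic computation."""
    rng = random.Random(seed)
    return [{"signal_id": i, "score": round(rng.random(), 6)} for i in range(5)]


def is_reproducible(run: Callable[[int], List[dict]],
                    seed: int = 42,
                    n_runs: int = 3) -> bool:
    """True if every run with the same seed yields the same checksum."""
    checksums = {output_checksum(run(seed)) for _ in range(n_runs)}
    return len(checksums) == 1
```

Sorting keys and fixing separators keeps the byte stream stable, so a checksum mismatch points at the computation rather than at serialisation order.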

📊
Probability of Backtest Overfitting (PBO)
CSCV overfitting score must stay below 40%
✓ PASS
Score 23.4% · Threshold 40% limit

PBO 23.4% across C(16,8)=12,870 CSCV combinations. Well below 40% overfitting threshold. Bailey et al. (2016).
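The all-or-nothing rule across the four conditions can be sketched as follows. The `GateInputs` container and `evaluate_gate` function are hypothetical names; the two thresholds are the ones stated in the cards above.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class GateInputs:
    leakage_free: bool            # leakage audit found no lookahead
    cusip_resolution_rate: float  # as a fraction, e.g. 0.985
    checksums_match: bool         # SHA-256 identical across runs
    pbo: float                    # as a fraction, e.g. 0.234


def evaluate_gate(g: GateInputs,
                  min_resolution: float = 0.90,
                  max_pbo: float = 0.40) -> Tuple[bool, Dict[str, bool]]:
    """Return (overall pass, per-check results); any failure blocks deployment."""
    checks = {
        "leakage_audit": g.leakage_free,
        "provenance": g.cusip_resolution_rate > min_resolution,
        "reproducibility": g.checksums_match,
        "pbo": g.pbo < max_pbo,
    }
    return all(checks.values()), checks
```

Feeding in the values reported above (`True`, `0.985`, `True`, `0.234`) yields an overall pass with all four checks true.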

Deflated Sharpe Ratio · Bailey & Lopez de Prado (2014)

The observed Sharpe ratio is biased upward when multiple configurations have been tested. DSR applies four simultaneous penalties (multiple testing, skewness, fat tails, and serial correlation) to produce a conservative estimate. Here it is reported as the observed SR minus the combined adjustments; a DSR above 1.0 means the edge survives all of them.

Observed SR 1.847
Adjustments 0.535 (multiple testing, skewness, excess kurtosis, serial correlation)
Deflated SR 1.312 (threshold 1.0)
n Trials 21
Skewness -0.230
Excess Kurtosis 0.810
Serial Corr. 0.042
Benchmark SR 0.500
Significant YES
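For reference, the deflated Sharpe ratio as published by Bailey & Lopez de Prado is expressed as a probability: the probabilistic Sharpe ratio evaluated against the expected maximum Sharpe of unskilled trials. The sketch below follows that published form under stated assumptions. It takes a per-period (not annualised) Sharpe, and `n_obs` and `sr_variance` (the variance of Sharpe ratios across trials) are inputs this page does not report, so the values in the usage note are purely illustrative. It is not necessarily how the adjusted figure above was computed.

```python
from math import e, sqrt
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant
_STD_NORMAL = NormalDist()


def expected_max_sharpe(n_trials: int, sr_variance: float) -> float:
    """Expected maximum Sharpe across n_trials unskilled trials (n_trials >= 2)."""
    z1 = _STD_NORMAL.inv_cdf(1 - 1 / n_trials)
    z2 = _STD_NORMAL.inv_cdf(1 - 1 / (n_trials * e))
    return sqrt(sr_variance) * ((1 - EULER_GAMMA) * z1 + EULER_GAMMA * z2)


def deflated_sharpe_probability(sr: float, n_obs: int, skew: float,
                                excess_kurt: float, n_trials: int,
                                sr_variance: float) -> float:
    """Probability that the true Sharpe exceeds the multiple-testing benchmark.

    sr is per-period; excess_kurt is kurtosis minus 3.
    """
    sr0 = expected_max_sharpe(n_trials, sr_variance)
    # (kurtosis - 1) / 4 written in terms of excess kurtosis
    denom = sqrt(1 - skew * sr + (excess_kurt + 2) / 4 * sr ** 2)
    return _STD_NORMAL.cdf((sr - sr0) * sqrt(n_obs - 1) / denom)
```

A call such as `deflated_sharpe_probability(1.847 / sqrt(252), n_obs=1250, skew=-0.230, excess_kurt=0.810, n_trials=21, sr_variance=0.001)` uses the reported skewness, kurtosis, and trial count, with a hypothetical sample length and trial variance. Raising `n_trials` lowers the result, which is the multiple-testing penalty at work.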
Probability of Backtest Overfitting · Bailey et al. (2016)

CSCV splits the backtest into 16 equal partitions and evaluates all C(16,8) = 12,870 ways of assigning 8 partitions to the in-sample set and the remaining 8 to the out-of-sample set. For each split, it asks: does the in-sample best strategy still rank above the median out-of-sample? PBO is the fraction of splits where it does not. Below 40% = acceptable.

PBO score 23.4% (scale 0% to 100%, limit 40%)
Verdict PASS
Threshold < 40%
Margin 16.6pp below limit
CSCV Partitions 16
Combinations 12,870
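The CSCV procedure described above can be sketched with NumPy. This is a minimal sketch under assumptions: `returns` is a hypothetical T x N matrix of per-period returns for N tested configurations, Sharpe is the per-period mean over standard deviation, partitions are near-equal row blocks, and a split counts as overfit when the logit of the winner's out-of-sample relative rank is at or below zero.

```python
from itertools import combinations

import numpy as np


def sharpe(returns: np.ndarray) -> np.ndarray:
    """Per-period Sharpe of each column (not annualised)."""
    return returns.mean(axis=0) / returns.std(axis=0, ddof=1)


def pbo_cscv(returns: np.ndarray, n_partitions: int = 16):
    """Return (PBO, number of splits) for a T x N matrix of strategy returns."""
    n_rows, n_strats = returns.shape
    blocks = np.array_split(np.arange(n_rows), n_partitions)
    logits = []
    for is_ids in combinations(range(n_partitions), n_partitions // 2):
        chosen = set(is_ids)
        is_rows = np.concatenate([blocks[i] for i in is_ids])
        oos_rows = np.concatenate(
            [blocks[i] for i in range(n_partitions) if i not in chosen]
        )
        best = int(np.argmax(sharpe(returns[is_rows])))  # in-sample winner
        oos_sr = sharpe(returns[oos_rows])
        rank = int((oos_sr < oos_sr[best]).sum()) + 1    # 1 = worst, N = best
        omega = rank / (n_strats + 1)                    # relative rank in (0, 1)
        logits.append(np.log(omega / (1 - omega)))
    logits = np.asarray(logits)
    return float((logits <= 0).mean()), logits.size
```

With `n_partitions=16` the loop visits the 12,870 splits quoted above; a smaller value such as 8 (70 splits) is convenient for quick checks.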

Monte Carlo Robustness Tests

Three independent null hypothesis tests, each with N=1,000 simulations. Each asks a different question: "Could this result have been generated by chance?" All three must return p < 0.05.

🔄Bootstrap
Is the Sharpe stable across trade samples?

Resamples trades with replacement 1,000 times. If the observed Sharpe is in the right tail of the null distribution, timing is not the explanation.

🎲Random Entry
Does signal timing actually matter?

Randomises entry dates across 1,000 runs. If the real strategy significantly outperforms, signal timing has genuine predictive value.

🌀Regime Permutation
Is the regime multiplier real?

Shuffles regime labels across 1,000 runs. If the real RACS significantly outperforms, the HMM macro signal adds genuine value, not just label noise.
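The shared pattern behind these tests (build a null distribution by simulation, then locate the observed Sharpe in it) can be sketched for the random-entry case. This is a simplified sketch under assumptions: each trade is reduced to a single-period return drawn from a hypothetical `market_returns` array, Sharpe is per-period, and the p-value uses the add-one convention so it is never exactly zero. The function names are illustrative.

```python
import numpy as np


def sharpe(returns: np.ndarray) -> float:
    """Per-period Sharpe ratio (not annualised)."""
    return float(returns.mean() / returns.std(ddof=1))


def mc_p_value(observed: float, null_stats) -> float:
    """One-sided p-value: share of null draws at least as large as observed."""
    null_stats = np.asarray(null_stats)
    return float((1 + (null_stats >= observed).sum()) / (1 + null_stats.size))


def random_entry_null(market_returns: np.ndarray, n_trades: int,
                      n_sims: int = 1000, seed: int = 42) -> np.ndarray:
    """Sharpe of n_trades randomly timed entries, repeated n_sims times."""
    rng = np.random.default_rng(seed)
    null = np.empty(n_sims)
    for i in range(n_sims):
        idx = rng.choice(market_returns.size, size=n_trades, replace=False)
        null[i] = sharpe(market_returns[idx])
    return null
```

The regime-permutation test follows the same shape, with `rng.permutation` applied to the regime labels in place of random entry dates.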

Bootstrap
p = 0.031 · threshold p < 0.05
✓ SIGNIFICANT
Observed SR 1.847
Null 5th %ile 0.821
Null Median 1.643
Null 95th %ile 2.104

Random Entry
p = 0.018 · threshold p < 0.05
✓ SIGNIFICANT
Observed SR 1.847
Null 5th %ile -0.312
Null Median 0.089
Null 95th %ile 0.641

Regime Permutation
p = 0.044 · threshold p < 0.05
✓ SIGNIFICANT
Observed SR 1.847
Null 5th %ile 0.241
Null Median 0.887
Null 95th %ile 1.512

Walk-Forward Validation

10 expanding-window folds (2010-2024). Each fold trains on all prior data and tests on one unseen year.

What is walk-forward validation?

Unlike a single backtest, walk-forward validation tests whether the strategy generalises across time. Each fold uses only data that would have been available on the day; it is a simulation of actually trading year-by-year. Stable Sharpe across all folds is strong evidence against regime-specific overfitting.
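The expanding-window split can be sketched as a small generator. The function name is hypothetical; the arguments mirror the figures on this page (data from 2010 to 2024, 10 folds), which puts the test years at 2015 through 2024.

```python
from typing import Iterator, List, Tuple


def expanding_window_folds(first_year: int, last_year: int,
                           n_folds: int) -> Iterator[Tuple[List[int], int]]:
    """Yield (train_years, test_year); each fold trains on all prior years."""
    first_test_year = last_year - n_folds + 1
    if first_test_year <= first_year:
        raise ValueError("not enough history for the requested number of folds")
    for test_year in range(first_test_year, last_year + 1):
        yield list(range(first_year, test_year)), test_year


if __name__ == "__main__":
    for train_years, test_year in expanding_window_folds(2010, 2024, 10):
        print(f"train {train_years[0]}-{train_years[-1]} -> test {test_year}")
```

Because the training window only ever grows forward in time, no fold can see data from its own test year or later.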

Folds 10
Min Sharpe 1.65
Max Sharpe 2.11
Avg Sharpe 1.82
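| Fold | 2015 | 2016 | 2017 | 2018 | 2019 | 2020 | 2021 | 2022 | 2023 | 2024 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Sharpe | 1.92 | 1.78 | 2.11 | 1.65 | 1.89 | 1.74 | 1.83 | 1.69 | 1.71 | 1.84 |
| Hit Rate | 59% | 56% | 61% | 54% | 58% | 56% | 57% | 55% | 56% | 57% |
| Max DD | -6.1% | -7.4% | -4.8% | -9.2% | -5.7% | -8.1% | -6.3% | -8.8% | -7.1% | -5.9% |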
[Chart: Sharpe across folds · temporal stability; x-axis: fold year, 2015 to 2024]
Strong (≥ 1.5 Sharpe / ≥ 55% hit rate)
Acceptable (1.0–1.5 / 48–55%)
Weak (< 1.0 / < 48%)