Andria Systems · Research Methodology

Data Science Applied to Institutional Capital Flows

Every modelling decision maps to a peer-reviewed academic standard. We replace traditional discretionary analysis with unsupervised machine learning to extract behavioural alpha from 20 years of SEC filings.

Raw Filings: 116M
Quarters: 81
Unique Managers: 8,934
CUSIP Mappings: 3.4M

1. Signal Engine: RACS Formula

A composite score synthesising institutional consensus, activist conviction, crowding risk, and macro sensitivity.

RACS = consensus_weight × log(activist_buyers + 1.1) × (1 − crowding_penalty) × (1 ± regime_weight × prob)
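As a rough illustration, the product can be evaluated for a single security as in the sketch below. It assumes a natural logarithm, a crowding_penalty in [0, 1], and a signed regime_weight (positive amplifies, negative dampens) standing in for the ± sign. The function name `racs_score`, the argument ranges, and the sample inputs are illustrative assumptions, not definitions taken from the pipeline.

```python
import math

def racs_score(consensus_weight, activist_buyers, crowding_penalty,
               regime_weight, regime_prob):
    """Evaluate the RACS product for one security (illustrative sketch only)."""
    if not 0.0 <= crowding_penalty <= 1.0:
        raise ValueError("crowding_penalty is assumed to lie in [0, 1]")
    if not 0.0 <= regime_prob <= 1.0:
        raise ValueError("regime_prob must be a probability")
    conviction = math.log(activist_buyers + 1.1)   # stays positive with zero buyers
    crowding = 1.0 - crowding_penalty              # heavier crowding shrinks the score
    regime = 1.0 + regime_weight * regime_prob     # signed weight replaces the ± sign
    return consensus_weight * conviction * crowding * regime

# Hypothetical inputs: the same security with and without a +15% regime tilt.
base = racs_score(0.6, 3, 0.25, 0.0, 1.0)
boosted = racs_score(0.6, 3, 0.25, 0.15, 1.0)
```

Because the regime term enters multiplicatively, a regime known with certainty (prob = 1) scales the score by exactly 1 + regime_weight, and a less certain regime scales it proportionally less.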

2. Unsupervised Learning: Manager DNA

Segmentation of 8,934 institutional managers into behavioural archetypes using dimensionality reduction and density clustering.

The 14-Feature Space

Each manager is mapped to a 14-dimensional behavioural vector per quarter. No fundamental or price data is used; only trading behaviour.

portfolio_hhi, mean_holding_duration, turnover_rate, activist_frequency, aum_log, n_holdings, momentum_tilt, value_tilt, sector_concentration, filing_lag_days, small_cap_pct, new_position_rate, avg_conviction, regime_sensitivity
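The feature names are listed without definitions, so the sketch below shows how two of them could be derived from holdings alone. It assumes portfolio_hhi is the Herfindahl-Hirschman index of position weights and turnover_rate is one-sided turnover between consecutive quarters. Both readings, the helper names, and the sample holdings are assumptions made for illustration.

```python
def position_weights(holdings):
    """Convert {cusip: market_value} into weights that sum to 1."""
    total = sum(holdings.values())
    if total <= 0:
        raise ValueError("portfolio has no positive market value")
    return {cusip: value / total for cusip, value in holdings.items()}

def portfolio_hhi(holdings):
    """Herfindahl-Hirschman index: sum of squared weights, in (0, 1]."""
    return sum(w * w for w in position_weights(holdings).values())

def turnover_rate(prev_holdings, curr_holdings):
    """One-sided turnover: half the sum of absolute weight changes."""
    prev_w = position_weights(prev_holdings)
    curr_w = position_weights(curr_holdings)
    cusips = set(prev_w) | set(curr_w)
    return 0.5 * sum(abs(curr_w.get(c, 0.0) - prev_w.get(c, 0.0)) for c in cusips)

# Hypothetical holdings for two consecutive quarters (placeholder identifiers).
q1 = {"AAA": 50.0, "BBB": 30.0, "CCC": 20.0}
q2 = {"AAA": 50.0, "BBB": 50.0}
hhi_q1 = portfolio_hhi(q1)
turnover = turnover_rate(q1, q2)
```

Both quantities depend only on reported positions, which is consistent with the statement that no fundamental or price data enters the feature space.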
UMAP Projection
Tends to preserve more of the global manifold structure than t-SNE while retaining local neighbourhoods.
HDBSCAN Clustering
Identifies variable-density clusters without requiring a fixed k. Unclustered points are explicitly labelled as Noise.
Conviction Activists
Index Huggers
Macro Tourists
Nimble Traders

Managers exist in a high-dimensional space defined by 14 behavioural features (turnover, concentration, etc.).

3. Macro Intelligence: Gaussian HMM

A 4-state Hidden Markov Model trained on macroeconomic indicators (VIX, yield curve, credit spreads, Fed funds, OFR stress).

Hidden States (Emission Means)
Goldilocks
Low VIX, steep yield curve, tight spreads. Risk-on. RACS amplified +15%.
Recovery
VIX normalising, curve re-steepening. Selective risk-on. RACS amplified +8%.
Rate Shock
Fed hiking, curve flattening/inverting. Duration risk. RACS dampened −12%.
Recession Fear
Elevated VIX, credit spreads blowing out. Defensive. RACS dampened −20%.
Transition Probability Matrix

FROM \ TO         Goldilocks   Recovery   Rate Shock   Recession Fear
Goldilocks           85%         10%          4%            1%
Recovery             15%         75%          8%            2%
Rate Shock            2%          8%         70%           20%
Recession Fear        5%         25%         10%           60%

The matrix shows the learned probability of transitioning from one state to another. Note the high persistence (diagonal) typical of macroeconomic regimes.
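The persistence on the diagonal can be made concrete with a short calculation. The sketch below takes the matrix as given and derives the expected length of stay in each state, which is 1 / (1 − p_stay) for a Markov chain, and the long-run share of time in each state by power iteration. The length of one time step is not stated in the text, so durations are expressed in steps.

```python
STATES = ["Goldilocks", "Recovery", "Rate Shock", "Recession Fear"]
P = [
    [0.85, 0.10, 0.04, 0.01],
    [0.15, 0.75, 0.08, 0.02],
    [0.02, 0.08, 0.70, 0.20],
    [0.05, 0.25, 0.10, 0.60],
]

def expected_duration(p_stay):
    """Mean number of consecutive steps spent in a state (geometric waiting time)."""
    return 1.0 / (1.0 - p_stay)

def stationary_distribution(matrix, iterations=500):
    """Power iteration: repeatedly apply pi <- pi P from a uniform start."""
    n = len(matrix)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * matrix[i][j] for i in range(n)) for j in range(n)]
    return pi

# Every row of a transition matrix must sum to one.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in P)

durations = {s: expected_duration(P[i][i]) for i, s in enumerate(STATES)}
long_run = dict(zip(STATES, stationary_distribution(P)))
```

With these figures a Recovery spell lasts 1 / (1 − 0.75) = 4 steps on average, while a Recession Fear spell lasts 1 / (1 − 0.60) = 2.5 steps.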

4. Statistical Robustness

Based on Bailey et al. (2016). A signal must pass all gates simultaneously to be deployed.

1. Deflated Sharpe Ratio
Adjusts for multiple testing & non-normality.
DSR = SR_obs / √(1 + penalty) × √(T)

Penalises the observed Sharpe for the number of configs tested (multiple testing bias), skewness, excess kurtosis, and serial autocorrelation. DSR > 1.0 required.
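For comparison, the formulation published by Bailey & Lopez de Prado (2014) can be sketched as follows. Note that it returns a probability between 0 and 1 rather than a ratio, so it is not the same quantity as the simplified expression above and the 1.0 threshold does not carry over to it. The function names are illustrative, and the sketch covers multiple testing, skewness, and kurtosis only; it makes no adjustment for serial autocorrelation.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329
_N = NormalDist()

def expected_max_sharpe(n_trials, sharpe_variance):
    """Expected maximum Sharpe across n_trials configurations with no true skill."""
    if n_trials < 2:
        return 0.0                     # a single trial carries no selection bias
    a = _N.inv_cdf(1.0 - 1.0 / n_trials)
    b = _N.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return math.sqrt(sharpe_variance) * ((1 - EULER_GAMMA) * a + EULER_GAMMA * b)

def deflated_sharpe(sr_obs, n_obs, skew, kurt, n_trials, sharpe_variance):
    """Probability that the true Sharpe exceeds the multiple-testing benchmark.

    sr_obs is per-period (not annualised); kurt is raw kurtosis (normal = 3);
    sharpe_variance is the variance of Sharpe ratios across the trials.
    """
    sr0 = expected_max_sharpe(n_trials, sharpe_variance)
    denom = math.sqrt(1.0 - skew * sr_obs + (kurt - 1.0) / 4.0 * sr_obs ** 2)
    return _N.cdf((sr_obs - sr0) * math.sqrt(n_obs - 1) / denom)

# Hypothetical inputs: the same observed Sharpe after 1 trial versus 100 trials.
dsr_single = deflated_sharpe(0.1, 81, 0.0, 3.0, 1, 0.01)
dsr_many = deflated_sharpe(0.1, 81, 0.0, 3.0, 100, 0.01)
```

The second call yields a lower value than the first because the benchmark Sharpe rises with the number of configurations tried.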

2. Probability of Backtest Overfitting
Combinatorially Symmetric Cross-Validation.
PBO = P(rank(OOS_opt) < 0.5 | IS_opt)

Splits the data into 16 partitions, giving C(16, 8) = 12,870 train/test combinations. PBO is the fraction of combinations in which the in-sample optimal strategy ranks below the median out-of-sample. PBO < 40% required.
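A compact sketch of the CSCV procedure follows. It assumes mean return as the performance metric for brevity (a Sharpe ratio could be substituted), counts a logit of exactly zero as overfit, and uses a small hypothetical panel with 4 blocks rather than 16 so that it runs instantly. The function name `pbo_cscv` and the panel are illustrative.

```python
import math
from itertools import combinations

def mean(xs):
    return sum(xs) / len(xs)

def pbo_cscv(returns, n_blocks):
    """returns[t][s] is the return of strategy s in period t; n_blocks must be even."""
    n_periods, n_strats = len(returns), len(returns[0])
    size = n_periods // n_blocks                     # any trailing remainder is dropped
    blocks = [range(b * size, (b + 1) * size) for b in range(n_blocks)]
    logits = []
    for is_ids in combinations(range(n_blocks), n_blocks // 2):
        is_rows = [t for b in is_ids for t in blocks[b]]
        oos_rows = [t for b in range(n_blocks) if b not in is_ids for t in blocks[b]]
        is_perf = [mean([returns[t][s] for t in is_rows]) for s in range(n_strats)]
        oos_perf = [mean([returns[t][s] for t in oos_rows]) for s in range(n_strats)]
        best = max(range(n_strats), key=is_perf.__getitem__)     # in-sample winner
        rank = sum(1 for p in oos_perf if p <= oos_perf[best])   # 1 .. n_strats
        omega = rank / (n_strats + 1)                            # relative rank in (0, 1)
        logits.append(math.log(omega / (1.0 - omega)))
    return sum(1 for x in logits if x <= 0) / len(logits)

# 16 partitions split evenly give the number of combinations quoted in the text.
assert math.comb(16, 8) == 12870

# Hypothetical panel in which strategy 0 beats the others in every period.
panel = [[0.02, 0.01 * (t % 3 - 1), -0.01] for t in range(32)]
pbo = pbo_cscv(panel, n_blocks=4)
```

A strategy that dominates in every period is always the in-sample winner and always ranks top out-of-sample, so this panel produces no overfit splits.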

3. Monte Carlo: Bootstrap
Resamples returns with replacement 1,000 times.
H0: SR_obs ∈ Null Distribution

If the observed Sharpe falls in the top 5% of the null distribution (p < 0.05), the signal's performance is unlikely to be attributable to lucky draws of positive-return days.
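One way such a test might be set up is sketched below. The text does not say how the null distribution is built, so the sketch assumes it comes from resampling mean-centred returns, which imposes a zero-edge null. It also draws observations independently, which ignores serial dependence; a block bootstrap would be needed to preserve it. The function names and the synthetic return series are illustrative.

```python
import random
import statistics

def sharpe(returns):
    """Per-period Sharpe ratio; returns 0 when the sample has no dispersion."""
    sd = statistics.stdev(returns)
    return statistics.fmean(returns) / sd if sd > 0 else 0.0

def bootstrap_p_value(returns, n_resamples=1000, seed=0):
    """One-sided p-value for the observed Sharpe under a zero-mean null."""
    rng = random.Random(seed)
    observed = sharpe(returns)
    mu = statistics.fmean(returns)
    centred = [r - mu for r in returns]                  # impose H0: no edge
    exceed = 0
    for _ in range(n_resamples):
        sample = rng.choices(centred, k=len(centred))    # draw with replacement
        if sharpe(sample) >= observed:
            exceed += 1
    return (exceed + 1) / (n_resamples + 1)              # add-one keeps p above zero

# Synthetic return series with a clear positive drift.
daily = [0.01 + 0.005 * ((i % 5) - 2) for i in range(250)]
p = bootstrap_p_value(daily)
```

The add-one correction means the smallest attainable p-value is 1 / (n_resamples + 1), so 1,000 resamples can resolve significance down to roughly 0.001.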

4. Monte Carlo: Regime Permutation
Randomises HMM labels 1,000 times.
H0: Regime conditioning adds no value

Shuffles regime labels while keeping returns fixed. Tests whether the regime-conditioning multiplier adds genuine alpha or is merely a post-hoc rationalisation.
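The permutation logic can be sketched as follows. The sketch assumes the regime multiplier scales each period's exposure directly and that the test statistic is the mean conditioned return. Both are simplifications chosen for illustration, since the pipeline applies the multiplier to the RACS score rather than to returns. The multipliers come from the hidden-state descriptions in section 3; the function names and the synthetic data are assumptions.

```python
import random
import statistics

# Signed adjustments from the hidden-state descriptions in section 3.
MULTIPLIER = {"Goldilocks": 1.15, "Recovery": 1.08,
              "Rate Shock": 0.88, "Recession Fear": 0.80}

def conditioned_mean(returns, labels):
    """Mean return after scaling each period by its regime multiplier."""
    return statistics.fmean(r * MULTIPLIER[s] for r, s in zip(returns, labels))

def regime_permutation_p_value(returns, labels, n_perm=1000, seed=0):
    """Share of label shuffles that match or beat the observed statistic."""
    rng = random.Random(seed)
    observed = conditioned_mean(returns, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)                     # returns stay in their original order
        if conditioned_mean(returns, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Synthetic case in which the labels are perfectly informative.
labels = ["Goldilocks", "Recession Fear"] * 100
returns = [0.01, -0.01] * 100
p = regime_permutation_p_value(returns, labels)
```

Shuffling breaks any link between labels and returns while leaving both marginal distributions intact, so a small p-value suggests the alignment itself is carrying the benefit.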

Walk-Forward Validation
Fold k: train = [2004_Q1, ... , T_k] test = [T_k + 1Q, ... , T_k + 4Q]

Expanding-window out-of-sample evaluation across 10 folds (2010–2024). No look-ahead bias: the regime model is retrained entirely from scratch in every fold. Transaction costs (5–12 bps) and a realistic 45-day filing lag are strictly enforced.
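The fold layout can be generated mechanically, as in the sketch below. The training start (2004 Q1) and the four-quarter test window come from the fold definition; the first test quarter and the spacing between folds (`step`) are not stated, so the values used here are placeholders. The function names are illustrative.

```python
def quarter_range(start, n):
    """Return n consecutive (year, quarter) pairs beginning at start."""
    year, q = start
    out = []
    for _ in range(n):
        out.append((year, q))
        year, q = (year + 1, 1) if q == 4 else (year, q + 1)
    return out

def quarter_index(yq):
    """Map (year, quarter) to a running integer so quarters can be subtracted."""
    return yq[0] * 4 + (yq[1] - 1)

def walk_forward_folds(first, first_test, n_folds, test_len=4, step=4):
    """Expanding-window folds: every training set starts at `first`."""
    gap = quarter_index(first_test) - quarter_index(first)
    folds = []
    for k in range(n_folds):
        train_len = gap + k * step                 # training window grows each fold
        quarters = quarter_range(first, train_len + test_len)
        folds.append((quarters[:train_len], quarters[train_len:]))
    return folds

# Placeholder configuration: first test window assumed to open in 2010 Q1.
folds = walk_forward_folds((2004, 1), (2010, 1), n_folds=10)
train_0, test_0 = folds[0]
```

Each test window starts strictly after its training window ends, which is the property that rules out look-ahead. Any model fitted inside a fold should see only that fold's training quarters.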

Factor Attribution
R_p - R_f = α + β_MKT(MKT) + β_SMB(SMB) + β_HML(HML) + β_RMW(RMW) + β_CMA(CMA) + β_MOM(MOM)

OLS regression of RACS portfolio returns against the Fama-French 5-Factor + Momentum model. A statistically significant alpha (t-stat > 2.0) indicates that the strategy captures idiosyncratic edge rather than disguised beta.
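The regression and the alpha t-statistic can be reproduced with a small dependency-free sketch. It uses classical OLS standard errors, which is an assumption; the text does not say whether robust errors such as Newey-West are applied. The function names and the synthetic two-factor data are illustrative, and the same code accepts all six factor columns.

```python
import math

def invert(a):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(a)
    m = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [row[n:] for row in m]

def ols_alpha(excess_returns, factors):
    """Regress excess returns on factor rows; return (alpha, betas, alpha t-stat)."""
    X = [[1.0] + list(row) for row in factors]            # intercept column = alpha
    n, k = len(X), len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(X, excess_returns)) for i in range(k)]
    inv = invert(xtx)
    coef = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [y - sum(c * x for c, x in zip(coef, r)) for r, y in zip(X, excess_returns)]
    sigma2 = sum(e * e for e in resid) / (n - k)           # residual variance
    se_alpha = math.sqrt(sigma2 * inv[0][0])               # classical standard error
    return coef[0], coef[1:], coef[0] / se_alpha

# Synthetic data: true alpha 0.002, betas 1.2 and 0.5, small deterministic noise.
factors = [(0.01 * math.sin(t), 0.01 * math.cos(2 * t)) for t in range(40)]
noise = [0.001 * (((t * 7) % 11) - 5) / 5 for t in range(40)]
y = [0.002 + 1.2 * f[0] + 0.5 * f[1] + e for f, e in zip(factors, noise)]
alpha, betas, t_alpha = ols_alpha(y, factors)
```

The t-statistic divides the estimated intercept by its standard error, so it grows with both the size of alpha and the number of observations.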

Academic Foundation

Peer-reviewed papers underpinning the pipeline.

The Probability of Backtest Overfitting
Bailey, Borwein, Lopez de Prado & Zhu (2016)
Journal of Computational Finance
The Deflated Sharpe Ratio: Correcting for Selection Bias...
Bailey & Lopez de Prado (2014)
Journal of Portfolio Management
UMAP: Uniform Manifold Approximation and Projection
McInnes, Healy, & Melville (2018)
arXiv:1802.03426
Density-Based Clustering Based on Hierarchical Density Estimates
Campello, Moulavi, & Sander (2013)
PAKDD 2013
A Five-Factor Asset Pricing Model
Fama & French (2015)
Journal of Financial Economics