These notes are freely taken from the book Machine Learning for Asset Managers (Elements in Quantitative Finance). They are often incomplete and not 100% accurate, as I intentionally skipped mathematical proofs and focused mostly on the practical concepts.

Index

Introduction
Denoising and Detoning
Distance Metrics
Optimal Clustering
Financial Labels
Feature Importance Analysis
Portfolio Construction
Testing Set Overfitting
Bibliography

1. Introduction

Why Not to Do Backtesting

Contrary to popular belief, backtesting is not a research tool.
Backtests can never prove that a strategy is a true positive.

Never develop a strategy just because of backtests.
Strategies must be supported by theory, not only historical simulations.

Your theories must be general enough to explain particular cases, including extreme events (black swans).

Role of Machine Learning (ML)

ML helps discover hidden variables.
ML itself does not reveal the relations between variables. Therefore:
- Formulate a theory that binds the elements together.
- Test the theory in different contexts (even where ML found no signal).

Once the theory has been tested, it should stand on its own.
The theory, not the ML algorithm, should make predictions.

Uses of ML

Existence: ML can indicate the presence of previously unknown relations by predicting outcomes.
Importance: Feature-importance measures reveal which variables matter more for model performance.
Causation: ML can support causal inference by comparing predictions.
Reductionism: Dimensionality reduction can reveal low-dimensional structure and clusters.
Retrieval: ML scans large datasets to find rare events.
Outlier Detection: ML identifies anomalies using learned patterns.

Types of Overfitting

Train-set overfitting: model fits noise and signal in the training data (learns tiny irrelevant movements).
Test-set overfitting: repeatedly testing many models on the same test set and selecting the best-performing one.

A useful diagnostic is the generalization gap:

Generalization Gap G = E_{test} - E_{train},

where $E_{train}$ and $E_{test}$ are the expected errors (or loss) on training and test sets respectively. A large positive $G$ indicates overfitting.

Overfitting Diagram

2. Denoising and Detoning

Covariance Matrix

Covariance matrices are empirically obtained, so they contain a significant amount of noise.
If used directly (e.g., for portfolio optimization), this noise can make results unstable.

The Marčenko–Pastur Theorem

The theorem tells us what the eigenvalues of a covariance (or correlation) matrix should look like if the data is completely random.

If we compute the covariance matrix of asset returns, some of the eigenvalues may just be noise (random correlations that don’t carry information).

Consider a matrix $X \in R^{T \times N}$ with:

$T$ observations (rows)
$N$ assets (columns)

Assume returns have mean $0$ and variance $σ^{2}$ . Compute the sample covariance matrix:

C = \frac{1}{T} X^{⊤} X

Each eigenvalue $λ$ represents how much variance is explained by a certain “direction” (or factor) in the data.

If the data is purely random, the eigenvalues follow a known distribution: the Marčenko–Pastur distribution.

The theorem tells us that the eigenvalues will lie between two limits:

λ_{-} = σ^{2} (1 - \sqrt{N / T})^{2}, \quad λ_{+} = σ^{2} (1 + \sqrt{N / T})^{2}

All eigenvalues between $λ_{-}$ and $λ_{+}$ are what we expect from random data: they are just noise.
Eigenvalues larger than $λ_{+}$ (or smaller than $λ_{-}$ ) may represent true correlations or factors in the data.
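As a quick sanity check, the sketch below simulates pure noise and compares the eigenvalues of its sample covariance with the Marčenko–Pastur band, using the square-root form of the bounds, $σ^{2} (1 \pm \sqrt{N / T})^{2}$ . The sample sizes, the seed and the 10% tolerance are arbitrary choices made for illustration, and `mp_bounds` is a hypothetical helper name.

```python
import numpy as np

def mp_bounds(sigma2, n_assets, n_obs):
    """Marchenko-Pastur eigenvalue bounds for i.i.d. noise with variance sigma2."""
    ratio = np.sqrt(n_assets / n_obs)
    return sigma2 * (1 - ratio) ** 2, sigma2 * (1 + ratio) ** 2

rng = np.random.default_rng(0)
T, N = 2000, 200                         # hypothetical sample sizes
X = rng.standard_normal((T, N))          # pure noise: mean 0, variance 1
C = X.T @ X / T                          # sample covariance matrix
eigvals = np.linalg.eigvalsh(C)

lam_minus, lam_plus = mp_bounds(1.0, N, T)
# Share of eigenvalues inside the band (10% tolerance for finite-sample effects)
share_inside = np.mean((eigvals >= 0.9 * lam_minus) & (eigvals <= 1.1 * lam_plus))
print(lam_minus, lam_plus, share_inside)
```

With real returns, eigenvalues that land clearly above the upper bound are the candidates for genuine factors.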

The basic Marčenko–Pastur theorem assumes that all eigenvalues are random.

In real financial data this assumption does not hold: some eigenvalues capture true, non-random structure, such as the market factor (the first principal component, representing systematic market risk), sector effects, or other common risk factors.

The Solution: Laloux Adjustment

Laloux proposed an approach to adjust the Marčenko–Pastur model to account for the presence of non-random eigenvalues.

Identify non-random eigenvalues, typically the largest eigenvalues (e.g., the first one representing the market mode).
Remove their contribution from the variance
- Since these are not random, they should not increase the estimated noise level.
- Adjust the variance, where $λ_{max}$ is the largest eigenvalue (assumed non-random):
$σ_{adjusted}^{2} = σ^{2} (1 - \frac{λ_{max}}{N})$
Recompute Marčenko–Pastur bounds
- Use the adjusted variance to compute new limits:
$λ_{-} = σ_{adjusted}^{2} (1 - \sqrt{\frac{N}{T}})^{2}, λ_{+} = σ_{adjusted}^{2} (1 + \sqrt{\frac{N}{T}})^{2}$
Fit the theoretical PDF
- Fit the Marčenko–Pastur distribution with the updated data to estimate how much variance is explained by random noise.

Denoising

When you build a correlation matrix from asset returns, a lot of its eigenvalues come from noise.

If you use this noisy matrix directly (e.g. in portfolio optimization), the results are unstable and misleading.

So, the idea is to denoise the matrix: keep the information from the meaningful eigenvalues (the big ones above the Marčenko–Pastur threshold) and replace the noisy ones with something cleaner.
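A minimal sketch of this idea follows. It assumes one specific choice for "something cleaner": every eigenvalue below the Marčenko–Pastur upper bound is replaced by the average of those eigenvalues, which keeps the trace unchanged. The function name `denoise_corr`, the simulated one-factor returns and the sample sizes are illustrative assumptions, not part of the notes.

```python
import numpy as np

def denoise_corr(corr, n_obs):
    """Replace eigenvalues below the MP upper bound by their average (assumed rule)."""
    n_assets = corr.shape[0]
    lam_plus = (1 + np.sqrt(n_assets / n_obs)) ** 2   # sigma^2 = 1 for a correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)           # eigenvalues in ascending order
    noisy = eigvals <= lam_plus
    cleaned = eigvals.copy()
    if noisy.any():
        cleaned[noisy] = eigvals[noisy].mean()        # preserves the trace
    denoised = eigvecs @ np.diag(cleaned) @ eigvecs.T
    d = np.sqrt(np.diag(denoised))
    return denoised / np.outer(d, d)                  # rescale to unit diagonal

# Hypothetical data: one common "market" factor plus idiosyncratic noise
rng = np.random.default_rng(1)
T, N = 500, 50
market = rng.standard_normal((T, 1))
returns = 0.5 * market + rng.standard_normal((T, N))
corr = np.corrcoef(returns, rowvar=False)
corr_dn = denoise_corr(corr, T)
print(np.linalg.cond(corr), np.linalg.cond(corr_dn))
```

The denoised matrix should have a smaller condition number, which is what makes it safer to invert in an optimizer.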

1. Decomposition of the denoised correlation matrix

Start from the denoised correlation matrix $C_{1}$ with its eigen-decomposition:

C_{1} = W Λ W^{⊤},

where $W = [w_{1}, \dots, w_{N}]$ (orthonormal eigenvectors) and $Λ = diag (λ_{1}, \dots, λ_{N})$ .

Partition the eigenpairs into market components (hopefully just one; typically the largest eigenvalue(s)) and non-market (detoned) components.

W = [W_{M} ∣ W_{D}],

so that

C_{1} = W_{M} Λ_{M} W_{M}^{⊤} + W_{D} Λ_{D} W_{D}^{⊤} .

2. Remove the market (detoning)

Subtract the market part to get the detoned correlation matrix $C_{2}$ :

C_{2} = C_{1} - W_{M} Λ_{M} W_{M}^{⊤} = W_{D} Λ_{D} W_{D}^{⊤} .

Because we removed at least one eigenvector, $C_{2}$ is singular (it has at least one zero eigenvalue).

Rescale to enforce unit diagonal if you need a correlation matrix:

C_{2} \to diag (C_{2})^{- 1/2} C_{2} diag (C_{2})^{- 1/2} .
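The two steps above (subtract the market component, then rescale to a unit diagonal) can be sketched as follows. The helper name `detone_corr` and the toy 3-asset matrix with constant correlation 0.5 are assumptions chosen so the result is easy to check by hand.

```python
import numpy as np

def detone_corr(corr, n_market=1):
    """Remove the n_market largest eigen-components and rescale to unit diagonal."""
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]                 # sort eigenpairs in descending order
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    w_m, lam_m = eigvecs[:, :n_market], eigvals[:n_market]
    detoned = corr - w_m @ np.diag(lam_m) @ w_m.T     # C2 = C1 - W_M L_M W_M^T
    d = np.sqrt(np.diag(detoned))
    return detoned / np.outer(d, d)

# Toy example: three assets with pairwise correlation 0.5
corr = np.full((3, 3), 0.5)
np.fill_diagonal(corr, 1.0)
corr_detoned = detone_corr(corr)
smallest_eig = np.linalg.eigvalsh(corr_detoned)[0]
print(corr_detoned)
print(smallest_eig)
```

In this toy case the market eigenvector is the equal-weight direction, so after detoning every off-diagonal entry becomes $-0.5$ and the matrix has a zero eigenvalue, as expected for a singular $C_{2}$ .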

3. Portfolio optimization in the reduced (principal-component) space

You cannot directly invert $C_{2}$ . Instead optimize on the non-zero principal components (the detoned subspace) and then map allocations back to the original assets.

Write a portfolio as a linear combination of the surviving eigenvectors:

ω = W_{+} f,

where $W_{+}$ is the matrix of eigenvectors that survived detoning (i.e. the columns of $W_{D}$ ), and $f$ is the vector of allocations on those principal components.

Take the classic mean–variance objective (risk aversion $γ$ ):

\max_{ω} \{ μ^{⊤} ω - \frac{γ}{2} ω^{⊤} C_{1} ω \} .

Substitute $ω = W_{+} f$ . Using orthonormality and the block decomposition, the objective in $f$ becomes

\max_{f} \{ μ^{⊤} W_{+} f - \frac{γ}{2} f^{⊤} Λ_{+} f \},

where $Λ_{+}$ is the diagonal matrix of the non-zero eigenvalues (the same as $Λ_{D})$ .

The first-order optimality condition yields the closed form

f^{*} = \frac{1}{γ} Λ_{+}^{- 1} W_{+}^{⊤} μ .

Finally map back to original asset weights:

ω^{*} = W_{+} f^{*} = \frac{1}{γ} W_{+} Λ_{+}^{- 1} W_{+}^{⊤} μ

Note: because the columns of $W_{+}$ are orthonormal, $W_{+}^{⊤}$ is the left-inverse of $W_{+}$ . This expression avoids inverting the full (singular) $C_{2}$ and only requires inverting the diagonal $Λ_{+}$ .
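The closed form $ω^{*} = \frac{1}{γ} W_{+} Λ_{+}^{- 1} W_{+}^{⊤} μ$ can be sketched in a few lines. The function name, the toy correlation matrix and the expected-return vector are illustrative assumptions; the market component is taken to be the single largest eigenpair.

```python
import numpy as np

def reduced_space_weights(corr, mu, gamma=1.0, n_market=1):
    """Mean-variance weights restricted to the eigenvectors that survive detoning."""
    eigvals, eigvecs = np.linalg.eigh(corr)
    order = np.argsort(eigvals)[::-1]                 # descending eigenvalues
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    w_plus = eigvecs[:, n_market:]                    # surviving eigenvectors W_+
    lam_plus = eigvals[n_market:]                     # surviving eigenvalues (diagonal of L_+)
    f_star = (w_plus.T @ mu) / (gamma * lam_plus)     # allocations to eigenportfolios
    return w_plus @ f_star, f_star                    # map back to asset weights

# Toy inputs: constant correlation 0.5 and hypothetical expected returns
corr = np.full((3, 3), 0.5)
np.fill_diagonal(corr, 1.0)
mu = np.array([0.02, 0.01, 0.03])
omega, f = reduced_space_weights(corr, mu, gamma=1.0)
print(omega)
```

Because the market direction here is the equal-weight vector, the resulting weights sum to zero: the portfolio has no exposure to the removed component.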

What is $f$ ?

Each component $f_{i}$ tells you how much you are investing in eigenportfolio $i$ (the eigenportfolio built from eigenvector $w_{i}$ ).

After detoning (removing market eigenvectors), we only keep the “surviving” eigenvectors $W_{+}$ . Instead of trying to optimize in the full asset space, we solve for $f$ in the reduced space:

f^{*} = \frac{1}{γ} Λ_{+}^{- 1} W_{+}^{⊤} μ

Then we map back to original asset weights:

ω^{*} = W_{+} f^{*}

This way we only allocate risk to components that carry non-noise information.

Shrinkage

Shrinkage is a technique to improve the estimation of covariance (or correlation) matrices when data is noisy or limited.

Empirical covariance matrices, estimated from historical returns, are often unstable or ill-conditioned (especially if you have many assets and few observations). So, instead of using the raw covariance matrix $C_{empirical}$ , combine it with a target $C_{target}$ that is more stable (like the identity matrix or an average correlation matrix).

Formally:

C_{shrunk} = α C_{target} + (1 - α) C_{empirical}

$α \in [0, 1]$ = shrinkage intensity.

$α = 0$ : no shrinkage (use raw covariance)
$α = 1$ : full shrinkage (use only the target)
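A minimal sketch of the shrinkage formula is shown below. The default target (a scaled identity carrying the average variance) is an assumption made for illustration, as are the function name and the simulated data.

```python
import numpy as np

def shrink_cov(cov_empirical, alpha, target=None):
    """Convex combination of a stable target and the empirical covariance."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    n = cov_empirical.shape[0]
    if target is None:
        # Assumed default target: identity scaled by the average variance
        target = np.eye(n) * np.trace(cov_empirical) / n
    return alpha * target + (1.0 - alpha) * cov_empirical

# Hypothetical setting: 15 assets but only 20 observations
rng = np.random.default_rng(2)
returns = rng.standard_normal((20, 15))
cov = np.cov(returns, rowvar=False)
cov_shrunk = shrink_cov(cov, alpha=0.5)
print(np.linalg.cond(cov), np.linalg.cond(cov_shrunk))
```

Shrinking toward a scaled identity pulls all eigenvalues toward their mean, so the condition number can only improve.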

3. Distance Metrics

Correlation

Correlation $ρ [X, Y]$ measures linear codependence but is not a metric (fails nonnegativity and triangle inequality).

We therefore need a metric that gives a geometry/topology to compare and cluster data, so define:

d_{ρ} [X, Y] = \sqrt{\frac{1}{2} (1 - ρ [X, Y])}

To see that $d_{ρ}$ is a metric, standardize $X, Y$ into zero-mean, unit-variance vectors $x, y$ .

If we look at the Euclidean distance:

d [x, y] = \sqrt{\sum_{t = 1}^{T} (x_{t} - y_{t})^{2}}

it expands to:

d [x, y] = \sqrt{2 T (1 - ρ [x, y])} = \sqrt{4 T} \, d_{ρ} [X, Y]

So $d_{ρ}$ is proportional to a Euclidean distance, hence inherits metric properties.

$d_{ρ} \in [0, 1]$ since $ρ \in [- 1, 1]$ .

Positively correlated variables are close together, while negatively correlated variables are far apart.
This makes sense in long-only portfolios (negatively correlated assets act as diversifiers).

Absolute Correlation

Sometimes, in long-short portfolios, we want negatively correlated assets to be considered similar.

So we can define a distance based on absolute correlation:

d_{∣ ρ ∣} [X, Y] = \sqrt{1 - ∣ ρ [X, Y] ∣}

This is still a metric, but highly negatively correlated assets now have a small distance.
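The sketch below computes both distances in their square-root form, $\sqrt{\frac{1}{2} (1 - ρ)}$ and $\sqrt{1 - ∣ ρ ∣}$ , and checks numerically that the first one is proportional to the Euclidean distance between standardized series. The simulated series and the function names are assumptions for illustration.

```python
import numpy as np

def corr_distance(x, y):
    """Correlation distance: sqrt(0.5 * (1 - rho))."""
    rho = np.corrcoef(x, y)[0, 1]
    return np.sqrt(max(0.0, 0.5 * (1.0 - rho)))       # clip tiny negative rounding errors

def abs_corr_distance(x, y):
    """Absolute-correlation distance: sqrt(1 - |rho|)."""
    rho = np.corrcoef(x, y)[0, 1]
    return np.sqrt(max(0.0, 1.0 - abs(rho)))

rng = np.random.default_rng(3)
T = 250
x = rng.standard_normal(T)
y = 0.6 * x + rng.standard_normal(T)                  # hypothetical correlated series

# Standardize with the population standard deviation (ddof=0)
xs = (x - x.mean()) / x.std()
ys = (y - y.mean()) / y.std()
euclid = np.sqrt(np.sum((xs - ys) ** 2))
d_rho = corr_distance(x, y)
print(euclid, np.sqrt(4 * T) * d_rho)                 # the two numbers should agree
```

For a perfectly negatively correlated pair, `corr_distance` returns its maximum of 1 while `abs_corr_distance` returns approximately 0, which is the long-only versus long-short distinction described above.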

Problems of Correlation

Correlation quantifies only the linear codependency of two assets.
It is highly influenced by outliers.
Its application beyond the multivariate Normal case is questionable (in the Gaussian world correlation is a clean, sufficient measure of dependence; outside it, correlation can be unstable, incomplete, or misleading).

So it is better to introduce other measures.

Entropy of a Discrete Random Variable

Shannon entropy measures the average amount of information needed to describe the outcome of a random variable. It can also be read as a measure of uncertainty or diversification.

For random variable $X$ with support $S_{X}$ and pmf $p (x)$ :

H (X) = - \sum_{x \in S_{X}} p (x) \log p (x)

with the convention that $0 \log 0 := 0$ (limit argument).

If event $x$ has probability $p (x)$ , its “surprise” grows with $\frac{1}{p (x)}$ , so low probability brings high surprise.

Entropy can be interpreted as the expected surprise, $E [\log \frac{1}{p (X)}]$ : it's the average uncertainty in $X$ .

When all probability mass is on a single outcome, there's no uncertainty and $H (X) = 0$ . The maximum entropy is reached when $X$ is uniform over its support, in which case

H (X) = \log ∣ S_{X} ∣
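The two extreme cases (a degenerate distribution and a uniform one) are easy to verify numerically. This is a small sketch using natural logarithms; the function name `entropy` and the example pmfs are illustrative.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in nats, with the convention 0 * log(0) = 0."""
    p = np.asarray(p, dtype=float)
    nonzero = p > 0                                   # skip zero-probability outcomes
    return float(-np.sum(p[nonzero] * np.log(p[nonzero])))

h_certain = entropy([1.0, 0.0, 0.0])                  # all mass on one outcome
h_uniform = entropy([0.25, 0.25, 0.25, 0.25])         # uniform over 4 outcomes
h_skewed = entropy([0.7, 0.1, 0.1, 0.1])
print(h_certain, h_uniform, h_skewed)
```

Any non-uniform pmf on the same support, such as the skewed one above, falls strictly between the two extremes.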

Joint Entropy

For random variables $X$ , $Y$ with joint pmf $p (x, y)$ :

H (X, Y) = - \sum_{x \in S_{X}} \sum_{y \in S_{Y}} p (x, y) \log p (x, y)

It measures the uncertainty of the pair $(X, Y)$ considered together.

Properties:

$H (X, Y) = H (Y, X)$
$H (X, X) = H (X)$
$H (X, Y) \geq \max \{H (X), H (Y)\}$
$H (X, Y) \leq H (X) + H (Y)$

(Equality in the last bound holds if $X$ and $Y$ are independent.)

Conditional Entropy

The conditional entropy of $X$ given $Y$ is:

H (X ∣ Y) = \sum_{y \in S_{Y}} p (y) \sum_{x \in S_{X}} p (x ∣ y) \log \frac{1}{p (x ∣ y)} = - \sum_{y \in S_{Y}} p (y) \sum_{x \in S_{X}} p (x ∣ y) \log p (x ∣ y)

$H (X ∣ Y)$ measures the remaining uncertainty about X once Y is known.

Conditional entropy is always ≤ entropy:

H (X ∣ Y) \leq H (X)

Knowing $Y$ can never increase the uncertainty about X.
Equality occurs if $X$ and $Y$ are independent ( $Y$ gives no info about $X$ ).
Entropy of a variable given itself:

H (X ∣ X) = 0

If you already know X, there’s no uncertainty left.

Chain rule for entropy:

H (X, Y) = H (Y) + H (X ∣ Y) = H (X) + H (Y ∣ X)

Total uncertainty about $(X, Y)$ can be decomposed into uncertainty of $Y$ and residual uncertainty of $X$ after knowing $Y$ .

Kullback–Leibler Divergence

For two discrete probability distributions $p$ and $q$ defined on the same set $S_{X}$ :

D_{K L} (p ∥ q) = \sum_{x \in S_{X}} p (x) \log \frac{p (x)}{q (x)}

$p (x)$ = true probability of x
$q (x)$ = reference or approximate probability of $x$
Condition: $q (x) = 0 \Rightarrow p (x) = 0$ , to avoid division by zero.
KL divergence measures how much $p$ diverges from $q$ .
It quantifies the “extra information” required if we encode data using $q$ instead of the true distribution $p$ .

Properties

$D_{K L} (p ∥ q) \geq 0$ , with equality only when $p = q$ .
$D_{K L} (p ∥ q) \neq D_{K L} (q ∥ p)$ in general (order matters, so KL divergence is not a metric).
Triangle inequality fails, so KL divergence does not satisfy all properties of a distance metric.
If $q (x) = 1/∣ S_{X} ∣$ (uniform distribution), $D_{K L} (p ∥ q)$ measures how far p is from uniform.

Cross Entropy

For two discrete probability distributions $p$ (true distribution) and $q$ (approximation or model) defined on the same set $S_{X}$ :

H (p, q) = - \sum_{x \in S_{X}} p (x) \log q (x)

$p (x)$ = true probability of outcome $x$
$q (x)$ = probability assigned by a model or reference distribution
It measures the uncertainty of $X$ when using $q$ instead of the true $p$ .

Entropy: $H (p) = - \sum_{x} p (x) \log p (x)$ measures uncertainty using the true distribution.

Cross-entropy: $H (p, q) = H (p) + D_{K L} (p ∥ q)$

Cross-entropy is always greater than or equal to true entropy.
The difference $H (p, q) - H (p) = D_{K L} (p ∥ q)$ measures how “wrong” $q$ is.
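The identity $H (p, q) = H (p) + D_{K L} (p ∥ q)$ and the asymmetry of KL divergence can be checked on a small example. The two pmfs below are arbitrary, and the sketch assumes $q (x) > 0$ wherever $p (x) > 0$ .

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

def kl_divergence(p, q):
    """D_KL(p || q) in nats; terms with p(x) = 0 contribute nothing."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return float(np.sum(p[nz] * np.log(p[nz] / q[nz])))

def cross_entropy(p, q):
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(q[nz])))

p = [0.7, 0.2, 0.1]                                   # "true" distribution (hypothetical)
q = [0.5, 0.3, 0.2]                                   # model distribution (hypothetical)
print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))
print(kl_divergence(p, q), kl_divergence(q, p))       # not equal: order matters
```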

Mutual Information

Mutual information measures how much knowing one variable reduces uncertainty about another.

I (X; Y) = H (X) - H (X ∣ Y)

$H (X)$ = entropy of $X$ (uncertainty about $X$ )
$H (X ∣ Y)$ = conditional entropy of $X$ given $Y$

Equivalently:

I (X; Y) = \sum_{x \in S_{X}} \sum_{y \in S_{Y}} p (x, y) \log \frac{p (x, y)}{p (x) p (y)}

Properties

Non-negative: $I (X; Y) \geq 0$ , with $I (X; Y) = 0$ if $X$ and $Y$ are independent (knowing $Y$ gives no info about $X$ )
Symmetric: $I (X; Y) = I (Y; X)$
Upper bound: $I (X; Y) \leq \min \{H (X), H (Y)\}$
Not a metric: does not satisfy the triangle inequality
Grouping / decomposition: $I (X; Y; Z) = I (X; Y) + I ((X, Y); Z)$

Mutual information can be expressed as a KL divergence:

I (X; Y) = D_{K L} (p (x, y) ∥ p (x) p (y))

It measures how far the joint distribution is from independence.
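As an illustration, mutual information can be computed directly from a joint pmf table. The sketch below uses two hypothetical 2×2 joint distributions: one independent (mutual information of zero) and one perfectly dependent (mutual information equal to $H (X) = \log 2$ ).

```python
import numpy as np

def mutual_information(joint):
    """I(X;Y) in nats from a joint pmf given as a 2-D array (rows: x, columns: y)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)             # marginal of X, shape (n, 1)
    py = joint.sum(axis=0, keepdims=True)             # marginal of Y, shape (1, m)
    independent = px @ py                             # p(x) p(y) under independence
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / independent[nz])))

joint_indep = np.outer([0.3, 0.7], [0.4, 0.6])        # X and Y independent
joint_dep = np.array([[0.5, 0.0], [0.0, 0.5]])        # Y is a copy of X
mi_indep = mutual_information(joint_indep)
mi_dep = mutual_information(joint_dep)
print(mi_indep, mi_dep)
```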

4. Optimal Clustering

Clustering groups objects (items) by similarity using their features. The goal is to produce clusters where intra-cluster similarity is high and inter-cluster similarity is low. Clustering is a form of unsupervised learning because we do not provide labelled examples.

Proximity matrix

A proximity matrix measures how "close" or "similar" objects in a dataset are to each other.

For a dataset with $n$ elements, the proximity matrix $P$ is an $n \times n$ matrix.
Each entry $P [i, j]$ represents the closeness between object $i$ and object $j$ .

Typical definitions:

$P [i, j] = similarity (x_{i}, x_{j})$
$P [i, j] = distance (x_{i}, x_{j})$

A symmetric proximity matrix with zeros on the diagonal is common when using pairwise distances.

Two main families of clustering:

Hierarchical clustering

Produces a nested sequence of partitions (a dendrogram).
Top: a single cluster containing all points. Bottom: each point alone.
Two approaches:
- Agglomerative (bottom-up): start with each point in its own cluster and merge.
- Divisive (top-down): start with one cluster and split.
No need to specify $K$ up-front (but you choose a cut in the dendrogram to get $K$ clusters).

Partitional clustering

Partitions the dataset into $K$ disjoint clusters.
Each object belongs to exactly one cluster.
Examples: k-means, k-medoids, spectral clustering.
Often requires the number of clusters $K$ as input.

Practical issues

Curse of dimensionality: when the number of features is large relative to observations, distances become less informative and clustering can degrade. Dimensionality reduction (PCA, t-SNE, UMAP) or feature selection helps.
Initialization randomness: algorithms like k-means are sensitive to initial centroids.

Because partitional methods require $K$ , several methods exist to estimate a good $K$ :

Elbow method

Compute within-cluster sum of squared errors (WCSS) for different $K$ .
Plot WCSS vs $K$ . The "elbow" point (where marginal gain drops) is a reasonable $K$ .

Silhouette analysis

Measures how well each point lies within its cluster compared to the nearest other cluster.

For an observation $i$ :

Let $a_{i}$ = average distance from $i$ to all other points in the same cluster (intra-cluster distance).
Let $b_{i}$ = minimum over other clusters of the average distance from $i$ to points in that other cluster (nearest inter-cluster distance).

The silhouette coefficient for $i$ is

S_{i} = \frac{b_{i} - a_{i}}{\max (a_{i}, b_{i})}

$S_{i} \approx 1$ : well clustered (close to own cluster, far from others).
$S_{i} \approx 0$ : on or near cluster boundary.
$S_{i} < 0$ : likely misclassified (closer to another cluster).

The average silhouette score across all points is a scalar summary used to compare different $K$ : higher is better.
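The silhouette coefficient can be computed by hand on a tiny dataset. The sketch below uses Euclidean distances, assigns a score of 0 to singleton clusters (an assumed convention), and compares a sensible labeling of four hypothetical points with a deliberately bad one.

```python
import numpy as np

def silhouette_scores(points, labels):
    """Per-point silhouette coefficients using Euclidean distance."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = np.zeros(len(points))
    for i in range(len(points)):
        same = labels == labels[i]
        same[i] = False                               # exclude the point itself
        if not same.any():
            continue                                  # singleton cluster: score stays 0
        a = dist[i, same].mean()                      # mean intra-cluster distance
        b = min(dist[i, labels == c].mean()           # nearest other cluster
                for c in np.unique(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores

points = [[0, 0], [0, 1], [10, 0], [10, 1]]           # two obvious groups
good = silhouette_scores(points, [0, 0, 1, 1])
bad = silhouette_scores(points, [0, 1, 0, 1])
print(good.mean(), bad.mean())
```

The good labeling scores close to 1, while the bad labeling produces negative coefficients because every point is closer to the other cluster than to its own.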

Practical checklist for clustering

Preprocess
- Standardize / normalize features when distances are used.
- Handle missing values.
- Reduce dimensionality if needed.
Choose distance / similarity
- Euclidean for continuous, cosine for directional data, Jaccard for binary sets, etc.
Choose algorithm
- Hierarchical if you want a dendrogram and no fixed $K$ .
- K-means/k-medoids for large datasets with spherical-ish clusters.
Select $K$
- Use elbow, silhouette, gap statistic, or domain knowledge.
Validate
- Visual inspection (2D/3D embedding).
- Cluster compactness and separation metrics (silhouette, Davies–Bouldin).
- If possible, external validation against known labels.

Once we run a clustering algorithm (e.g., k-means), we still need to evaluate how good the clustering result is.

Quality Measure $q$

Define the overall clustering quality:

q = \frac{E [S_{i}]}{\sqrt{V [S_{i}]}}

Where:

$E [S_{i}]$ = mean of silhouette scores (average clustering quality).
$V [S_{i}]$ = variance of silhouette scores (consistency across points).

Higher $q$ means better clustering (high average silhouette, low variability).

Base Clustering Algorithm

At the base level, the procedure works as follows:

Input
- Start from the observation (distance/similarity) matrix.
Outer Loop (over $K$ )
- For $k = 2, 3, \dots, N$ :
  - Run k-means clustering with $k$ clusters.
  - Compute the quality measure $q$ from silhouette scores.
  - Store results for that $k$ .
Inner Loop (over initializations)
- Repeat the above with different random initializations of centroids.
- This reduces sensitivity to random starts.
Selection
- Choose the clustering result with the highest quality $q$ .
- This yields:
  - Optimal number of clusters $K$ .
  - Best initialization (most stable clustering result).

Even after selecting the best $K$ and initialization, some clusters can still be poor quality.

Reclustering Low-Quality Clusters

To refine the clustering, we recluster only the bad clusters.

Start with Base Clustering Result
- We already have $K$ clusters from the base algorithm.
- For each cluster $k$ , compute its quality score $q_{k}$ (mean silhouette score of its members).
Identify Poor-Quality Clusters
- Compute overall average quality: $\bar{q} = \frac{1}{K} \sum_{k = 1}^{K} q_{k}$
- Identify clusters with $q_{k} < \bar{q}$ → “below-average quality.”
- Let $K_{1}$ = number of such clusters.
Decision Rule Based on $K_{1}$

If $K_{1} \leq 1$ :
- No meaningful reclustering, so return base clustering.
If $K_{1} \geq 2$ :
- Recluster only those $K_{1}$ clusters:
  - Build a new observation matrix for elements of these clusters.
  - Rerun the base algorithm (search over $K$ and initializations) on this reduced subset.

Check Improvement
- Compare average quality before vs. after reclustering.
- If quality improved:
  - Merge the unchanged good clusters with the newly reclustered clusters.
- Else:
  - Keep the original base clustering result.

5. Financial Labels

Supervised learning aims to predict an output $y$ given features $X$ .

Types of Supervised Learning

Regression problems
- Predict a continuous target (real numbers).
- Examples drawn from infinite population (countable like integers or uncountable like reals).
Classification problems
- Predict a discrete label from a finite set (e.g., $\{0, 1\}$ , $\{\text{bearish}, \text{bullish}\}$ ).

Fixed-Horizon Labeling

We have a feature matrix:

$X = \{X_{i}\}_{i = 1}^{I}$
For each observation $X_{i}$ :
- Let $t_{i, 0}$ = index of bar where features are sampled.
- Define a fixed horizon $h$ .
- Compute return over horizon $h$ :

r_{t_{i, 0}; t_{i, 1}} = \frac{p_{t_{i, 1}}}{p_{t_{i, 0}}} - 1, \quad \text{where } t_{i, 1} = t_{i, 0} + h

Assign label $y_{i} \in \{- 1, 0, 1\}$ based on return:

y_{i} = \begin{cases} - 1 & \text{if } r_{t_{i, 0}; t_{i, 1}} < - τ \\ 0 & \text{if } ∣ r_{t_{i, 0}; t_{i, 1}} ∣ \leq τ \\ + 1 & \text{if } r_{t_{i, 0}; t_{i, 1}} > τ \end{cases}

Where:

$τ$ = constant return threshold.
Time bars = bars sampled at regular time intervals.
Result: fixed time horizons for all samples.
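A minimal sketch of fixed-horizon labeling on a price series follows. The prices, the horizon and the threshold are made-up numbers for illustration only.

```python
import numpy as np

def fixed_horizon_labels(prices, h, tau):
    """Label each bar by the sign of its h-bar-ahead return relative to threshold tau."""
    prices = np.asarray(prices, dtype=float)
    returns = prices[h:] / prices[:-h] - 1.0          # r over [t, t + h]
    labels = np.zeros(len(returns), dtype=int)        # default label: 0
    labels[returns > tau] = 1
    labels[returns < -tau] = -1
    return labels, returns

prices = [100.0, 101.0, 103.0, 102.0, 99.0, 99.5]     # hypothetical bar closes
labels, returns = fixed_horizon_labels(prices, h=2, tau=0.01)
print(labels.tolist())                                # [1, 0, -1, -1]
```

The last $h$ bars receive no label because their horizon extends beyond the available data.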

Concerns with Fixed-Horizon Labeling

Returns show intraday volatility patterns (open/close more volatile).
A constant $τ$ transfers this seasonality into the labels, so the label distribution becomes non-stationary.

Solution: use tick/volume/dollar bars instead of time bars, or standardize returns using a volatility estimate:

z_{t_{i, 0}; t_{i, 1}} = \frac{r _{t_{i, 0}; t_{i, 1}}}{σ _{t_{i, 0}; t_{i, 1}}}

where $σ_{t_{i, 0}; t_{i, 1}}$ = predicted volatility. Then apply labeling rule to $z$ -scores.

Triple-Barrier Method

Labels should represent actual trading outcomes: profit, loss, or timeout.

This method sets three barriers:

Upper Horizontal Barrier: Profit-taking (success).
Lower Horizontal Barrier: Stop-loss (failure).
Vertical Barrier: Maximum holding period (timeout).

Step-by-Step Procedure

Start at $t_{i, 0}$ (where features $X_{i}$ observed).
Monitor price path forward until one barrier is hit.
Assign label $y_{i}$ :
- If upper barrier reached first: $y_{i} = + 1$ .
- If lower barrier reached first: $y_{i} = - 1$ .
- If vertical barrier reached first:
  - Option 1: $y_{i} = 0$ (neutral).
  - Option 2: $y_{i} = \operatorname{sgn} (r_{t_{i, 0}; t_{i, 1}})$ (final return sign).
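The procedure above can be sketched for a single observation as follows. Barriers are expressed here as fixed return levels relative to the entry price, which is a simplifying assumption (in practice they are often scaled by volatility); the function name, parameters and price path are hypothetical.

```python
def triple_barrier_label(prices, start, upper, lower, max_holding, timeout_sign=False):
    """Return +1, -1 or 0 depending on which barrier is touched first."""
    p0 = prices[start]
    end = min(start + max_holding, len(prices) - 1)   # vertical barrier
    for t in range(start + 1, end + 1):
        ret = prices[t] / p0 - 1.0
        if ret >= upper:                              # profit-taking barrier hit first
            return 1
        if ret <= -lower:                             # stop-loss barrier hit first
            return -1
    if timeout_sign:                                  # Option 2: sign of the final return
        final = prices[end] / p0 - 1.0
        return (final > 0) - (final < 0)
    return 0                                          # Option 1: neutral on timeout

path = [100.0, 101.0, 99.0, 103.0, 97.0]              # hypothetical price path
label_profit = triple_barrier_label(path, 0, upper=0.02, lower=0.02, max_holding=4)
label_stop = triple_barrier_label(path, 0, upper=0.02, lower=0.005, max_holding=4)
label_timeout = triple_barrier_label(path, 0, upper=0.10, lower=0.10, max_holding=2)
label_timeout_sign = triple_barrier_label(path, 0, 0.10, 0.10, 2, timeout_sign=True)
print(label_profit, label_stop, label_timeout, label_timeout_sign)
```

The same path yields different labels depending on how tight the barriers are, which is why barrier widths are usually tied to recent volatility.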

Trend Scanning Method

Instead of setting a fixed horizon $h$ , profit-taking, or stop-loss levels, trend scanning lets trends run naturally until they end.

Label each observation as part of:

Uptrend: $y_{t} = + 1$
Downtrend: $y_{t} = - 1$
No trend: $y_{t} = 0$

Define a Trend

Fit a local linear regression over a window of $L$ observations:

x_{t + l} = β_{0} + β_{1} l + ε_{t + l}, l = 0, 1, \dots, L - 1

$β_{1}$ = slope, direction & strength of trend.
$ε_{t + l}$ = residuals (noise).

Test Statistical Significance

Compute t-statistic of the slope:

\hat{t}_{β_{1}} = \frac{\hat{β}_{1}}{\hat{σ}_{\hat{β}_{1}}}

Large $∣ \hat{t}_{β_{1}} ∣$ : strong trend evidence.
Near 0: no clear trend.

Assign Labels

For each $t$ :

y_{t} = \begin{cases} + 1 & \text{if } \hat{t}_{β_{1}} > τ \text{ (Uptrend)} \\ - 1 & \text{if } \hat{t}_{β_{1}} < - τ \text{ (Downtrend)} \\ 0 & \text{if } ∣ \hat{t}_{β_{1}} ∣ \leq τ \text{ (No trend)} \end{cases}

Where $τ$ = critical t-value threshold.
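The regression, t-statistic and labeling rule can be sketched with plain least squares. The window length, the threshold $τ = 2$ and the deterministic test series are illustrative assumptions; a window with a perfect linear fit (zero residual variance) would need special handling that is omitted here.

```python
import numpy as np

def slope_t_stat(window):
    """t-statistic of the slope in x_{t+l} = b0 + b1 * l + eps."""
    y = np.asarray(window, dtype=float)
    L = len(y)
    X = np.column_stack([np.ones(L), np.arange(L)])   # intercept and time index
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (L - 2)                  # residual variance
    cov_beta = sigma2 * np.linalg.inv(X.T @ X)        # OLS covariance of the coefficients
    return beta[1] / np.sqrt(cov_beta[1, 1])

def trend_label(window, tau=2.0):
    t = slope_t_stat(window)
    if t > tau:
        return 1
    if t < -tau:
        return -1
    return 0

steps = np.arange(30)
wiggle = np.where(steps % 2 == 0, 1.0, -1.0)          # deterministic "noise"
up = trend_label(0.5 * steps + wiggle)
down = trend_label(-0.5 * steps + wiggle)
flat = trend_label(wiggle)
print(up, down, flat)                                 # 1 -1 0
```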

Meta-Labeling

Meta-labeling is a two-model approach:

Primary model: predicts position direction (long/short).
Secondary (meta) model: predicts whether that position will be profitable (filtering false positives).

The goal is to reduce exposure to losing trades and size positions proportionally to predicted success probability.

Expected Value & Risk

Let $p$ = probability of profit and $π$ = payoff magnitude.

The expected value is:

μ = p π + (1 - p) (- π) = π (2 p - 1)

$p > 0.5 \Rightarrow μ > 0$ : positive expected profit.
$p = 0.5 \Rightarrow μ = 0$ : no edge.

Expected Variance

σ^{2} = 4 π^{2} p (1 - p)

Highest when $p = 0.5$ (most uncertain).
Shrinks as $p \to 0$ or $1$ (certainty increases).

Sharpe Ratio

Standardize risk/return with Sharpe-like measure:

z = \frac{μ}{σ} = \frac{π (2 p - 1)}{2 π \sqrt{p (1 - p)}} = \frac{2 p - 1}{2 \sqrt{p (1 - p)}}

$z > 0$ : attractive opportunity.
$z < 0$ : avoid trade.

Mapping to Position Size

Convert $z$ to position size $m \in [- 1, 1]$ :

m = 2 Z (z) - 1

Where $Z (\cdot)$ = CDF of standard normal.

$z = 0 \Rightarrow m = 0$ (no trade)
Large $z > 0 \Rightarrow m \to 1$ (max long)
Large $z < 0 \Rightarrow m \to - 1$ (max short)
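Putting the pieces together, a predicted probability $p$ maps to a position size through $z = \frac{2 p - 1}{2 \sqrt{p (1 - p)}}$ (that is, $μ / σ$ with $σ = 2 π \sqrt{p (1 - p)}$ ) and the standard normal CDF. The sketch below uses only the standard library and assumes $0 < p < 1$ ; the function names are hypothetical.

```python
import math

def sharpe_like_z(p):
    """z = mu / sigma for a symmetric payoff; requires 0 < p < 1."""
    return (2.0 * p - 1.0) / (2.0 * math.sqrt(p * (1.0 - p)))

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def bet_size(p):
    """Position size m in [-1, 1] from the probability of profit p."""
    return 2.0 * normal_cdf(sharpe_like_z(p)) - 1.0

for p in (0.4, 0.5, 0.6, 0.9):
    print(p, round(bet_size(p), 4))
```

The mapping is antisymmetric around $p = 0.5$ , so a probability of 0.4 produces the mirror image of the position taken at 0.6.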

Meta-Labeling for Position Size with Multiple Classifiers

Suppose we have $n$ meta-labeling classifiers:

y_{i} \in \{0, 1\}, \quad i = 1, \dots, n

Sum of predictions:

\sum_{i = 1}^{n} y_{i} \sim Binomial (n, p)

Mean: $E [\sum y_{i}] = n p$
Variance: $Var [\sum y_{i}] = n p (1 - p)$

As $n \to \infty$ , by the de Moivre–Laplace theorem:

\frac{\sum y_{i} - n p}{\sqrt{n p (1 - p)}} \xrightarrow{d} N (0, 1)

Thus, the average prediction is approximately normal:

\hat{p} = \frac{1}{n} \sum y_{i} \sim N (p, \frac{p (1 - p)}{n})

Hypothesis Test

Null hypothesis: $H_{0} : p = 0.5$ (no predictive edge)

t-statistic:

t = \frac{\hat{p} - 0.5}{\sqrt{\frac{\hat{p} (1 - \hat{p})}{n}}}

Follows Student-t distribution with $n - 1$ d.o.f.

Mapping t to Bet Size

Convert $t$ to bet size $m \in [- 1, 1]$ :

m = 2 T_{n - 1} (t) - 1

Where $T_{n - 1} (\cdot)$ = CDF of Student-t distribution.

$t = 0 \Rightarrow m = 0$ (no bet)
Large $t > 0 \Rightarrow m \to 1$ (max long)
Large $t < 0 \Rightarrow m \to - 1$ (max short)

In practice:

$\hat{p} \to 1$ : strong long conviction → large $m > 0$ .
$\hat{p} \to 0$ : strong short conviction → large $m < 0$ .
$\hat{p} \approx 0.5$ : no edge → $m \approx 0$ (no position).

6. Feature Importance Analysis

P-Values (Statistical Significance)

The p-value of a coefficient is the probability of observing an estimate at least as extreme as the one obtained if the true coefficient were $0$ .

It tests whether a feature has any effect, not its predictive power.

The p-value is a probability, but not the probability that your hypothesis is true. It answers a very specific question:

“If the null hypothesis were true, how likely is it to observe a result as extreme (or more extreme) than what I actually observed?”

Limitations

False positives/negatives if model assumptions fail.
Not robust with multicollinearity (correlated features).
Measures statistical relevance, not:
- Effect size
- Predictive power
- Causality

A low p-value doesn't mean a feature is “important”; it only means the effect is statistically detectable.

Mean Decrease Impurity (MDI) — Tree-Based Models

StatQuest has a clear video explanation of this topic.

At each node pick feature $X_{f}$ and threshold $τ$ to split:
- Left: $X_{f} < τ$
- Right: $X_{f} \geq τ$
Measure impurity with e.g.:
- Entropy: $H (t) = - \sum_{c} p_{c} \log_{2} p_{c}$
- Gini: $G (t) = 1 - \sum_{c} p_{c}^{2}$
Information gain (impurity reduction) for a split: $Δ g (t, f) = i (t) - \frac{N ( t _{0} )}{N ( t )} i (t_{0}) - \frac{N ( t _{1} )}{N ( t )} i (t_{1})$
Feature importance (MDI) for feature $f$ : sum of weighted $Δ g (t, f)$ over all nodes where $f$ was used.
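The impurity reduction of a single split can be computed directly from the formulas above. The sketch below is not a tree implementation: it only evaluates $Δ g$ for one feature and one threshold on a toy dataset, with hypothetical function names.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a node; an empty node is treated as pure."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

def entropy_bits(labels):
    """Entropy impurity (base 2) of a node."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def impurity_reduction(labels, feature, threshold, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the two children."""
    labels, feature = np.asarray(labels), np.asarray(feature)
    left = labels[feature < threshold]
    right = labels[feature >= threshold]
    n = len(labels)
    return (impurity(labels)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))

y = [0, 0, 0, 1, 1, 1]
x = [1, 2, 3, 4, 5, 6]
gain_perfect = impurity_reduction(y, x, 3.5)                   # separates the classes
gain_partial = impurity_reduction(y, x, 2.5)                   # leaves a mixed child
gain_entropy = impurity_reduction(y, x, 3.5, impurity=entropy_bits)
print(gain_perfect, gain_partial, gain_entropy)
```

A tree picks the feature and threshold with the largest reduction at each node; MDI then accumulates these reductions per feature.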

Mean Decrease Accuracy (MDA)

This algorithm measures out-of-sample importance.

Fit model and compute cross-validated performance (baseline).
For each feature:
- Shuffle that feature's values (breaks relationship with target).
- Recalculate cross-validated performance.
MDA = performance(before shuffle) − performance(after shuffle).

Interpretation

Large positive MDA means the feature is important.
MDA near zero means the feature is irrelevant.
Negative MDA means the feature is damaging performance (the model improved after shuffling).
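A simplified sketch of the shuffle-and-rescore idea is shown below. It departs from the description above in two labeled ways: it uses a single train/test split instead of cross-validation, and an ordinary least-squares model scored by $R^{2}$ instead of a classifier. The data-generating process and all names are assumptions made for illustration.

```python
import numpy as np

def r2_score(y, y_hat):
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def mda(X_train, y_train, X_test, y_test, rng, n_repeats=10):
    """Mean drop in out-of-sample R^2 after shuffling each feature in turn."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)          # fit once

    def predict(X):
        return np.column_stack([np.ones(len(X)), X]) @ coef

    baseline = r2_score(y_test, predict(X_test))
    importances = np.zeros(X_test.shape[1])
    for j in range(X_test.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X_test.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])        # break link with target
            drops.append(baseline - r2_score(y_test, predict(X_perm)))
        importances[j] = np.mean(drops)
    return importances

rng = np.random.default_rng(0)
X = rng.standard_normal((400, 2))
y = 3.0 * X[:, 0] + 0.5 * rng.standard_normal(400)             # only feature 0 matters
importances = mda(X[:200], y[:200], X[200:], y[200:], rng)
print(importances)
```

Feature 0 shows a large drop in performance when shuffled, while the irrelevant feature 1 stays near zero.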

Model Performance Metrics

Negative Average Likelihood (NegAL)

NegAL = \frac{1}{N} \sum_{n = 1}^{N} \sum_{k = 1}^{K} y_{n, k} p_{n, k}

where $y_{n, k} = 1$ if observation $n$ has true label $k$ , else 0; $p_{n, k}$ is predicted prob.

Range: $0 \leq NegAL \leq 1$ .
Higher is better (the closer to 1, the higher the probability the model assigns to true labels).

Probability-Weighted Accuracy (PWA)

PWA = \frac{1}{N} \sum_{n = 1}^{N} y_{n} p_{n}

where $y_{n} = 1$ if prediction correct else 0; $p_{n} = max_{k} p_{n, k}$ .

It weighs correct predictions by model confidence.
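Both scores, exactly as written in these notes, reduce to a couple of array operations. The labels and predicted probabilities below are made up; class labels are assumed to be integers indexing the columns of the probability matrix.

```python
import numpy as np

def neg_al(y_true, proba):
    """Average probability assigned to the true label (formula as given in these notes)."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba, dtype=float)
    return float(np.mean(proba[np.arange(len(y_true)), y_true]))

def pwa(y_true, proba):
    """Accuracy where each correct prediction counts for its predicted confidence."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba, dtype=float)
    predicted = proba.argmax(axis=1)
    confidence = proba.max(axis=1)
    return float(np.mean((predicted == y_true) * confidence))

y_true = [0, 1, 1]                                    # hypothetical true classes
proba = [[0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]          # hypothetical predicted probabilities
score_negal = neg_al(y_true, proba)                   # (0.8 + 0.7 + 0.4) / 3
score_pwa = pwa(y_true, proba)                        # (0.8 + 0.7 + 0.0) / 3
print(score_negal, score_pwa)
```

The third observation is misclassified with 60% confidence, so it contributes 0.4 to NegAL but nothing to PWA.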

Feature Clustering for Importance

With highly codependent features, individual importance scores are unreliable, so it's common to cluster features to analyze them better.

Represent features in a metric space
- Options:
  - Correlation-based distances (linear dependencies).
  - Information-theoretic metrics (e.g., Variation of Information) to capture nonlinear redundancies.
Cluster formation
- Apply an ONC (Optimal Number of Clusters) algorithm:
  - Finds number of clusters and assignments.
  - Partitional assignment: each feature belongs to exactly one cluster.
  - Within-cluster features share much information; across clusters share little.
Optional residualization
- For cluster $k$ , regress each feature $X_{n, i} \in D_{k}$ on features from earlier clusters $\cup_{l < k} D_{l}$ : $X_{n, i} = α_{i} + \sum_{j \in \cup_{l < k} D_{l}} β_{i, j} X_{n, j} + ε_{n, i}$
- Replace $X_{i}$ with residuals $\hat{ε}_{i}$ .
- If regressors are too many, reduce dimensionality inside clusters.

Clustered Importance

Apply MDI or MDA at the cluster level (not single-feature).
Interpret which groups of features drive model performance.

7. Portfolio Construction

Given $N$ assets, the expected excess returns $μ$ (an $N \times 1$ vector) and the covariance matrix $V$ ( $N \times N$ ), the goal is to find portfolio weights $ω$ that minimize portfolio variance.

Classical Mean–Variance Optimization (Variance Minimization)

\min_{ω} \frac{1}{2} ω^{⊤} V ω \quad \text{s.t.} \quad ω^{⊤} a = 1

The vector $a$ encodes the portfolio constraint (e.g., fully invested: $a = 1_{N}$ ).

Lagrangian:

L (ω, λ) = \frac{1}{2} ω^{⊤} V ω - λ (ω^{⊤} a - 1)

Optimal weights:

ω^{*} = \frac{V ^{- 1} a}{a ^{⊤} V ^{- 1} a}

Useful Special Cases

Equal weights under isotropic variance: $a = 1_{N}, V = σ^{2} I_{N}$ , $ω^{*} = \frac{1}{N} 1_{N}$
Inverse-variance: $a = 1_{N}, V$ diagonal, $ω_{n}^{*} \propto \frac{1}{V_{n, n}}$
Minimum-variance: $a = 1_{N}$ , $ω^{*} = \frac{V^{- 1} 1_{N}}{1_{N}^{⊤} V^{- 1} 1_{N}}$
Maximum Sharpe: $a = μ$ , $ω^{*} = \frac{V^{- 1} μ}{μ^{⊤} V^{- 1} μ}$ (rescaling the weights to sum to one gives $\frac{V^{- 1} μ}{1_{N}^{⊤} V^{- 1} μ}$ )
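The closed form $ω^{*} = \frac{V^{- 1} a}{a^{⊤} V^{- 1} a}$ and two of the special cases can be checked numerically. The covariance matrices below are toy inputs, and the sketch solves a linear system instead of forming $V^{- 1}$ explicitly.

```python
import numpy as np

def optimal_weights(V, a):
    """w* = V^{-1} a / (a^T V^{-1} a), computed via a linear solve."""
    x = np.linalg.solve(V, a)                         # x = V^{-1} a
    return x / (a @ x)

ones = np.ones(2)

# Isotropic variance: equal weights
w_equal = optimal_weights(0.04 * np.eye(2), ones)

# Diagonal covariance: weights proportional to inverse variances
w_inv_var = optimal_weights(np.diag([0.04, 0.01]), ones)

print(w_equal, w_inv_var)                             # [0.5 0.5] [0.2 0.8]
```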

Numerical Stability & Conditioning Problem

Let $C$ be the correlation matrix obtained by standardizing $V$ . The solution requires computing $V^{- 1}$ .

If assets are highly correlated ( $∣ ρ ∣ \approx 1$ ), then $C$ is ill-conditioned:

A large condition number means that $V^{- 1}$ is numerically unstable (small errors in the estimate of $V$ produce large swings in $V^{- 1}$ ). We need Markowitz most when correlations exist, but those same correlations make the solution unstable.

Practical fixes:

Shrinkage estimators (Ledoit–Wolf)
Factor models (lower-dimensional structure)
Regularization: add $δ I$ to $V$ before inversion ( $V + δ I$ )
Robust/resampled optimization, constraints on weights (box, turnover), or use of priors (Black–Litterman)

Nested Clustered Optimization (NCO)

The goal is to stabilize Markowitz by solving smaller, better-conditioned subproblems.

Cluster the correlation matrix to find groups of highly correlated assets.
Intracluster optimization:
- For each cluster, compute intracluster weights using the denoised covariance (call it cov1).
- Smaller cluster size means better-conditioned covariance estimates.
- With a highly correlated cluster, the min-variance solution tends toward equal weights, so instability is limited.
Intercluster optimization:
- Build reduced covariance (cov2) between clusters.
- cov2 is by construction close to diagonal; invertibility and conditioning are much improved.
Combine:
- Final asset weights = (intra-cluster weights) × (inter-cluster weights).
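The intracluster, intercluster and combination steps can be sketched as below. Two simplifications are assumed: the cluster assignment is passed in as a hypothetical input (the clustering step itself is omitted), and the covariance is used as given without denoising. Minimum-variance weights are used at both levels.

```python
import numpy as np

def min_var_weights(cov):
    """Fully invested minimum-variance weights."""
    x = np.linalg.solve(cov, np.ones(len(cov)))
    return x / x.sum()

def nco_weights(cov, clusters):
    """Nested optimization: weights within clusters, then across clusters."""
    n = len(cov)
    intra = np.zeros((n, len(clusters)))              # column k holds cluster k's weights
    for k, members in enumerate(clusters):
        idx = np.asarray(members)
        intra[idx, k] = min_var_weights(cov[np.ix_(idx, idx)])
    cov_reduced = intra.T @ cov @ intra               # covariance between cluster portfolios
    inter = min_var_weights(cov_reduced)
    return intra @ inter                              # final asset weights

# Toy covariance: two blocks, 0.9 correlation within and 0.1 across, 20% volatility
corr = np.full((4, 4), 0.1)
corr[:2, :2] = 0.9
corr[2:, 2:] = 0.9
np.fill_diagonal(corr, 1.0)
cov = 0.04 * corr
weights = nco_weights(cov, clusters=[[0, 1], [2, 3]])
print(weights)
```

By symmetry of this toy example every asset ends up with a weight of 0.25; the point of the sketch is the structure, where each linear solve involves only a small, better-conditioned matrix.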

Why NCO helps

Dimensionality reduction: each inversion is on smaller matrices.
Denoising at step 1 reduces noisy eigenvalues.
Aggregation produces a near-diagonal inter-cluster covariance, improving stability.

8. Testing Set Overfitting

Selection Bias under Multiple Testing (SBuMT)

SBuMT occurs when a researcher tests many strategies on the same historical dataset and then reports only the best-performing one.

This process leads to false discoveries: strategies that look profitable in-sample but fail in live trading.

True vs. False Investment Strategies & Precision/Recall

The total number of strategies is

s = s_{T} + s_{F}

where:

$s_{T}$ = number of true strategies (positive expected return)
$s_{F}$ = number of false strategies (zero or negative expected return)

The odds ratio of true to false strategies is:

θ = \frac{s _{T}}{s _{F}}

In finance, signal-to-noise ratio is usually very low, meaning most strategies are false.

So the number of true strategies is:

s_{T} = \frac{θ s}{1 + θ}

and the number of false ones:

s_{F} = \frac{s}{1 + θ}

Statistical Errors

False Positive Rate (Type I error): $α$
- False Positives (FP): $FP = α s_{F}$
- True Negatives (TN): $TN = (1 - α) s_{F}$
False Negative Rate (Type II error): $β$
- False Negatives (FN): $FN = β s_{T}$
- True Positives (TP): $TP = (1 - β) s_{T}$

The Precision (Positive Predictive Value) is defined as:

precision = \frac{TP}{TP + FP} = \frac{( 1 - β ) s _{T}}{( 1 - β ) s _{T} + α s _{F}} = \frac{( 1 - β ) θ}{( 1 - β ) θ + α}

and the Recall (Sensitivity) as:

recall = \frac{TP}{TP + FN} = 1 - β

Precision strongly depends on odds ratio θ.

A discovered strategy is more likely to be false than true if:

(1 - β) θ < α
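A quick numerical check of the precision formula, as a sketch with hypothetical inputs ( $θ = 0.01$, $α = 0.05$, $β = 0.2$ ). The function names are illustrative only.

```python
def strategy_counts(s: float, theta: float) -> tuple:
    """Split s strategies into (true, false) counts given the odds ratio theta."""
    s_true = theta * s / (1.0 + theta)
    s_false = s / (1.0 + theta)
    return s_true, s_false

def precision(theta: float, alpha: float, beta: float) -> float:
    """Precision = (1 - beta) * theta / ((1 - beta) * theta + alpha)."""
    tp = (1.0 - beta) * theta
    return tp / (tp + alpha)

def recall(beta: float) -> float:
    """Recall = 1 - beta."""
    return 1.0 - beta

theta, alpha, beta = 0.01, 0.05, 0.2  # hypothetical values
s_true, s_false = strategy_counts(1000, theta)
p = precision(theta, alpha, beta)
print(f"true={s_true:.1f}, false={s_false:.1f}, precision={p:.3f}, recall={recall(beta):.2f}")

# Here (1 - beta) * theta = 0.008 < alpha = 0.05, so precision is below 50%.
print("more likely false than true:", (1.0 - beta) * theta < alpha)
```

With odds this low, most discoveries are false even with a conventional 5% significance level and 80% power.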

Type I Error Across Multiple Trials

(False alarm: see a signal where none exists, identify a false positive)

In a single trial :

$P$ (Type I error) = $α$
$P$ (No Type I error) = $1 - α$

With $K$ independent trials:

$P$ (No Type I error in any trial): $(1 - α)^{K}$
$P$ (At least one Type I error) = Family-Wise Error Rate (FWER):

α_{K} = 1 - (1 - α)^{K}

FWER increases with $K$: the more strategies you test, the higher the chance of finding at least one false positive.
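A short sketch of how fast the family-wise error rate grows, assuming independent trials as in the formula above. The values of $α$ and $K$ are hypothetical.

```python
def fwer(alpha: float, k: int) -> float:
    """Family-wise error rate: probability of at least one Type I error in k independent trials."""
    return 1.0 - (1.0 - alpha) ** k

alpha = 0.05  # hypothetical single-trial significance level
for k in (1, 10, 100):
    print(f"K={k:3d}  FWER={fwer(alpha, k):.4f}")
```

At $α = 0.05$, ten independent trials already give roughly a 40% chance of at least one false positive.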

Type II Error Across Multiple Trials

(Missed discovery: fail to detect a true strategy)

In a single trial:

$P$ (Type II error) = $β$

With $K$ independent trials, the probability of missing the positive in every trial is:

$P$ (Type II error in all $K$ trials) $= β_{K} = β^{K}$

As $K ↑$ :

$α_{K} ↑$ (false positive risk grows)
$β_{K} ↓$ (less likely to miss all true positives)

Precision & Recall Adjusted for Multiple Testing

The Adjusted Precision can be defined as:

precision_{K} = \frac{( 1 - β ^{K} ) θ}{( 1 - β ^{K} ) θ + α _{K}} = \frac{( 1 - β ^{K} ) θ}{( 1 - β ^{K} ) θ + 1 - ( 1 - α ) ^{K}}

and the Adjusted Recall as:

recall_{K} = 1 - β^{K}

As $K$ increases:

Recall improves (lower chance of missing all true strategies).

Precision usually drops (higher risk of including false positives).
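The trade-off can be tabulated with a small sketch of the adjusted formulas. The inputs ( $θ = 0.01$, $α = 0.05$, $β = 0.2$ ) are hypothetical and the function names are illustrative.

```python
def adjusted_recall(beta: float, k: int) -> float:
    """recall_K = 1 - beta^K."""
    return 1.0 - beta ** k

def adjusted_precision(theta: float, alpha: float, beta: float, k: int) -> float:
    """precision_K, with alpha_K = 1 - (1 - alpha)^K and beta_K = beta^K."""
    alpha_k = 1.0 - (1.0 - alpha) ** k
    tp = (1.0 - beta ** k) * theta
    return tp / (tp + alpha_k)

theta, alpha, beta = 0.01, 0.05, 0.2  # hypothetical values
for k in (1, 5, 10, 50):
    print(
        f"K={k:2d}  precision_K={adjusted_precision(theta, alpha, beta, k):.4f}"
        f"  recall_K={adjusted_recall(beta, k):.4f}"
    )
```

For these inputs recall approaches 1 after a handful of trials, while precision keeps falling because $α_{K}$ keeps growing.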

The Sharpe Ratio Framework

Given the strategy excess returns:

r_{t} \sim N (μ, σ^{2}), t = 1, \dots, T

and the Population (true) Sharpe Ratio:

SR = \frac{μ}{σ}

The Sample (estimated) Sharpe Ratio is defined as:

\widehat{SR} = \frac{\bar{r}}{s_{r}}

where $\bar{r}$ is the sample mean return and $s_{r}$ is the sample standard deviation.

The asymptotic distribution of the sample SR is as follows.

Case 1: IID Normal Returns

\sqrt{T} (\widehat{SR} - SR) \xrightarrow{d} N (0, 1 + \frac{SR^{2}}{2})

Case 2: Non-Normal Returns

Empirical evidence: hedge fund returns often exhibit:

Negative skewness
Positive excess kurtosis

\sqrt{T} (\widehat{SR} - SR) \xrightarrow{d} N (0, 1 + \frac{SR^{2}}{2} - γ_{3} SR + \frac{γ_{4} - 3}{4} SR^{2})

where:

$γ_{3}$ = skewness
$γ_{4}$ = kurtosis

Special case: for Normal returns ( $γ_{3} = 0, γ_{4} = 3$ ), we recover the initial formula.
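The sketch below turns the asymptotic result into a standard error for the estimated Sharpe ratio. It assumes the asymptotic variance $1 + \frac{SR^{2}}{2} - γ_{3} SR + \frac{γ_{4} - 3}{4} SR^{2}$ (which collapses to the Normal case when $γ_{3} = 0, γ_{4} = 3$ ), a per-period (non-annualized) Sharpe ratio, and hypothetical input values. The function name is illustrative.

```python
import math

def sharpe_std_error(sr: float, t: int, skew: float = 0.0, kurt: float = 3.0) -> float:
    """Asymptotic standard error of the sample Sharpe ratio over t observations."""
    var = 1.0 + 0.5 * sr ** 2 - skew * sr + (kurt - 3.0) / 4.0 * sr ** 2
    return math.sqrt(var / t)

sr, t = 0.1, 250  # hypothetical per-period Sharpe ratio and sample length

se_normal = sharpe_std_error(sr, t)
se_fat_tails = sharpe_std_error(sr, t, skew=-1.0, kurt=8.0)  # hypothetical moments

print(f"standard error, Normal returns:          {se_normal:.4f}")
print(f"standard error, negative skew/fat tails: {se_fat_tails:.4f}")
```

For a positive Sharpe ratio, negative skewness and excess kurtosis both widen the standard error, so the same estimate carries less evidence.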

SBuMT and Maximum Sharpe Ratios

Given $K$ independent strategies with Sharpe ratios:

\widehat{SR}_{k} \sim N (0, V [\widehat{SR}_{k}]), k = 1, \dots, K

the expected maximum of the $K$ estimated Sharpe ratios is approximately:

E [\max_{k} \widehat{SR}_{k}] \approx \sqrt{V [\widehat{SR}_{k}]} ((1 - γ) Z^{- 1} (1 - \frac{1}{K}) + γ Z^{- 1} (1 - \frac{1}{K e}))

where:

$Z^{- 1} (\cdot)$ = quantile function of standard Normal
$γ \approx 0.5772$ = Euler–Mascheroni constant
$e \approx 2.718$

The more strategies you try ( $K ↑$ ), the higher the expected maximum Sharpe ratio by chance.

So a discovered strategy is likely a false positive unless:

\max_{k} \widehat{SR}_{k} ≫ E [\max_{k} \widehat{SR}_{k}]
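The approximation for the expected maximum can be evaluated with the Python standard library alone. In this sketch the parameter `sr_std` is an assumption: it is the standard deviation of the estimated Sharpe ratios across trials and scales the result, and with the default `sr_std=1.0` the function returns the standardized value. The formula needs $K ≥ 2$, since the Normal quantile is undefined at 0.

```python
import math
from statistics import NormalDist

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant

def expected_max_sr(k: int, sr_std: float = 1.0) -> float:
    """Approximate expected maximum of k independent zero-mean Sharpe ratio estimates."""
    if k < 2:
        raise ValueError("k must be at least 2")
    z_inv = NormalDist().inv_cdf  # quantile function of the standard Normal
    return sr_std * (
        (1.0 - EULER_GAMMA) * z_inv(1.0 - 1.0 / k)
        + EULER_GAMMA * z_inv(1.0 - 1.0 / (k * math.e))
    )

for k in (10, 100, 1000):
    print(f"K={k:4d}  E[max SR] / std = {expected_max_sr(k):.3f}")
```

Even when every strategy has a true Sharpe ratio of zero, the best of a few hundred trials is expected to sit well over two standard deviations above zero.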

Deflated Sharpe Ratio (DSR)

Adjusts Sharpe ratio for:

Multiple testing ( $K$ )
Sample size ( $T$ )
Skewness ( $γ_{3}$ ) and kurtosis ( $γ_{4}$ )

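DSR = \frac{\widehat{SR} - E [\max_{k} \widehat{SR}_{k}]}{\sqrt{\text{Adjusted Variance}}}

where the adjusted variance is the variance of the estimated Sharpe ratio given $T$ , $γ_{3}$ , and $γ_{4}$ .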

A high DSR provides stronger evidence of true profitability. A low DSR means that the strategy is likely a false discovery.
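These notes give the DSR only schematically. The sketch below assumes the probabilistic form published by Bailey and López de Prado, in which the DSR is the Normal CDF of the estimated Sharpe ratio's distance from the benchmark $E [\max_{k} \widehat{SR}_{k}]$, scaled by its estimated standard error. Treat that exact form, the function name, and all input values as assumptions. The benchmark is passed in as `sr_benchmark`.

```python
import math
from statistics import NormalDist

def deflated_sharpe_ratio(
    sr_hat: float, sr_benchmark: float, t: int, skew: float = 0.0, kurt: float = 3.0
) -> float:
    """Probability-style DSR: sr_hat and sr_benchmark are per-period Sharpe ratios,
    t is the number of return observations (assumed form, see lead-in)."""
    denom = math.sqrt(1.0 - skew * sr_hat + (kurt - 1.0) / 4.0 * sr_hat ** 2)
    z = (sr_hat - sr_benchmark) * math.sqrt(t - 1) / denom
    return NormalDist().cdf(z)

# Hypothetical inputs: observed SR, two alternative benchmarks, and sample moments.
sr_hat, t = 0.15, 500
dsr_few_trials = deflated_sharpe_ratio(sr_hat, sr_benchmark=0.05, t=t, skew=-0.5, kurt=5.0)
dsr_many_trials = deflated_sharpe_ratio(sr_hat, sr_benchmark=0.14, t=t, skew=-0.5, kurt=5.0)

print(f"DSR with a low benchmark (few trials):   {dsr_few_trials:.3f}")
print(f"DSR with a high benchmark (many trials): {dsr_many_trials:.3f}")
```

The same observed Sharpe ratio looks much less convincing once the benchmark reflects a large number of trials.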

Clustering Strategies & Cluster-Level Returns

The $N$ strategies are grouped into $K ≪ N$ clusters based on their correlations, which reduces the effective number of independent trials used to correct for multiple-testing bias.

The aggregate return series for each cluster is:

S_{k, t} = \sum_{i \in C_{k}} w_{k, i} r_{i, t}

The weights are chosen to minimize the cluster variance:

w_{k} = \frac{Σ _{k}^{- 1} 1 _{k}}{1 _{k}^{⊤} Σ _{k}^{- 1} 1 _{k}}

where:

$Σ_{k}$ = covariance matrix of returns within cluster $k$
$1_{k}$ = vector of ones of appropriate size

Clustering reduces effective number of trials $(K ≪ N)$ :

Lowers false discovery probability
Produces a more realistic null distribution
Improves precision of DSR-based significance testing
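A minimal sketch of the cluster-level aggregation, assuming the cluster labels are already available from a clustering step and using simulated returns in place of real strategy returns. The function name and all data are hypothetical.

```python
import numpy as np

def cluster_return_series(returns: np.ndarray, labels) -> dict:
    """Aggregate strategy returns into one minimum-variance series per cluster.

    returns: array of shape (T, N), one column per strategy.
    labels:  length-N cluster ids (assumed to come from a prior clustering step).
    """
    labels = np.asarray(labels)
    series = {}
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        r = returns[:, idx]
        if len(idx) == 1:
            w = np.ones(1)  # a single-strategy cluster keeps its own returns
        else:
            sigma = np.cov(r, rowvar=False)  # covariance within the cluster
            x = np.linalg.solve(sigma, np.ones(len(idx)))
            w = x / x.sum()  # minimum-variance weights, summing to 1
        series[c.item()] = r @ w
    return series

rng = np.random.default_rng(0)
returns = rng.normal(0.0, 0.01, size=(250, 5))  # simulated returns, 5 strategies
labels = [0, 0, 1, 1, 1]

cluster_series = cluster_return_series(returns, labels)
for c, s in cluster_series.items():
    print(f"cluster {c}: T={len(s)}, std={s.std(ddof=1):.5f}")
```

The multiple-testing correction is then applied to the $K$ cluster series rather than to the $N$ original strategies.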
