Cointegration Rules Spread Selection

The rules selection flow diagram from A Machine Learning based Pairs Trading Investment Strategy. by Simão Moraes Sarmento and Nuno Horta.

The rules that each spread needs to pass are:

  • The spread’s constituents are cointegrated. Literature suggests cointegration performs better than minimum distance and correlation approaches

  • The spread’s spread Hurst exponent reveals a mean-reverting character. Extra layer of validation.

  • The spread’s spread diverges and converges within convenient periods.

  • The spread’s spread reverts to the mean with enough frequency.

../_images/pairs_selection_rules_diagram1.png

To test for cointegration, the framework proposes the application of the Engle-Granger test, due to its simplicity. One critic Armstrong (2001) points at the Engle-Granger test sensitivity to the ordering of variables. It is a possibility that one of the relationships will be cointegrated, while the other will not. This is troublesome because we would expect that if the variables are truly cointegrated the two equations will yield the same conclusion.

To mitigate this issue, the original paper proposes that the Engle-Granger test is run for the two possible selections of the dependent variable and that the combination that generated the lowest t-statistic is selected. Further work in Hoel (2013) adds on, “the unsymmetrical coefficients imply that a hedge of long / short is not the opposite of long / short , i.e. the hedge ratios are inconsistent”.

A better solution is proposed and implemented, based on Gregory et al. (2011) to use orthogonal regression – also referred to as Total Least Squares (TLS) – in which the residuals of both dependent and independent variables are taken into account. That way, we incorporate the volatility of both legs of the spread when estimating the relationship so that hedge ratios are consistent, and thus the cointegration estimates will be unaffected by the ordering of variables.

Hudson & Thames research team has also found out that optimal (in terms of cointegration tests statistics) hedge ratios are obtained by minimizng spread’s half-life of mean-reversion. Alongside this hedge ration calculation method, there is a wide variety of algorithms to choose from: TLS, OLS, Johansen Test Eigenvector, Box-Tiao Canonical Decomposition, Minimum Half-Life, Minimum ADF Test T-statistic Value.

Note

More information about the hedge ratio methods and their use can be found in the Hedge Ratio Calculations section of the documentation.

Secondly, an additional validation step is also implemented to provide more confidence in the mean-reversion character of the pairs’ spread. The condition imposed is that the Hurst exponent associated with the spread of a given pair is enforced to be smaller than 0.5, assuring the process leans towards mean-reversion.

In third place, the pair’s spread movement is constrained using the half-life of the mean-reverting process. In the framework paper the strategy built on top of the selection framework is based on the medium term price movements, so for this reason the spreads that either have very short (< 1 day) or very long mean-reversion (> 365 days) periods were not suitable.

Lastly, we enforce that every spread crosses its mean at least once per month, to provide enough liquidity and thus providing enough opportunities to exit a position.

Note

In practice to calculate the spread of the pairs supplied by this module, it is important to also consider the hedge ratio as follows:

\(S = leg1 - (hedgeratio_2) * leg2 - (hedgeratio_3) * leg3 - .....\)

Warning

The pairs selection function is very computationally heavy, so execution is going to be long and might slow down your system.

Note

The user may specify thresholds for each pair selector rule from the framework described above. For example, Engle-Granger test 99% threshold may seem too strict in pair selection which can be decreased to either 95% or 90%. On the other hand, the user may impose more strict thresholds on half life of mean reversion.

Note

H&T teams has extended pair selection rules to higher dimensions such that filtering rules can be applied to any spread, not only pairs. As a result, the module can be applied in statistical arbitrage applications.

Implementation

This module implements the ML based Pairs Selection Framework described by Simão Moraes Sarmento and Nuno Horta in “A Machine Learning based Pairs Trading Investment Strategy.”.

class CointegrationSpreadSelector(prices_df: DataFrame | None = None, baskets_to_filter: list | None = None)

Implementation of the Proposed Pairs Selection Framework in the following paper: “A Machine Learning based Pairs Trading Investment Strategy.”. H&T team improved it to work not only with pairs, but also with spreads.

__init__(prices_df: DataFrame | None = None, baskets_to_filter: list | None = None)

Constructor. Sets up the price series needed for the next step.

Parameters:
  • prices_df – (pd.DataFrame) Asset prices universe.

  • baskets_to_filter – (list) List of tuples of tickers baskets to filter (can be pairs (AAA, BBB) or higher dimensions (AAA, BBB, CCC)).

CointegrationSpreadSelector.select_spreads(hedge_ratio_calculation: str = 'OLS', adf_cutoff_threshold: float = 0.95, hurst_exp_threshold: float = 0.5, min_crossover_threshold: int = 12, min_half_life: float = 365) list

Apply cointegration selection rules (ADF, Hurst, Min SMA crossover, Min Half-Life) to filter-out pairs/baskets.

Check to see if pairs comply with the criteria supplied in the paper: the pair being cointegrated, the Hurst exponent being <0.5, the spread moves within convenient periods and finally that the spread reverts to the mean with enough frequency.

Parameters:
  • hedge_ratio_calculation – (str) Defines how hedge ratio is calculated. Can be either ‘OLS’, ‘TLS’ (Total Least Squares), ‘min_half_life’, ‘min_adf’, ‘johansen’, ‘box_tiao’.

  • adf_cutoff_threshold – (float) ADF test threshold used to define if the spread is cointegrated. Can be 0.99, 0.95 or 0.9.

  • hurst_exp_threshold – (float) Max Hurst threshold value.

  • min_crossover_threshold – (int) Minimum amount of mean crossovers per analysed period.

  • min_half_life – (float) Minimum Half-Life of mean reversion value in units of time used.

Returns:

(list) Tuple list of final pairs.

CointegrationSpreadSelector.generate_spread_statistics(spread_series: Series, log_info: bool = True) dict

Generate spread filtering statistics (Hurst, ADF, HL, Crossovers).

Parameters:
  • spread_series – (pd.Series) Spread values series.

  • log_info – (bool) Flag indicating that information should be logged into self.selection_logs.

Returns:

(dict) Dictionary with statistics.

CointegrationSpreadSelector.construct_spreads(hedge_ratio_calculation: str) dict

For self.baskets_to_filter construct spreads and log hedge ratio calculated based on hedge_ratio_calculation.

Parameters:

hedge_ratio_calculation – (str) Defines how hedge ratio is calculated. Can be either ‘OLS’, ‘TLS’ (Total Least Squares), ‘min_half_life’, ‘min_adf’, ‘johansen’, ‘box_tiao’.

Returns:

(dict) Dictionary of generated spreads (tuple: pd.Series).

CointegrationSpreadSelector.apply_filtering_rules(adf_cutoff_threshold: float = 0.95, hurst_exp_threshold: float = 0.5, min_crossover_threshold: int = 12, min_half_life: float = 365) list

Apply cointegration selection rules (ADF, Hurst, Min SMA crossover, Min Half-Life) to filter-out pairs/baskets.

Check to see if pairs comply with the criteria supplied in the paper: the pair being cointegrated, the Hurst exponent being <0.5, the spread moves within convenient periods and finally that the spread reverts to the mean with enough frequency.

Parameters:
  • adf_cutoff_threshold – (float) ADF test threshold used to define if the spread is cointegrated. Can be 0.99, 0.95 or 0.9.

  • hurst_exp_threshold – (float) Max Hurst threshold value.

  • min_crossover_threshold – (int) Minimum amount of mean crossovers per analysed period.

  • min_half_life – (float) Minimum Half-Life of mean reversion value in units of time used.

Returns:

(list) Tuple list of final pairs.

Examples

# Importing packages
import pandas as pd
import numpy as np
from arbitragelab.spread_selection.cointegration import CointegrationSpreadSelector

url = "https://raw.githubusercontent.com/hudson-and-thames/example-data/main/arbitrage_lab_data/sp100_prices.csv"
data = pd.read_csv(url, index_col=0, parse_dates=[0])
input_spreads = [('ABMD', 'AZO'), ('AES', 'BBY'), ('BKR', 'CE'), ('BKR', 'CE', 'AMZN')]
pairs_selector = CointegrationSpreadSelector(prices_df=data, baskets_to_filter=input_spreads)
filtered_spreads = pairs_selector.select_spreads(hedge_ratio_calculation='TLS',
                                                 adf_cutoff_threshold=0.9,
                                                 hurst_exp_threshold=0.55,
                                                 min_crossover_threshold=0,
                                                 min_half_life=20)

# Statistics logs data frame.
logs = pairs_selector.selection_logs.copy()

# A user may also specify own constructed spread to be tested.
spread = pd.read_csv('spread.csv', index_col=0, parse_dates=[0])
pairs_selector = CointegrationSpreadSelector(prices_df=None, baskets_to_filter=None)
# Using log_info=True to save stats.
stats = pairs_selector.generate_spread_statistics(spread, log_info=True)
print(pairs_selector.selection_logs)
print(stats)
filtered_spreads = pairs_selector.apply_filtering_rules(adf_cutoff_threshold=0.99,
                                                        hurst_exp_threshold=0.5)

Research Notebooks

The following research notebook can be used to better understand the Cointegration Rules Spread Selection described above.

References