Note

The following documentation closely follows a paper by Gatev et al. (2006): Pairs trading: Performance of a relative-value arbitrage rule.

It also draws on a paper by Christopher Krauss (2015): Statistical arbitrage pairs trading strategies: Review and outlook.

Distance Approach


The distance approach was introduced in the paper by Gatev et al. (2006), one of the most cited pairs trading papers at the time of writing this documentation. The approach described in the paper is the following: First, a historical period is defined, and cumulative returns for assets in this period are normalized. Second, using the Euclidean squared distance on the normalized price time series, the \(n\) closest pairs of assets are picked. In the original work, the historical period used to quantify distances between price series was set to 12 months, and the top 20 pairs were chosen.

Unlike the mean reversion approach, the distance approach performs no cointegration tests. As noted in the work by Krauss (2015), dependencies found using this approach can therefore be spurious. This also leads to higher divergence risk: as shown in the work by Do and Faff (2010), up to 32% of pairs identified by this method do not converge.

After the pairs are formed, the trading period starts and the trading signals are generated. The mechanism behind this process is the following: if the difference between the normalized prices of the elements in a pair diverges by more than 2 standard deviations (calculated for each pair during the training period), positions are opened - long for the element with the lower price and short for the element with the higher price. These positions are closed when the normalized prices cross or when the trading period ends. Using this standard description, the distance approach is a parameter-free strategy.

There are, however, possible adjustments to this strategy, such as choosing distances other than the Euclidean square distance or adjusting the threshold to enter a trade for each pair.

Pairs Formation

This stage of the DistanceStrategy consists of the following steps:

  1. Normalization of the input data.

To use the Euclidean square distance, the training price time series are normalized using the following formula:

\[P_{normalized} = \frac{P - min(P)}{max(P) - min(P)}\]

where \(P\) is the training price series of an asset, \(min(P)\) and \(max(P)\) are the minimum and maximum values from the price series.
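The normalization above can be sketched as follows. This is a minimal illustration that assumes prices are stored in a pandas DataFrame with one column per asset; the helper name `normalize_prices` and the tickers are hypothetical and are not part of the library.

```python
import pandas as pd


def normalize_prices(prices: pd.DataFrame) -> pd.DataFrame:
    """Min-max scale every column to the [0, 1] range (hypothetical helper)."""
    min_values = prices.min()  # per-asset minimum over the training period
    max_values = prices.max()  # per-asset maximum over the training period
    return (prices - min_values) / (max_values - min_values)


# Made-up training prices for two made-up tickers
train = pd.DataFrame({"AAA": [10.0, 12.0, 11.0, 14.0],
                      "BBB": [50.0, 55.0, 52.0, 60.0]})
normalized = normalize_prices(train)
```

After scaling, each series starts from its own minimum at 0 and reaches its own maximum at 1, so assets with very different price levels become directly comparable.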

  2. Finding pairs.

Using the normalized price series, the distances between each pair of assets are calculated. These distances are then sorted in ascending order and the \(n\) closest pairs are picked (our function also allows skipping a number of first pairs, so one can choose pairs 10-15 to study).

The distances between elements (Euclidean square distance - SSD) are calculated as:

\[SSD = \sum^{N}_{t=1} (P^1_t - P^2_t)^{2}\]

where \(P^1_t\) and \(P^2_t\) are normalized prices at time \(t\) for the first and the second elements in a pair.

For each pair, a portfolio is constructed as the difference between the normalized prices of its elements.
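As a rough sketch of this step, the snippet below computes the SSD for every unique pair of columns, ranks the pairs, and builds the portfolio of the closest pair. The helper `rank_pairs_by_ssd` and the sample data are hypothetical; only the names `num_top` and `skip_top` are borrowed from the `form_pairs` signature, and the slicing shown is an assumption about how they interact.

```python
from itertools import combinations

import pandas as pd


def rank_pairs_by_ssd(normalized: pd.DataFrame) -> list:
    """Return [((first, second), ssd), ...] sorted by ascending SSD."""
    distances = {
        (first, second): float(((normalized[first] - normalized[second]) ** 2).sum())
        for first, second in combinations(normalized.columns, 2)
    }
    return sorted(distances.items(), key=lambda item: item[1])


# Made-up normalized prices: AAA and BBB move together, CCC moves against them
normalized = pd.DataFrame({"AAA": [0.0, 0.5, 1.0],
                           "BBB": [0.0, 0.5, 1.0],
                           "CCC": [1.0, 0.5, 0.0]})
ranked = rank_pairs_by_ssd(normalized)

# Skip the first `skip_top` pairs, then keep the next `num_top` pairs
num_top, skip_top = 1, 0
selected = ranked[skip_top:skip_top + num_top]

# Portfolio of the closest pair: difference of the normalized prices
(first, second), _ = selected[0]
portfolio = normalized[first] - normalized[second]
```

Using `combinations` guarantees that ('AAA', 'BBB') and ('BBB', 'AAA') are treated as one pair.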

  3. Calculating historical volatility.

For the \(n\) portfolios (differences between normalized price series of elements) calculated in the previous step, the volatility is calculated. The historical standard deviations of these portfolios will later be used to generate trading signals.
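A minimal sketch of the volatility calculation, assuming the pairs are already chosen. The data and the dictionary layout are illustrative only. Note that pandas `std()` returns the sample standard deviation (`ddof=1`); which convention the library uses is not stated here, so treat that choice as an assumption.

```python
import pandas as pd

# Made-up normalized training prices and one chosen pair
normalized = pd.DataFrame({"AAA": [0.0, 0.4, 0.7, 1.0],
                           "BBB": [0.0, 0.6, 0.5, 1.0]})
pairs = [("AAA", "BBB")]

# Historical standard deviation of each pair portfolio (spread)
train_std = {
    (first, second): (normalized[first] - normalized[second]).std()
    for first, second in pairs
}
```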

../_images/distance_approach_pair.png

An example of two normalized price series of assets that have a low Euclidean square distance. These assets can be used to construct a portfolio.

../_images/distance_approach_portfolio.png

Portfolio value (difference of normalized price series), constructed from a pair of elements from the previous example.

Pair selection criteria

As the basic pairs formation approach has shown declining profitability in pairs trading, more refined pair selection criteria have emerged. Here, we describe three methods of selecting pairs for trading that differ from the basic approach.

The first is only allowing matches between securities within the same industry group. The second is sorting selected pairs based on the number of zero-crossings in the formation period, and the third is sorting selected pairs based on the historical standard deviation, where pairs with a high standard deviation are selected. These selection methods are inspired by the work by Do and Faff (2010, 2012).

  1. Pairs within the same industry group

In the pairs formation step above, one can add this method when finding pairs in order to match securities within the same industry group.

Given a dictionary that maps the name/ticker of each security to its industry group, the securities are first separated into industry groups. Then, by calculating the Euclidean square distance for each pair within the same group, the \(n\) closest pairs are selected (by default, our function also allows skipping a number of first pairs, so one can choose pairs 10-15 to study). If a dictionary of industry groups is given as an input, this criterion is applied first, before other methods such as zero-crossings or variance.
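The industry restriction can be illustrated with the sketch below, which only measures distances between securities that share an industry group. The helper `pairs_within_industry`, the tickers, and the group names are all hypothetical; the library's internal implementation may differ.

```python
from itertools import combinations

import pandas as pd


def pairs_within_industry(normalized: pd.DataFrame, industry_dict: dict) -> list:
    """Rank pairs by SSD, keeping only pairs from the same industry group."""
    ranked = []
    for first, second in combinations(normalized.columns, 2):
        if industry_dict[first] != industry_dict[second]:
            continue  # different industry groups: not a candidate pair
        ssd = float(((normalized[first] - normalized[second]) ** 2).sum())
        ranked.append(((first, second), ssd))
    return sorted(ranked, key=lambda item: item[1])


normalized = pd.DataFrame({"AAA": [0.0, 0.5, 1.0],
                           "BBB": [0.0, 0.4, 1.0],
                           "CCC": [0.0, 0.5, 1.0]})
industry_dict = {"AAA": "Energy", "BBB": "Energy", "CCC": "Retail"}
ranked = pairs_within_industry(normalized, industry_dict)
```

In this made-up data AAA and CCC are identical, but they are never paired because they belong to different groups; only ('AAA', 'BBB') remains.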

  2. Pairs with a higher number of zero-crossings

The number of zero crossings in the formation period has a positive relation to the future spread convergence according to the work by Do and Faff (2010).

After pairs have been matched, either within the same industry group or across all industries, the top \(n\) pairs with the highest number of zero crossings during the formation period are admitted to the portfolio. This method incorporates the time-series dimension of the historical data in the form of the number of zero crossings.
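One possible way to count zero crossings is to count sign changes of the spread while ignoring observations that are exactly zero. This definition, the helper `count_zero_crossings`, and the sample numbers are assumptions made for illustration; the library may count crossings differently.

```python
import numpy as np
import pandas as pd


def count_zero_crossings(spread: pd.Series) -> int:
    """Count sign changes of the spread, skipping exact zeros."""
    signs = np.sign(spread.to_numpy())
    signs = signs[signs != 0]  # exact zeros carry no sign information
    return int(np.sum(signs[1:] != signs[:-1]))


spread = pd.Series([0.1, -0.2, -0.1, 0.3, 0.0, 0.2, -0.1])
crossings = count_zero_crossings(spread)  # 3 sign changes

# Ranking made-up pairs by their crossing counts, highest first
crossing_counts = {("AAA", "BBB"): 3, ("AAA", "CCC"): 7, ("BBB", "CCC"): 5}
num_top = 2
top_pairs = sorted(crossing_counts, key=crossing_counts.get, reverse=True)[:num_top]
```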

  3. Pairs with a higher historical standard deviation

The historical standard deviation calculated in the formation period can also be used as a criterion to sort selected pairs. According to the work of Do and Faff (2010), selecting pairs with a small SSD also decreases the variance of the spread, which limits the profit potential; choosing pairs with a higher spread variance could therefore increase the expected profitability of the method.

After pairs have been matched, we can sort them by their historical standard deviation in the formation period and select the top \(n\) pairs with the highest variance of the spread.
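The sketch below shows one plausible reading of this selection: first keep a pool of the closest pairs by SSD, then re-sort that pool by spread standard deviation. The names `selection_pool` and `num_top` come from the `form_pairs` signature, but how the library combines them is assumed here, and all numbers are made up.

```python
import pandas as pd

# Made-up candidates with their SSD and spread standard deviation
candidates = pd.DataFrame(
    {"ssd": [0.10, 0.15, 0.20, 0.90],
     "spread_std": [0.02, 0.05, 0.03, 0.30]},
    index=["AAA-BBB", "AAA-CCC", "BBB-CCC", "CCC-DDD"],
)

selection_pool = 3  # number of closest pairs kept before re-sorting
num_top = 2         # number of pairs finally selected

# Step 1: keep the closest pairs by SSD
pool = candidates.nsmallest(selection_pool, "ssd")

# Step 2: within the pool, prefer pairs with the highest spread volatility
selected = pool.sort_values("spread_std", ascending=False).head(num_top)
```

The pair CCC-DDD has the largest standard deviation but is excluded because its SSD places it outside the pool, so the result still consists of closely matched pairs.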

Implementation

Implementation of the statistical arbitrage distance approach proposed by Gatev, E., Goetzmann, W. N., and Rouwenhorst, K. G. in “Pairs trading: Performance of a relative-value arbitrage rule.” (2006) https://papers.ssrn.com/sol3/papers.cfm?abstract_id=141615.

class DistanceStrategy

Class for creation of trading signals following the strategy by Gatev, E., Goetzmann, W. N., and Rouwenhorst, K. G. in “Pairs trading: Performance of a relative-value arbitrage rule.” (2006) https://papers.ssrn.com/sol3/papers.cfm?abstract_id=141615.

__init__()

Initialize Distance strategy.

DistanceStrategy.form_pairs(train_data, method='standard', industry_dict=None, num_top=5, skip_top=0, selection_pool=50, list_names=None)

Forms pairs based on input training data.

This method includes procedures from the pairs formation step of the distance strategy.

First, the input data is normalized using max and min price values for each series: Normalized = (Price - Min(Price)) / (Max(Price) - Min(Price))

Second, the normalized data is used to find a pair for each element - another series of prices that would have a minimum sum of square differences between normalized prices. Only unique pairs are picked in this step (pairs (‘AA’, ‘BD’) and (‘BD’, ‘AA’) are assumed to be one pair (‘AA’, ‘BD’)). During this step, if one decides to match pairs within the same industry group, with the industry dictionary given, the sum of square differences is calculated only for the pairs of prices within the same industry group.

Third, the desired number of top pairs is taken from the list of pairs created in the previous step, after skipping the requested number of first pairs. Pairs are sorted so that the ones with a smaller sum of square distances are placed at the top of the list.

Finally, the historical volatility for the portfolio of each chosen pair is calculated. The portfolio here is the difference of the normalized prices of the two elements in a pair. Historical volatility will later be used in the testing (trading) step of the distance strategy. The formula for calculating the portfolio price is: Portfolio_price = Normalized_price_A - Normalized_price_B

Note: The input dataframe to this method should not contain missing values, as observations with missing values will be dropped (otherwise elements with fewer observations would have smaller distance to all other elements).
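To see why missing values matter, consider the illustration below. It uses raw (not normalized) made-up prices for brevity, and shows that a series with a gap contributes fewer terms to the sum, so it looks closer than an otherwise identical complete series. Dropping incomplete rows beforehand, as sketched with `dropna`, puts all series on the same footing.

```python
import numpy as np
import pandas as pd

prices = pd.DataFrame({"AAA": [10.0, 11.0, np.nan, 13.0],   # has a gap
                       "BBB": [20.0, 21.0, 22.0, 23.0],
                       "CCC": [10.0, 11.0, 12.0, 13.0]})    # AAA without the gap

# pandas skips NaN when summing, so the gap removes one term from the SSD
ssd_gap = ((prices["AAA"] - prices["BBB"]) ** 2).sum()   # 3 terms
ssd_full = ((prices["CCC"] - prices["BBB"]) ** 2).sum()  # 4 terms

# Removing rows with any missing value gives every series the same length
clean = prices.dropna(axis=0, how="any")
```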

Parameters:
  • train_data – (pd.DataFrame/np.array) Dataframe with training data used to create asset pairs.

  • num_top – (int) Number of top pairs to use for portfolio formation.

  • skip_top – (int) Number of first top pairs to skip. For example, use skip_top=10 if you’d like to take num_top pairs starting from the 10th one.

  • list_names – (list) List containing names of elements if Numpy array is used as input.

  • method – (str) Methods to use for sorting pairs [standard by default, variance, zero_crossing].

  • selection_pool – (int) Number of pairs to use before sorting them with the selection method.

  • industry_dict – (dict) Dictionary matching ticker to industry group.

Trading signals generation

After the pairs are formed, we can proceed to the second stage of the DistanceStrategy - trading signals generation. The input to this stage is a dataframe with testing price series for the assets, which was not used in the pairs formation stage.

This stage of the DistanceStrategy consists of the following steps:

  1. Normalization of the input data.

Using the same approach as in the pairs formation stage, we normalize the input trading dataset using the same maximum and minimum historical values from the training price series.
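A short sketch of this step, with made-up prices: the scaling parameters are taken from the training data only and then applied to the testing data, so no information from the trading period leaks into the normalization.

```python
import pandas as pd

train = pd.DataFrame({"AAA": [10.0, 12.0, 14.0], "BBB": [50.0, 55.0, 60.0]})
test = pd.DataFrame({"AAA": [13.0, 15.0], "BBB": [49.0, 58.0]})

# Scaling parameters come from the training period only
min_value = train.min()
max_value = train.max()

# The same formula as in the pairs formation stage, applied to new data
test_normalized = (test - min_value) / (max_value - min_value)
```

Because testing prices can move outside the training range, normalized testing values may fall below 0 or above 1 (here 15.0 maps to 1.25 and 49.0 maps to -0.1).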

  2. Portfolios creation.

In this step, the portfolios are constructed based on the asset pairs chosen in the pairs formation step. Portfolio value series are differences between the normalized price series of the elements in a pair, which corresponds to a long position in the first element and a short position in the second element. A buy signal generated by the strategy means going long on the first element and short on the second. A sell signal means the opposite - going short on the first element and long on the second element.

  3. Generating signals.

If the portfolio value exceeds two historical standard deviations, a sell signal is generated - we expect the price of the first element to decrease and the price of the second element to increase. If the value of the portfolio is below minus two historical standard deviations, a buy signal is generated.

An open position is closed when the portfolio value crosses the zero mark, that is, when the normalized prices of the elements in a pair cross. So at any given time, each pair has either one active position (buy or sell) or none. This makes cost allocation for the strategy easier. The resulting trading signals are target quantities of portfolios to hold for each pair (with values -1, 0, or +1).
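The entry and exit rules above can be sketched as a small state machine. The helper `generate_signals` is hypothetical and only mirrors the description in this section; details such as whether a new position may be opened on the same observation that closes an old one are assumptions.

```python
import pandas as pd


def generate_signals(portfolio: pd.Series, train_std: float,
                     divergence: float = 2.0) -> pd.Series:
    """Target quantity per observation: -1 (short), 0 (flat) or +1 (long)."""
    threshold = divergence * train_std
    position = 0
    signals = []
    for value in portfolio:
        # Close an open position once the portfolio value reaches or crosses zero
        if position == -1 and value <= 0:
            position = 0
        elif position == 1 and value >= 0:
            position = 0
        # Open a new position only when no position is active
        if position == 0:
            if value > threshold:
                position = -1  # sell signal: short the portfolio
            elif value < -threshold:
                position = 1   # buy signal: long the portfolio
        signals.append(position)
    return pd.Series(signals, index=portfolio.index)


# Made-up portfolio values with a historical standard deviation of 0.1
portfolio = pd.Series([0.0, 0.1, 0.25, 0.15, -0.05, -0.3, -0.1, 0.02])
signals = generate_signals(portfolio, train_std=0.1, divergence=2)
```

With a threshold of 0.2, the short position opened at 0.25 is held until the value turns negative, and the long position opened at -0.3 is held until the value turns positive.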

../_images/distance_approach_results_portfolio.png

Portfolio value for a pair of assets and the corresponding target quantities to hold for each observation.

Implementation

DistanceStrategy.trade_pairs(test_data, divergence=2)

Generates trading signals for formed pairs based on new testing (trading) data.

This method includes procedures from the trading step of the distance strategy.

First, the input test data is normalized with the min and max price values from the pairs formation step (so we’re not using future data when creating signals). Normalized = (Test_Price - Min(Train_Price)) / (Max(Train_Price) - Min(Train_Price))

Second, pair portfolios (differences of normalized price series) are constructed based on the chosen top pairs from the pairs formation step.

Finally, for each pair portfolio trading signals are created. The logic of the trading strategy is the following: we open a position when the absolute portfolio value (difference between normalized prices) is bigger than divergence * historical_standard_deviation. We close the position when the portfolio price changes sign (when the normalized prices of the elements cross).

Positions can be opened in two directions. The portfolio is defined as a long position in the first element of a pair and a short position in the second element. The price of the portfolio is then:

Portfolio_price = Normalized_price_A - Normalized_price_B

If Portfolio_price > divergence * st_deviation, we open a short position on this portfolio.

If Portfolio_price < - divergence * st_deviation, we open a long position on this portfolio.

Both these positions will be closed once Portfolio_price reaches zero.

Parameters:
  • test_data – (pd.DataFrame/np.array) Dataframe with testing data used to create trading signals. This dataframe should contain the same columns as the dataframe used for pairs formation.

  • divergence – (float) Number of standard deviations used to open a position in a strategy. In the original example, 2 standard deviations were used.

Results output and plotting

The DistanceStrategy class contains multiple methods to get results in the desired form.

Functions that can be used to get data:

  • get_signals() outputs generated trading signals for each pair.

  • get_portfolios() outputs the value series of each pair portfolio.

  • get_scaling_parameters() outputs scaling parameters from the training dataset used to normalize data.

  • get_pairs() outputs a list of tuples, containing chosen top pairs in the pairs formation step.

  • get_num_crossing() outputs the chosen top pairs together with their numbers of zero-crossings.

Functions that can be used to plot data:

  • plot_pair() plots normalized price series for elements in a given pair and the corresponding trading signals for portfolio of these elements.

  • plot_portfolio() plots portfolio value for a given pair and the corresponding trading signals.

Implementation

DistanceStrategy.get_signals()

Outputs generated trading signals for pair portfolios.

Returns:

(pd.DataFrame) Dataframe with trading signals for each pair. Trading signal here is the target quantity of portfolios to hold.

DistanceStrategy.get_portfolios()

Outputs pair portfolios used to generate trading signals.

Returns:

(pd.DataFrame) Dataframe with portfolios for each pair.

DistanceStrategy.get_scaling_parameters()

Outputs minimum and maximum values used for normalizing each price series.

Formula used for normalization: Normalized = (Price - Min(Price)) / (Max(Price) - Min(Price))

Returns:

(pd.DataFrame) Dataframe with columns ‘min_value’ and ‘max_value’ for each element.

DistanceStrategy.get_pairs()

Outputs pairs that were created in the pairs formation step and sorted by the method.

Returns:

(list) List containing tuples of two strings, for names of elements in a pair.

DistanceStrategy.get_num_crossing()

Outputs pairs that were created in the pairs formation step with their numbers of zero crossings.

Returns:

(dict) Dictionary with keys as pairs and values as the number of zero crossings for pairs.

DistanceStrategy.plot_pair(num_pair)

Plots prices for a pair of elements and trading signals generated for their portfolio.

Parameters:

num_pair – (int) Number of the pair from the list to use for plotting.

Returns:

(plt.Figure) Figure with prices for pairs plot and trading signals plot.

DistanceStrategy.plot_portfolio(num_pair)

Plots a pair portfolio (difference between element prices) and trading signals generated for it.

Parameters:

num_pair – (int) Number of the pair from the list to use for plotting.

Returns:

(plt.Figure) Figure with portfolio plot and trading signals plot.

Examples

Code Example

# Importing packages
import pandas as pd
from arbitragelab.distance_approach.basic_distance_approach import DistanceStrategy

# Getting the dataframe with price time series for a set of assets
data = pd.read_csv('X_FILE_PATH.csv', index_col=0, parse_dates=[0])

# Dividing the dataset into two parts - the first one for pairs formation
data_pairs_formation = data.loc[:'2019-01-01']

# And the second one for signals generation
data_signals_generation = data.loc['2019-01-01':]

# Performing the pairs formation stage of the DistanceStrategy
# Choosing pairs 5-25 from top pairs to construct portfolios
strategy = DistanceStrategy()
strategy.form_pairs(data_pairs_formation, num_top=20, skip_top=5)

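# Adding an industry-based selection criterion to the DistanceStrategy
# (industry_dict is assumed to be defined beforehand as a dictionary
# matching each ticker to its industry group)
strategy_industry = DistanceStrategy()
strategy_industry.form_pairs(data_pairs_formation, industry_dict=industry_dict,
                             num_top=20, skip_top=5)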

# Using the number of zero-crossing for pair selection after industry-based selection
strategy_zero_crossing = DistanceStrategy()
strategy_zero_crossing.form_pairs(data_pairs_formation, method='zero_crossing',
                                  industry_dict=industry_dict, num_top=20, skip_top=5)

# Checking a list of pairs that were created
pairs = strategy.get_pairs()

# Checking a list of pairs with the number of zero crossings
num_crossing = strategy.get_num_crossing()

# Now generating signals for formed pairs, using (2 * st. deviation) as a threshold
# to enter a position
strategy.trade_pairs(data_signals_generation, divergence=2)

# Checking portfolio values for pairs and generated trading signals
portfolios = strategy.get_portfolios()
signals = strategy.get_signals()

# Plotting price series for elements in the second pair (counting from zero)
# and corresponding trading signals for the pair portfolio
figure = strategy.plot_pair(1)

Research Notebooks

The following research notebook can be used to better understand the distance approach described above.

References