Vine Copula Partner Selection



This module implements the four partner selection approaches described in Section 3.1.1 of Statistical Arbitrage with Vine Copulas.

In this paper, Stübinger, Mangold and Krauss developed a multivariate statistical arbitrage strategy based on vine copulas - a highly flexible instrument for linear and nonlinear multivariate dependence modeling. Pairs trading is a relative-value arbitrage strategy, where an investor seeks to profit from mean-reversion properties of the price spread between two co-moving securities. The existing literature has focused on using bivariate copulas to model the dependence structure between two stock return time series, and to identify mispricings that can potentially be exploited in a pairs trading application.

This paper proposes a multivariate copula-based statistical arbitrage framework. Specifically, for each stock in the S&P 500 database, the three most suitable partners are selected using different selection criteria. Then, the multivariate copula models are benchmarked to capture the dependence structure of the selected quadruples. Finally, the paper focuses on the generation of trading signals and backtesting.

Introduction

This module will focus on the various Partner Selection procedures and their implementations, as described in the paper. For every stock in the S&P 500, a partner triple is identified based on adequate measures of association. The following four partner selection approaches are implemented:

  • Traditional Approach - baseline approach where the high dimensional relation between the four stocks is approximated by their pairwise bivariate correlations via Spearman’s \(\rho\);

  • Extended Approach - calculating the multivariate version of Spearman’s \(\rho\) based on Schmid and Schmidt (2007);

  • Geometric Approach - involves calculating the sum of Euclidean distances from the 4-dimensional hyper-diagonal;

  • Extremal Approach - involves calculating a non-parametric \(\chi^2\) test statistic based on Mangold (2015) to measure the degree of deviation from independence.

Firstly, all measures of association are calculated using the ranks of the daily discrete returns of our samples. The rank transformation provides robustness against outliers.

Secondly, only the top 50 most highly correlated stocks are taken into consideration for each target stock, to reduce the computational burden.
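The two preprocessing steps above can be sketched in pandas as follows. This is a minimal illustration rather than the module's internal code: the function name `preprocess`, the synthetic `demo_prices` panel, and the choice to shortlist candidates by their Spearman correlation with the target are assumptions made for the example.

```python
import numpy as np
import pandas as pd


def preprocess(prices: pd.DataFrame, n: int = 50):
    """Return ranked returns and, per target, its n most correlated stocks.

    Hypothetical helper for illustration only.
    """
    returns = prices.pct_change().dropna()   # daily discrete returns
    ranked = returns.rank()                  # rank transform, column by column
    corr = ranked.corr()                     # Pearson on ranks equals Spearman's rho
    candidates = {}
    for target in corr.columns:
        others = corr[target].drop(target)   # a stock is not its own partner
        candidates[target] = others.nlargest(n).index.tolist()
    return ranked, candidates


# Synthetic price panel (random walks), used only to exercise the helper.
rng = np.random.default_rng(0)
demo_prices = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=(250, 6)), axis=0)),
    columns=list("ABCDEF"),
)
ranked_returns, top_candidates = preprocess(demo_prices, n=3)
```

Restricting each target to a shortlist keeps the later search over partner triples manageable: with 50 candidates there are 19,600 triples per target instead of the roughly 20.7 million triples available from 499 stocks.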

The traditional, the extended, and the geometric approaches share a common feature - they measure the deviation from linearity in ranks. All three aim at finding the quadruple that behaves as linearly as possible to ensure that there is an actual relation between its components to model. While it is true that this aspiration for linearity excludes quadruples with components that are not connected (say, independent), it also rules out nonlinear dependencies in ranks. On the other hand, the extremal approach tries to maximize the distance to independence with a focus on the joint extreme observations.

Note

Out of the four approaches, only the extremal approach takes into consideration both linear and non-linear dependencies. This results in a better preselection and thus better results compared to the other routines.

The extremal approach is therefore generally preferred and should be considered the default for partner selection.

Traditional Approach

As a baseline approach, the high-dimensional relation between the four stocks is approximated by their pairwise bivariate correlations via Spearman’s \(\rho\). This approach uses ranked returns data. In addition to the robustness gained from the rank transformation, this makes it possible to capture non-linearities in the data to a certain degree.

The procedure is as follows:

  • Calculate the sum of all pairwise correlations for every possible quadruple that contains a fixed target stock.

  • The quadruple with the largest sum of pairwise correlations is considered the final quadruple and saved to the output matrix.
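The procedure above can be illustrated with a brute-force search over partner triples. This is a sketch under stated assumptions, not the library's implementation: `traditional_quadruple` and the synthetic data are hypothetical, and a vectorised search would be preferable for 50 candidates.

```python
from itertools import combinations

import numpy as np
import pandas as pd


def traditional_quadruple(returns: pd.DataFrame, target: str, candidates: list) -> list:
    """Pick the quadruple (target + 3 partners) with the largest sum of
    pairwise Spearman correlations. Hypothetical helper for illustration."""
    corr = returns.corr(method="spearman")
    upper = np.triu_indices(4, k=1)          # the 6 distinct pairs in a quadruple
    best_sum, best_quad = -np.inf, None
    for partners in combinations(candidates, 3):
        quad = [target, *partners]
        pair_sum = corr.loc[quad, quad].to_numpy()[upper].sum()
        if pair_sum > best_sum:
            best_sum, best_quad = pair_sum, quad
    return best_quad


# Synthetic returns: A-D share a common factor, E and F are pure noise.
rng = np.random.default_rng(1)
common = rng.normal(size=500)
demo_returns = pd.DataFrame(
    {name: common + 0.1 * rng.normal(size=500) for name in "ABCD"}
)
demo_returns["E"] = rng.normal(size=500)
demo_returns["F"] = rng.normal(size=500)

best = traditional_quadruple(demo_returns, "A", ["B", "C", "D", "E", "F"])
```

In this toy data set the search should favour the stocks that load on the common factor, since any triple containing a noise series gives up three of the six high pairwise correlations.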

Implementation

Note

This approach takes around 25 ms to run for each target stock.

This module implements Copula-based Statistical Arbitrage tools.

class PartnerSelection(prices: DataFrame, n: int = 50)

Implementation of the Partner Selection procedures proposed in Section 3.1.1 in the following paper.

Three partner stocks are selected for a target stock based on four different approaches, namely the Traditional approach, the Extended approach, the Geometric approach and the Extremal approach.

In this module, target stock implies the ticker for which a unique combination of stocks is returned. The stocks present in this unique combination are called partner stocks.

Stübinger, J., Mangold, B. and Krauss, C., 2018. Statistical arbitrage with vine copulas. Quantitative Finance, 18(11), pp.1831-1849.

__init__(prices: DataFrame, n: int = 50)

Inputs the price series required for further calculations.

It also includes the preprocessing steps described in the paper, which run before the Partner Selection procedures start. These steps include finding the returns and ranked returns of the stocks, and calculating the top n correlated stocks for each stock in the universe.

Parameters:
  • prices – (pd.DataFrame) Contains price series of all stocks in the universe.

  • n – (int) For each target stock, the number of most correlated stocks, out of all stocks in the universe, that are considered as candidate partners for the final combination.

PartnerSelection.traditional(n_targets: int = 5) → list

This method implements the first procedure described in Section 3.1.1.

For all possible quadruples of a given stock, we calculate the sum of all pairwise correlations. For every target stock the quadruple with the highest sum is returned.

Parameters:

n_targets – (int) Number of target stocks to select.

Returns:

(list) List of all selected quadruples.

Extended Approach

Schmid and Schmidt (2007) introduce multivariate rank-based measures of association. Their paper generalizes Spearman’s \(\rho\) to arbitrary dimensions - a natural extension of the traditional approach.

In contrast to the strictly bivariate case, this extended approach, like the two approaches that follow, directly reflects multivariate dependence instead of approximating it by pairwise measures only. This provides more precise modeling of high-dimensional association and can thus improve performance in trading strategies.

The procedure is as follows:

  • Calculate the multivariate version of Spearman’s \(\rho\) for all possible quadruples, consisting of a fixed target stock.

  • The quadruple with the largest value is considered the final quadruple and saved to the output matrix.

\(d\) denotes the number of stocks, whose daily returns are observed from day \(1\) to day \(n\). \(X_i\) denotes the \(i\)-th stock’s return.

  1. We calculate the empirical cumulative distribution function (ECDF) \(\hat{F}_i\) for stock \(i\).

  2. Calculate quantile data for each \(X_{i}\), by

\[\hat{U}_i = \frac{1}{n} (\text{rank of} \ X_i) = \hat{F}_i(X_i)\]

The formulas for the three estimators are given below, as in the paper.

\[ \begin{align}\begin{aligned}\hat{\rho}_1 = h(d) \times \Bigg\{-1 + \frac{2^d}{n} \sum_{j=1}^n \prod_{i=1}^d (1 - \hat{U}_{ij}) \Bigg\}\\\hat{\rho}_2 = h(d) \times \Bigg\{-1 + \frac{2^d}{n} \sum_{j=1}^n \prod_{i=1}^d \hat{U}_{ij} \Bigg\}\\\hat{\rho}_3 = -3 + \frac{12}{n {d \choose 2}} \times \sum_{k<l} \sum_{j=1}^n (1-\hat{U}_{kj})(1-\hat{U}_{lj})\end{aligned}\end{align} \]

Where:

\[h(d) = \frac{d+1}{2^d - d -1}\]

We use the mean of the above three estimators as the final measure used to return the final quadruple.
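The three estimators and their mean can be written compactly in NumPy. The sketch below follows the formulas above term by term; the helper names `pseudo_observations` and `multivariate_spearman` and the synthetic samples are assumptions for illustration, and ties in the data are not handled.

```python
from itertools import combinations
from math import comb

import numpy as np


def pseudo_observations(x: np.ndarray) -> np.ndarray:
    """U_ij = rank of x_ij within column i, divided by n (assumes no ties)."""
    ranks = x.argsort(axis=0).argsort(axis=0) + 1
    return ranks / x.shape[0]


def multivariate_spearman(u: np.ndarray) -> float:
    """Mean of the three estimators for an (n, d) array of pseudo-observations."""
    n, d = u.shape
    h = (d + 1) / (2**d - d - 1)
    rho1 = h * (-1 + 2**d / n * np.prod(1 - u, axis=1).sum())
    rho2 = h * (-1 + 2**d / n * np.prod(u, axis=1).sum())
    pair_sum = sum(
        ((1 - u[:, k]) * (1 - u[:, l])).sum() for k, l in combinations(range(d), 2)
    )
    rho3 = -3 + 12 / (n * comb(d, 2)) * pair_sum
    return float((rho1 + rho2 + rho3) / 3)


rng = np.random.default_rng(2)
x = rng.normal(size=1000)
# Four strictly increasing transforms of one series: identical ranks.
comonotone = np.column_stack([x, 2 * x + 1, x**3, np.exp(x)])
independent = rng.normal(size=(1000, 4))

rho_dep = multivariate_spearman(pseudo_observations(comonotone))
rho_ind = multivariate_spearman(pseudo_observations(independent))
```

Because the measure depends on the data only through ranks, the four monotone transforms of the same series should score close to 1, while the independent columns should score close to 0.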

Implementation

Note

This approach takes around 500 ms to run for each target stock.

PartnerSelection.extended(n_targets: int = 5) → list

This method implements the second procedure described in Section 3.1.1.

It involves calculating the multivariate version of Spearman’s correlation for all possible quadruples of a given stock. The final measure taken into consideration is the mean of the three versions of Spearman’s rho given in Schmid and Schmidt (2007). For every target stock the quadruple with the highest calculated measure is returned.

Parameters:

n_targets – (int) Number of target stocks to select.

Returns:

(list) List of all selected quadruples.

Geometric Approach

This approach tries to measure the geometric relation between the stocks in the quadruple.

Consider the relative ranks of a bivariate random sample, where every observation takes on values in the \([0,1] \times [0,1]\) square. If there exists a perfect linear relation between the ranks of the two components of the sample, a plot of the relative ranks would result in a perfect line of dots between the points (0,0) and (1,1) – the diagonal line. However, if this relation is not perfectly linear, at least one point differs from the diagonal. The sum of the Euclidean distances of all ranks from the diagonal can be used as a measure of deviation from linearity, the diagonal measure.

Hence, we try to find the quadruple \(Q\) that leads to the minimal value of the sum of these Euclidean distances.

The procedure is as follows:

  • Calculate the four dimensional diagonal measure for all possible quadruples, consisting of a fixed target stock.

  • The quadruple with the smallest diagonal measure is considered the final quadruple and saved to the output matrix.

The diagonal measure in four dimensional space is calculated using the following equation,

\[\sum_{j=1}^{n} \left| (P^{(j)} - P_{1}) - \frac{(P^{(j)} - P_{1}) \cdot (P_{2} - P_{1})}{| P_{2} -P_{1} |^{2}} (P_{2} - P_{1}) \right|\]

where,

\[P_{1} = (0,0,0,0)\]
\[P_{2} = (1,1,1,1)\]

are points on the hyper-diagonal, and

\[P^{(j)} = (u_{1j},u_{2j},u_{3j},u_{4j})\]

where \(u_{ij}\) represents the relative rank of the return of stock \(i\) in the quadruple on day \(j\), and the sum runs over the \(n\) observed days.
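As a sketch of the diagonal measure, the projection onto the hyper-diagonal can be computed for all observations at once. With \(P_1\) at the origin and \(P_2\) the all-ones vector, the projection coefficient is simply the mean of the coordinates of each point. The function name `diagonal_measure` and the sample points are assumptions for illustration.

```python
import numpy as np


def diagonal_measure(u: np.ndarray) -> float:
    """Sum of Euclidean distances of the rows of u from the hyper-diagonal.

    u is an (n, d) array of relative ranks in [0, 1]; d = 4 for quadruples.
    """
    d = u.shape[1]
    direction = np.ones(d)                     # P2 - P1, with P1 = 0 and P2 = 1
    coef = u @ direction / d                   # (P . (P2 - P1)) / |P2 - P1|^2
    residual = u - np.outer(coef, direction)   # component orthogonal to the diagonal
    return float(np.linalg.norm(residual, axis=1).sum())


on_diagonal = np.array([[0.2, 0.2, 0.2, 0.2], [0.9, 0.9, 0.9, 0.9]])
corner = np.array([[1.0, 0.0, 0.0, 0.0]])

measure_on_diagonal = diagonal_measure(on_diagonal)
measure_corner = diagonal_measure(corner)
```

Points lying on the hyper-diagonal contribute nothing to the measure. For the corner point (1, 0, 0, 0) the residual is (0.75, -0.25, -0.25, -0.25), whose length is \(\sqrt{0.75}\).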

Implementation

Note

This approach takes around 180 ms to run for each target stock.

PartnerSelection.geometric(n_targets: int = 5) → list

This method implements the third procedure described in Section 3.1.1.

It involves calculating the four-dimensional diagonal measure for all possible quadruples of a given stock. As a visual example in 2D, given a Quantile-Quantile plot of the data, this measure is just the sum of the Euclidean distances from all data points to the y=x line (the diagonal). For every target stock the quadruple with the lowest diagonal measure is returned.

Parameters:

n_targets – (int) Number of target stocks to select.

Returns:

(list) List of all selected quadruples.

Extremal Approach

Mangold (2015) proposes a nonparametric test for multivariate independence. The resulting \(\chi^2\) test statistic can be used to measure the degree of deviation from independence, that is, the degree of dependence. The value of the measure increases with the occurrence of an abnormal number of joint extreme events.

The procedure is as follows:

  • Calculate the \(\chi^2\) test statistic for all possible quadruples, consisting of a fixed target stock.

  • The quadruple with the largest test statistic is considered the final quadruple and saved to the output matrix.

Given below are the steps to calculate the \(\chi^2\) test statistic described in Proposition 3.3 of Mangold (2015):

Note

These steps assume a 4-dimensional input.

  1. Analytically calculate the 4-dimensional Nelsen copula from Definition 2.4 in Mangold (2015):

\[ \begin{align}\begin{aligned}C_{\theta}(u_{1}, u_{2}, u_{3}, u_{4}) = u_1u_2u_3u_4 \times \Big(1 + (1- u_{1})(1- u_{2})(1- u_{3})(1- u_{4}) \times \big(\\
\theta_{1} (1- u_{1})(1- u_{2})(1- u_{3})(1- u_{4}) + \theta_{2} (1- u_{1})(1- u_{2})(1- u_{3})u_{4} +\\
\theta_{3} (1- u_{1})(1- u_{2})u_{3}(1- u_{4}) + \theta_{4} (1- u_{1})(1- u_{2})u_{3}u_{4} +\\
\theta_{5} (1- u_{1})u_{2}(1- u_{3})(1- u_{4}) + \theta_{6} (1- u_{1})u_{2}(1- u_{3})u_{4} +\\
\theta_{7} (1- u_{1})u_{2}u_{3}(1- u_{4}) + \theta_{8} (1- u_{1})u_{2}u_{3}u_{4} +\\
\theta_{9} u_{1}(1- u_{2})(1- u_{3})(1- u_{4}) + \theta_{10} u_{1}(1- u_{2})(1- u_{3})u_{4} +\\
\theta_{11} u_{1}(1- u_{2})u_{3}(1- u_{4}) + \theta_{12} u_{1}(1- u_{2})u_{3}u_{4} +\\
\theta_{13} u_{1}u_{2}(1- u_{3})(1- u_{4}) + \theta_{14} u_{1}u_{2}(1- u_{3})u_{4} +\\
\theta_{15} u_{1}u_{2}u_{3}(1- u_{4}) + \theta_{16} u_{1}u_{2}u_{3}u_{4} \big)\Big)\end{aligned}\end{align} \]
  2. Analytically calculate the corresponding density function of the 4-dimensional copula:

\[c_{\theta}(u_{1}, u_{2}, u_{3}, u_{4}) = \frac{\partial^{4}}{\partial u_{1} \partial u_{2} \partial u_{3}\partial u_{4}}C_{\theta}(u_{1}, u_{2}, u_{3}, u_{4})\]

Note

To calculate the density function, we can notice a pattern in the copula equation. Once the common factors are multiplied out, each variable \(u_{i}\) enters the term belonging to a given \(\theta_{k}\) in one of two forms, either

\[u_{i}(1- u_{i})^2 \quad \text{or} \quad u_{i}^2(1 - u_{i})\]

and the corresponding partial derivatives of these two forms with respect to \(u_{i}\) are,

\[(u_{i} - 1)(3u_{i} - 1) \quad \text{or} \quad u_{i}(2 - 3u_{i})\]

This observation simplifies the analytical calculation of the density function.

  3. Calculate the partial derivative of the above density function with respect to \(\theta\).

\[\dot{c_{\theta}} = \frac{\partial c_{\theta}(u_{1}, u_{2}, u_{3}, u_{4})}{\partial \theta}\]
  4. Calculate the test statistic for the p-dimensional rank test:

\[T=n \boldsymbol{T}_{p, n}^{\prime} \Sigma\left(\dot{c}_{\theta_{0}}\right)^{-1} \boldsymbol{T}_{p, n} \stackrel{a}{\sim} \chi^{2}(q)\]

where,

\[\boldsymbol{T}_{p, n}=\mathbb{E}\left[\left.\frac{\partial}{\partial \theta} \log c_{\theta}(B)\right|_{\theta=\theta_{0}}\right]\]
\[\Sigma\left(\dot{c}_{\theta_{0}}\right)_{i, j}=\int_{[0,1]^{p}}\left(\left.\frac{\partial c_{\theta}(\boldsymbol{u})} {\partial \theta_{i}}\right| _{\boldsymbol{\theta}=\boldsymbol{0}}\right) \times\left(\left.\frac{\partial c_{\theta}(\boldsymbol{u})} {\partial \theta_{j}}\right |_{\boldsymbol{\theta}=\boldsymbol{0}}\right) \mathrm{d} \boldsymbol{u}\]
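The steps above can be sketched numerically under the assumption that the null hypothesis is independence, \(\theta_0 = 0\). In that case \(c_{\theta_0} = 1\), so the derivative of \(\log c_\theta\) equals the derivative of \(c_\theta\), and each of the 16 derivatives is a product of the one-dimensional forms \((u_i - 1)(3u_i - 1)\) and \(u_i(2 - 3u_i)\). The integral for \(\Sigma\) then factorises into one-dimensional integrals with values \(2/15\) (same form) and \(1/30\) (mixed forms), so \(\Sigma\) is a four-fold Kronecker product of a 2x2 block. The function names and the synthetic samples below are hypothetical; this is an illustration, not the library's code, which may compute the covariance matrix differently.

```python
import itertools

import numpy as np


def score_terms(u: np.ndarray) -> np.ndarray:
    """(n, 2**p) array: d c_theta / d theta_k at theta = 0 for each observation."""
    a = (u - 1.0) * (3.0 * u - 1.0)   # derivative of u (1 - u)^2
    b = u * (2.0 - 3.0 * u)           # derivative of u^2 (1 - u)
    cols = []
    # Pattern bit 0 means a (1 - u_i) factor, bit 1 means a u_i factor,
    # matching the ordering theta_1, ..., theta_16 in the copula above.
    for pattern in itertools.product([0, 1], repeat=u.shape[1]):
        factors = [b[:, i] if bit else a[:, i] for i, bit in enumerate(pattern)]
        cols.append(np.prod(factors, axis=0))
    return np.column_stack(cols)


def extremal_statistic(u: np.ndarray) -> float:
    """T = n * T' Sigma^{-1} T for pseudo-observations u of shape (n, p)."""
    n, p = u.shape
    t_vec = score_terms(u).mean(axis=0)            # sample analogue of T_{p,n}
    block = np.array([[2 / 15, 1 / 30], [1 / 30, 2 / 15]])
    sigma = np.array([[1.0]])
    for _ in range(p):                             # Sigma factorises per dimension
        sigma = np.kron(sigma, block)
    return float(n * t_vec @ np.linalg.solve(sigma, t_vec))


rng = np.random.default_rng(3)
u_independent = rng.uniform(size=(500, 4))
shared = rng.uniform(size=500)
u_comonotone = np.column_stack([shared] * 4)       # joint extremes on every day

t_independent = extremal_statistic(u_independent)
t_comonotone = extremal_statistic(u_comonotone)
```

Under independence the statistic is asymptotically \(\chi^2\) distributed with 16 degrees of freedom for \(p = 4\), whereas the perfectly dependent sample should produce a far larger value.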

Implementation

Note

This approach is computationally far heavier than the other approaches and takes around 15 s to run for each target stock.

Please be aware that this method has a large start-up overhead from calculating the covariance matrix. This should take around 1 to 2 minutes when d = 4, which is the default. Increasing the value of d will increase the processing time significantly.

PartnerSelection.extremal(n_targets: int = 5, d: int = 4) → list

This method implements the fourth procedure described in Section 3.1.1.

It involves calculating a non-parametric test statistic based on Mangold (2015) to measure the degree of deviation from independence. Main focus of this measure is the occurrence of joint extreme events.

Parameters:
  • n_targets – (int) Number of target stocks to select.

  • d – (int) Number of stocks in each combination (partner stocks plus the target stock).

Returns:

(list) List of all selected combinations.

Code Example

# Importing the module and other libraries
from arbitragelab.copula_approach.vine_copula_partner_selection import PartnerSelection
import pandas as pd

# Importing DataFrame of daily pricing data for all stocks in the S&P 500 (at least 12 months of data).
# DATA_PATH must be set to the location of your own CSV file.
df = pd.read_csv(DATA_PATH, parse_dates=True, index_col='Date').dropna()

# Instantiating the partner selection module.
ps = PartnerSelection(df)

# Calculating final quadruples using traditional approach for first 20 target stocks.
Q = ps.traditional(20)
print(Q)
# Plotting the final quadruples.
ps.plot_selected_pairs(Q)

# Calculating final quadruples using extended approach for first 20 target stocks.
Q = ps.extended(20)
print(Q)
# Plotting the final quadruples.
ps.plot_selected_pairs(Q)

# Calculating final quadruples using geometric approach for first 20 target stocks.
Q = ps.geometric(20)
print(Q)
# Plotting the final quadruples.
ps.plot_selected_pairs(Q)

# Calculating final quadruples using extremal approach for first 20 target stocks.
Q = ps.extremal(20)
print(Q)
# Plotting the final quadruples.
ps.plot_selected_pairs(Q)

Research Notebooks

The following research notebook can be used to better understand the partner selection approaches described above.

References