Note
This document is an implementation of:
Stübinger, J., Mangold, B. and Krauss, C. Statistical arbitrage with vine copulas. Quantitative Finance, 2018.
C-vine Copula Strategy
With the power of vine copula, we are able to model the relation among several random variables. Speficifally we aim to trade based on the information generated from the vine copula model. Similar to traditional bivariate approaches, we use the conditional (cumulative) probability to gauge whether the target stock is underpriced or overpriced against other stocks, and generate trading signal based on them from a mean-reversion bet.
We aim to cover the following topics:
Idea and typical workflow of the C-vine copula approach
Strategy details
Comments for this approach
Overview of the C-vine Stategy
At first, vine copula is a very flexible model for dealing with multi-variate dependencies. It decomposes joint probability density into bivariate copula densities and marginal densities in a tree structure. R-vine is the most generic structure that is shared by all vine copulas, and in application people use C-vine and D-vine more often as special cases of a generic R-vine.
Note
With \(N\) random variables, there are \(N(N − 1)(N − 2)! 2^{(N − 2)(N − 3)/2}/2\) many possible structures for R-vines. Therefore numerically fitting an R-vine structure directly from data becomes a burden, and even if such a method exists and can be calculated in a reasonable time, it is likely subject to overfit using financial market data. In contrast, for a C-vine there are “only” \(N!\) possible structures, and it has much better interpretability.
Since C-vine has the star-like structure at each level of the tree, it is ideal for modeling dependencies where one specific random variable is key. Hence, although we are squeezing statistical arbitrage from an \(N\) stocks cohort, we are using the “1-vs-the rest” type of framework. The D-vine or more general R-vine strategy handling mispricings all together, that dynamically trades every stock based on the mispricing info, to the best of our knowlegde, is not yet developed, at least not available in literature.
We have the workflow as follows:
Get data
Figure out the C-vine structure
Calculate probability density
Calculate the conditional probability
Generate signals and trade
We will dive into the details for each part of the workflow.
C-vine Copula Strategy Workflow
For ease of demontration, assuming we are trading 4 stocks.
Step 1: Get data
We work with stocks’ daily returns data exclusively.
Note
Selecting trading candidates is a serious topic, and can largely determine if a strategy is profitable or not. There are 4 methods proposed in [Stübinger et al. 2018]:
pairwise Spearman’s rho,
multi-dimensional Spearman’s rho,
sum of distance in quantile plots to hyper-diagonal
copula-chi-square test for dependence.
Loosely speaking, the goal is to find stocks cohorts that are heavily dependent, such that a mean-reversion bet on relative mispricings is profitable. For more details in implementation, please refer to Vine Copula Partner Selection.
Suppose we have chosen a few cohorts and in each cohort it has a key stock element. We then need to turn the translate the stocks’ returns data into its quantiles (pseudo-observations) using empirical CDFs (ECDFs). We denote the pseudo-observations as \(u_i\)’s, and they are all uniform in \([0, 1]\).
Step 2: Figure out the C-vine Structure
To start with, there are \(4! = 24\) many possible C-vine structures. Without loss of generality assuming the stock 1 is our target stock. In the end, we aim to come up with this conditional density that indicates relative mispricing for stock 1 against 3 other stocks in this cohort:
If \(h(u_1 | u_2, u_3, u_4) < 0.5\) then that day’s return for stock 1 is considered lower than the mean of what the history tells and vice versa, similar to the “traditional” copula approaches with Mispricing Index.
In [Stübinger et al. 2018], the authors claimed that in order to calculate \(h(u_1 | u_2, u_3, u_4)\), stock 1 must never be at the center of every level of the tree. Remember that C-vine can be fully characterized by an ordered tuple \((c_4, c_3, c_2, c_1)\) where \(c_1\) is the center for the level-1 tree, \(c_2\) for level-2 and so on. Therefore, to make stock 1 never at the center (except at the tree root), it is equivalent to check each possibility generated by \((1, c_3, c_2, c_1)\), where \((c_3, c_2, c_1)\) is a permutation of \(\{2, 3, 4\}\). Obviously there are \(3! = 6\) many ways available.
Note
Disclaimer: I do not fully agree with the authors’ argument here, because all vine copula allows one to calculate the joint probability density \(f(u_1, u_2, u_3, u_4)\), for any given quantities \((u_1, u_2, u_3, u_4)\) in order to calculate \(h\) for any target stock, simply integrate along that marginal variable and then normalize by a full integration in that direction. For example:
Moreover, I think if we treat stock 1 as the key stock, it should be at the center of every level of the tree, because it makes intuitive sense to model all other stocks’ and their bivariate copula densities relation given the stock 1’s information. Stock 1 is the object of interest, and therefore should be the governing quantity here. C-vine structure intrinsically orders the marginal variables by their importance of interdependencies from its ordered tuple representation, and the target stock should be the most important (therefore at the end of the tuple). It is strange to argue that the least important stock in terms of interdependencies should become the target stock. I.e., 1 should be put at the end of the tuple, not at the beginning, and there are still 6 many possible structures for 4 stocks.
Currently we provide the implementation in [Stübinger et al. 2018] and the alternative method that puts the target stock at the center.
After chosen the 6 possible vine structures, we then fit everyone of them and calculate the associated AIC value. We choose the final C-vine structure among candidates with the lowest AIC value. It is equivalent in this case to directly comparing the log-likelihood and choose the one that constitutes the largest log-likelihood value. The exact fitting includes figuring out what type of (parametric) bivariate copula to use for every node, and the parameter(s) value that fits best the data.
Step 3: Probability Density
We aim to calculate \(f(u_1, u_2, u_3, u_4)\). This is straightforward once we fit the C-vine to training data. Now we are working on the trading priod data. At first we should map them into quantiles using the ECDFs trained in the training period. Then we can calculate directly the probability density for pseudo-observations, say \((u_1, u_2, u_3, u_4)\), by calculating every node at every level of the tree. Note that each node constitues a probability density, either marginal density (top of the tree) or the copula density (not top of the tree). And the final probability density is their product.
Step 4: Conditional Probability
We aim to calculate \(h(u_1| u_2, u_3, u_4)\), given that stock 1 is the target stock. Similarly we can compute for other stocks if they are the target. Here we use numerical integration for this value:
Keep in mind that this value is model dependent: it depends on which vine structure we are using, and the types of bivariate copulas and their parameters in each node. Some people denote the conditional probability as \(h_C\) to indicate that it depends on a copula.
Note
Here we computed \(h\) from the “bottom up” by marginal integration. The authors suggested computing from “top down” by taking partial differentiations from the copula definition \(C(u_1, u_2, u_3, u_4)\) (cumulative density):
I do not agree with the author’s approach for the following reason: Mathematically speaking it is the same. However vine copula only allows one to compute the joint density \(f\), and unlike “traditional” copula models where \(C\) is defined by definition. For vine copula \(C\) is found by Monte-Carlo integrations from \(f\). Also even if \(C\) can be found analytically, taking 3 numerical partial differentiations will likely yield more issues compared to numerically integrating just along 1 marginal variable.
Step 5: Generate Signals and Trade
For simplificaition say we have 2 cohorts after the stocks selection, each cohort has a target stock. And we have fitted 2 C-vine copulas respectively for the 2 cohorts with their own target stock. Without loss of generality, let us fix stock 1 for each cohort as its target stock. For conditional probability \(h\):
If \(h > 0.5\), stock 1’s return that day is higher than the historical average compared to other 3 stocks in the cohort.
If \(h < 0.5\), stock 1’s return that day is higher than the historical average compared to other 3 stocks in the cohort.
Therefore, we adopt the cumulative mispricing index framework. For each cohort we calculate the de-meaned cumulative sum of \(h\) as \(CMPI\), and formulate the trading signal using a Bollinger band: Denote the running average of \(CMPI\) in the fixed-length, moving time window as \(\hat{\mu}(t)\), the running standard deviation in the time window as \(\hat{\sigma}(t)\), and some positive constant \(k\) to control the Bollinger band’s width.
Short signal: When \(CMPI > \hat{\mu}(t) + k \hat{\sigma}(t)\).
Long signal: When \(CMPI < \hat{\mu}(t) - k \hat{\sigma}(t)\).
Exit signal: When \(CMPI\) crosses with \(\hat{\mu}(t)\).
Do nothing: Else.
Now we total net positions for each key stock in each cohort. Then we formulate our dollar-neutral strategy by trading against a cheap broad-based market index such as SPY, similar to the method used in [Avellaneda and Lee, 2010].
Note
It is totally possible that different cohorts can share the same key stock. The cohort and key stock information is generated from the stocks cohort selection methods.
It is no surprise that \(CMPI\) looks reasonably similar to cumulative log-returns of the target stock. This can be used as a sanity check on whether our vine copula model is off.
Implementation
Example
# Importing the module and other libraries
import pandas as pd
import numpy as np
from arbitragelab.copula_approach.vinecop_generate import CVineCop
from arbitragelab.copula_approach.vinecop_strategy import CVineCopStrat
from arbitragelab.copula_approach.copula_calculation import to_quantile
# Loading stocks data. Use 1 cohort for example: suppose stocks 'AAPL', 'MSFT', 'AMZN', 'FB'
# form the cohort, and 'AAPL' is the target stock.
sp500_returns = pd.read_csv('all_sp500_returns.csv', index_col='Dates', parse_dates=True)
returns_train = sp500_returns[['AAPL', 'MSFT', 'AMZN', 'FB']][:800]
returns_test = sp500_returns[['AAPL', 'MSFT', 'AMZN', 'FB']][800:1100]
prices_aapl_spy = pd.read_csv('all_sp500_prices.csv',
index_col='Dates', parse_dates=True)[['AAPL', 'SPY']][800:1100]
rts_train_quantiles, cdfs = to_quantile(returns_train)
rts_test_quantiles, _ = to_quantile(returns_train)
# Instantiate a CVineCop (C-vine copula) class to fit
cvinecop = CVineCop()
# Fit C-vine automatically, 'AAPL' is the target stock
# Note that pv_target_idx is indexed from 1
vinecop_structure = cvinecop.fit_auto(data=rts_train_quantiles, pv_target_idx=1, if_renew=True)
# Print the vine copula structure as a sanity check. You can directly fit internally by just
# using cvinecop.fit_auto(data=rts_train_quantiles, pv_target_idx=1, if_renew=True) if you
# do not want to print structures.
print(vinecop_structure)
# Instantiate a CVineCopStrat (trading) class from the fitted C-vine copula
cvstrat = CVineCopStrat(cvinecop)
# Generate positions over the test data, return the Bollinger band for sanity check
# Note that the cdfs should be from the training set.
positions, bband = cvstrat.get_positions_bollinger(returns=returns_test, cdfs=cdfs,
if_return_bollinger_band=True)
positions = positions.shift(1) # Avoid look-back bias
# Formulate units from the positions against the SPY index, with 100,000 dollar investment
units = cvstrat.positions_to_units_against_index(target_stock_prices=prices_aapl_spy['AAPL'],
index_prices=prices_aapl_spy['SPY'],
positions=positions,
multiplier=100000)
Research Notebooks
The following research notebook can be used to better understand the C-Vine copula strategy.
Comments
Vine copula provides a very flexible approach in modeling multi-variate dependencies. The C-vine structure specifically highlights a dominant component at every level of the tree, ideal for our “1-vs-the rest” trading strategy for capturing statistical arbitrage among multiple stocks, which non-quant strategies often omit or are still primative.
As promising as it looks, just like any other methods it inevitably bears the some drawbacks: