arbitragelab.codependence.gnpr_distance
Implementation of distance using the Generic Non-Parametric Representation approach from “Some contributions to the clustering of financial time series and applications to credit default swaps” by Gautier Marti https://www.researchgate.net/publication/322714557
Module Contents
Functions
|
Calculates a statistical estimate of Spearman's rho - a copula-based dependence measure. |
|
Calculates the distance between two Gaussians under the Generic Parametric Representation (GPR) approach. |
|
Calculates the empirical distance between two random variables under the Generic Non-Parametric Representation |
- spearmans_rho(x: numpy.array, y: numpy.array) float
Calculates a statistical estimate of Spearman’s rho - a copula-based dependence measure.
Formula for calculation: rho = 1 - (6)/(T*(T^2-1)) * Sum((X_t-Y_t)^2)
It is more robust to noise and can be defined if the variables have an infinite second moment. This statistic is described in more detail in the work by Gautier Marti https://www.researchgate.net/publication/322714557 (p.54)
This method is a wrapper for the scipy spearmanr function. For more details about the function and its parameters, please visit scipy documentation https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.spearmanr.html
- Parameters:
x – (np.array/pd.Series) X vector
y – (np.array/pd.Series) Y vector (same number of observations as X)
- Returns:
(float) Spearman’s rho statistical estimate
- gpr_distance(x: numpy.array, y: numpy.array, theta: float) float
Calculates the distance between two Gaussians under the Generic Parametric Representation (GPR) approach.
According to the original work https://www.researchgate.net/publication/322714557 (p.70): “This is a fast and good proxy for distance d_theta when the first two moments … predominate”. But it’s not a good metric for heavy-tailed distributions.
Parameter theta defines what type of information dependency is being tested: - for theta = 0 the distribution information is tested - for theta = 1 the dependence information is tested - for theta = 0.5 a mix of both information types is tested
With theta in [0, 1] the distance lies in range [0, 1] and is a metric. (See original work for proof, p.71)
- Parameters:
x – (np.array/pd.Series) X vector.
y – (np.array/pd.Series) Y vector (same number of observations as X).
theta – (float) Type of information being tested. Falls in range [0, 1].
- Returns:
(float) Distance under GPR approach.
- gnpr_distance(x: numpy.array, y: numpy.array, theta: float, n_bins: int = 50) float
Calculates the empirical distance between two random variables under the Generic Non-Parametric Representation (GNPR) approach.
Formula for the distance is taken from https://www.researchgate.net/publication/322714557 (p.72).
Parameter theta defines what type of information dependency is being tested: - for theta = 0 the distribution information is tested - for theta = 1 the dependence information is tested - for theta = 0.5 a mix of both information types is tested
With theta in [0, 1] the distance lies in the range [0, 1] and is a metric. (See original work for proof, p.71)
This method is modified as it uses 1D Optimal Transport Distance to measure distribution distance. This solves the issue of defining support and choosing a number of bins. The number of bins can be given as an input to speed up calculations. Big numbers of bins can take a long time to calculate.
- Parameters:
x – (np.array/pd.Series) X vector.
y – (np.array/pd.Series) Y vector (same number of observations as X).
theta – (float) Type of information being tested. Falls in range [0, 1].
n_bins – (int) Number of bins to use to split the X and Y vector observations. (100 by default)
- Returns:
(float) Distance under GNPR approach.