`arbitragelab.ml_approach.optics_dbscan_pairs_clustering`

This module implements the ML-based pairs selection framework described by Simão Moraes Sarmento and Nuno Horta in “A Machine Learning based Pairs Trading Investment Strategy”.

Module Contents

Classes

OPTICSDBSCANPairsClustering

Implementation of the pairs selection framework proposed in “A Machine Learning based Pairs Trading Investment Strategy”.

class OPTICSDBSCANPairsClustering(universe: pandas.DataFrame)

Implementation of the pairs selection framework proposed in the paper “A Machine Learning based Pairs Trading Investment Strategy”.

The method consists of two parts: dimensionality reduction, followed by clustering of the resulting features.

dimensionality_reduction_by_components(num_features: int = 10)

Converts the price universe supplied in the constructor into returns and scales them.

The scaled returns are then reduced with PCA to the number of dimensions used as the feature vector in the clustering step. The feature vector should have fewer than 15 dimensions.

Parameters: num_features – (int) Number of PCA components (n_components) to keep in the feature vector.
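The steps above can be sketched with NumPy alone. This is an illustration under stated assumptions, not the library's code: the helper name `pca_feature_vector` is hypothetical, simple returns and z-score scaling are assumed, and each asset is assumed to be represented by its loadings on the leading principal components.

```python
import numpy as np


def pca_feature_vector(prices: np.ndarray, num_features: int = 10) -> np.ndarray:
    """Hypothetical helper: (n_periods, n_assets) prices -> (n_assets, num_features)."""
    # Simple returns between consecutive periods (assumed return definition).
    returns = prices[1:] / prices[:-1] - 1.0
    # Standardize each asset's return series to zero mean and unit variance.
    scaled = (returns - returns.mean(axis=0)) / returns.std(axis=0)
    # PCA via SVD: the rows of vt are the principal axes in asset space.
    _, _, vt = np.linalg.svd(scaled, full_matrices=False)
    # Each asset is described by its loadings on the leading components.
    return vt[:num_features].T


# Synthetic price paths for 20 assets over 250 periods (made-up data).
rng = np.random.default_rng(0)
prices = 100 * np.cumprod(1 + 0.01 * rng.standard_normal((250, 20)), axis=0)
features = pca_feature_vector(prices, num_features=5)
print(features.shape)
```

Each row of `features` is one asset's feature vector, which is the shape of input the clustering step works on.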

plot_pca_matrix(alpha: float = 0.2, figsize: tuple = (15, 15)) → list

Plots the feature vector on a scatter matrix.

Parameters:

alpha – (float) Opacity level to be used in the plot.
figsize – (tuple) Tuple describing the size of the plot.

Returns:

(list) List of Axes objects.

cluster_using_optics(**kwargs: dict) → list

Runs unsupervised learning on the feature vector produced by the dimensionality reduction step. The clustering method used is OPTICS, chosen mainly because it needs almost no parameter tuning.

Parameters: kwargs – (dict) Arguments to be passed to the clustering algorithm.
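As a rough illustration of what forwarding `kwargs` to the clustering algorithm can look like, the sketch below calls scikit-learn's `OPTICS` directly. That the method delegates to this class is an assumption here, and the feature matrix is synthetic.

```python
import numpy as np
from sklearn.cluster import OPTICS

# Synthetic feature vectors: two tight groups of 10 "assets" in 3 dimensions.
rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal(0.0, 0.05, size=(10, 3)),
    rng.normal(1.0, 0.05, size=(10, 3)),
])

# Keyword arguments a caller might supply; assumed to be forwarded unchanged.
optics_kwargs = {"min_samples": 3}
labels = OPTICS(**optics_kwargs).fit(features).labels_

# One label per asset; -1 marks points that OPTICS treats as noise.
print(labels)
```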

cluster_using_dbscan(**kwargs: dict) → list

Runs unsupervised learning on the feature vector produced by the dimensionality reduction step. The second clustering method is DBSCAN, for users who need a more hands-on approach to the clustering step, since this method is sensitive to its parameters.

Parameters: kwargs – (dict) Arguments to be passed to the clustering algorithm.
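The parameter sensitivity mentioned above comes mostly from `eps` and `min_samples`. A minimal sketch with scikit-learn's `DBSCAN` on synthetic data follows; that the method delegates to this class is an assumption, and the values chosen are for illustration only.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic feature vectors: two tight groups plus one far-away outlier.
rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal(0.0, 0.05, size=(10, 3)),
    rng.normal(1.0, 0.05, size=(10, 3)),
    [[5.0, 5.0, 5.0]],
])

# Keyword arguments a caller might supply; assumed to be forwarded unchanged.
dbscan_kwargs = {"eps": 0.5, "min_samples": 3}
labels = DBSCAN(**dbscan_kwargs).fit_predict(features)

# Points in the same dense region share a label; -1 marks noise.
print(labels)
```

With a much smaller `eps` the same groups would break up into noise, and with a much larger one they would merge. This is why a k-distance plot is useful for choosing the value.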

plot_clustering_info(n_dimensions: int = 2, method: str = '', figsize: tuple = (10, 10)) → matplotlib.axes._axes.Axes

Reduces the feature vector from the dimensionality reduction step further, down to ‘n_dimensions’ dimensions, using T-SNE, then plots the clusters found on a scatter plot.

Parameters:

n_dimensions – (int) Number of dimensions to be used in the T-SNE plot.
method – (str) String to be used as title in the plot.
figsize – (tuple) Tuple describing the size of the plot.

Returns:

(Axes) Axes object.

plot_3d_scatter_plot(tsne_df: pandas.DataFrame, no_of_classes: int, method: str = '') → matplotlib.axes._axes.Axes

Plots the clusters found on a 3d scatter plot. This method assumes that the data being plotted has already been reduced to three components using T-SNE, so that the dataset is visualized as clearly as possible.

Parameters:

tsne_df – (pd.DataFrame) Data reduced using T-SNE.
no_of_classes – (int) Number of unique clusters/classes.
method – (str) String to be used as title in the plot.

Returns:

(Axes) Axes object.

plot_2d_scatter_plot(fig: matplotlib.figure.Figure, tsne_df: pandas.DataFrame, no_of_classes: int, method: str = '') → matplotlib.axes._axes.Axes

Plots the clusters found on a 2d scatter plot.

Parameters:

fig – (Figure) Figure object, needed for the styling of the spline.
tsne_df – (pd.DataFrame) Data reduced using T-SNE.
no_of_classes – (int) Number of unique clusters/classes.
method – (str) String to be used as title in the plot.

Returns:

(Axes) Axes object.

plot_knee_plot() → matplotlib.axes._axes.Axes

Plots the k-distance graph, ordered from the largest to the smallest value.

The point where this plot shows an “elbow” gives the user a reference value for the ε parameter of the DBSCAN clustering method.

Returns: (Axes) Axes object.
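The curve behind such a plot can be computed as below. This is a sketch only: the helper `k_distances` and the choice `k=4` are hypothetical, and the source does not state which k the library uses.

```python
import numpy as np


def k_distances(features: np.ndarray, k: int = 4) -> np.ndarray:
    """Hypothetical helper: distance of each point to its k-th nearest neighbour,
    sorted from largest to smallest."""
    # Full pairwise Euclidean distance matrix (fine for small universes).
    diffs = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diffs ** 2).sum(axis=-1))
    # Column 0 of each sorted row is the point's zero distance to itself,
    # so column k holds the distance to the k-th nearest neighbour.
    kth = np.sort(dist, axis=1)[:, k]
    return np.sort(kth)[::-1]


# Synthetic feature vectors: two tight groups plus one far-away outlier.
rng = np.random.default_rng(0)
features = np.vstack([
    rng.normal(0.0, 0.05, size=(10, 3)),
    rng.normal(1.0, 0.05, size=(10, 3)),
    [[5.0, 5.0, 5.0]],
])
curve = k_distances(features, k=4)
print(curve[:3])
```

Plotting `curve` against its index gives the k-distance graph. The value at which the curve flattens out after its initial drop is a candidate for ε.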

static get_pairs_by_sector(sectoral_info: pandas.DataFrame) → list

Loops through all the tickers tagged by sector and generates the pairwise combinations of assets within each sector.

Parameters: sectoral_info – (pd.DataFrame) Asset names to be analyzed, each tagged with its respective sector.
Returns: (list) List of asset name pairs.
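The within-sector pairing can be sketched with `itertools.combinations`. The function name and the column names `ticker` and `sector` are assumptions made for this illustration; the source does not specify the expected layout of `sectoral_info`.

```python
from itertools import combinations

import pandas as pd


def pairs_by_sector(sectoral_info: pd.DataFrame,
                    ticker_col: str = "ticker",
                    sector_col: str = "sector") -> list:
    """Hypothetical helper: all unordered ticker pairs within each sector."""
    pairs = []
    # Group tickers by sector, then pair up the tickers inside each group.
    for _, group in sectoral_info.groupby(sector_col, sort=True):
        pairs.extend(combinations(group[ticker_col].tolist(), 2))
    return pairs


# Made-up sector tags for five tickers.
info = pd.DataFrame({
    "ticker": ["XOM", "CVX", "COP", "JPM", "BAC"],
    "sector": ["Energy", "Energy", "Energy", "Financials", "Financials"],
})
pairs = pairs_by_sector(info)
print(pairs)
```

Three energy tickers give three pairs and two financial tickers give one. No pair crosses a sector boundary.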
