arbitragelab.ml_approach.optics_dbscan_pairs_clustering
This module implements the ML-based pairs selection framework described by Simão Moraes Sarmento and Nuno Horta in “A Machine Learning based Pairs Trading Investment Strategy”.
Module Contents
Classes
- class OPTICSDBSCANPairsClustering(universe: pandas.DataFrame)
Implementation of the pairs selection framework proposed in the paper “A Machine Learning based Pairs Trading Investment Strategy”.
The method consists of two parts: dimensionality reduction and clustering of features.
- dimensionality_reduction_by_components(num_features: int = 10)
Converts the price universe supplied in the constructor into returns and scales them.
Then reduces the resulting data with PCA down to the number of dimensions to be used as the feature vector in the clustering step. The feature vector should ideally have fewer than 15 dimensions.
- Parameters:
num_features – (int) Number of PCA components (n_components) to keep in the feature vector.
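The steps described above can be sketched with scikit-learn as follows. This is a minimal illustration and not the library's actual code: the helper name `reduce_universe`, the synthetic tickers, and the choice to use the transposed PCA loadings (one row per asset) as the feature vector are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler


def reduce_universe(prices: pd.DataFrame, num_features: int = 10) -> pd.DataFrame:
    """Hypothetical helper: prices -> scaled returns -> per-asset PCA loadings."""
    # Convert prices to simple returns and drop the first (undefined) row.
    returns = prices.pct_change().dropna()
    # Standardize each asset's return series to zero mean and unit variance.
    scaled = StandardScaler().fit_transform(returns)
    # Fit PCA on the (time x assets) matrix. components_ has shape
    # (num_features, n_assets), so its transpose has one row per asset.
    pca = PCA(n_components=num_features, random_state=0)
    pca.fit(scaled)
    return pd.DataFrame(pca.components_.T, index=prices.columns)


# Synthetic universe: 250 days of prices for 20 made-up tickers.
rng = np.random.default_rng(0)
prices = pd.DataFrame(
    100 * np.exp(np.cumsum(rng.normal(0, 0.01, size=(250, 20)), axis=0)),
    columns=[f"ASSET_{i}" for i in range(20)],
)
feature_vector = reduce_universe(prices, num_features=5)
print(feature_vector.shape)
```

Each row of `feature_vector` describes one asset, which is the shape a clustering algorithm expects as input.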
- plot_pca_matrix(alpha: float = 0.2, figsize: tuple = (15, 15)) → list
Plots the feature vector on a scatter matrix.
- Parameters:
alpha – (float) Opacity level to be used in the plot.
figsize – (tuple) Tuple describing the size of the plot.
- Returns:
(list) List of Axes objects.
- cluster_using_optics(**kwargs: dict) → list
Performs unsupervised learning on the feature vector produced by the dimensionality reduction step. The clustering method used is OPTICS, chosen mainly because it requires almost no parameter tuning.
- Parameters:
kwargs – (dict) Arguments to be passed to the clustering algorithm.
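For orientation, a rough sketch of OPTICS clustering on a feature matrix using scikit-learn's `OPTICS` class directly. The synthetic `features` array and the `min_samples=5` value are assumptions for the example; whether this class method forwards its keyword arguments to the same estimator is not confirmed by this page.

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(1)
# Two tight synthetic groups plus scattered points stand in for the
# per-asset feature vector (65 assets, 3 features each).
features = np.vstack([
    rng.normal(loc=0.0, scale=0.05, size=(30, 3)),
    rng.normal(loc=2.0, scale=0.05, size=(30, 3)),
    rng.uniform(-5, 5, size=(5, 3)),
])

# min_samples is the main knob; no distance threshold has to be chosen.
labels = OPTICS(min_samples=5).fit_predict(features)

# Label -1 marks points that OPTICS treats as noise.
clusters = {}
for asset_idx, label in enumerate(labels.tolist()):
    if label != -1:
        clusters.setdefault(label, []).append(asset_idx)
print({label: len(members) for label, members in clusters.items()})
```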
- cluster_using_dbscan(**kwargs: dict) → list
Performs unsupervised learning on the feature vector produced by the dimensionality reduction step. This second clustering method, DBSCAN, is for users who need more hands-on control over the clustering step, since its results are sensitive to the parameters chosen.
- Parameters:
kwargs – (dict) Arguments to be passed to the clustering algorithm.
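The parameter sensitivity mentioned above mostly concerns `eps` and `min_samples`. A minimal sketch with scikit-learn's `DBSCAN`, assuming a synthetic feature matrix and hand-picked parameter values (neither comes from the library):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(2)
# Two well-separated synthetic groups plus a few scattered points.
features = np.vstack([
    rng.normal(loc=0.0, scale=0.05, size=(30, 3)),
    rng.normal(loc=2.0, scale=0.05, size=(30, 3)),
    rng.uniform(-5, 5, size=(5, 3)),
])

# eps is the neighbourhood radius, min_samples the density threshold.
# Both are illustrative values that suit this toy data only.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(features)

# Count clusters, ignoring the noise label -1.
n_clusters = len(set(labels.tolist()) - {-1})
print(n_clusters)
```

Shrinking `eps` far below the spread of a group turns its members into noise, while a very large `eps` merges separate groups, which is why the value usually has to be tuned per dataset.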
- plot_clustering_info(n_dimensions: int = 2, method: str = '', figsize: tuple = (10, 10)) → matplotlib.axes._axes.Axes
Reduces the feature vector from the dimensionality reduction step further, down to ‘n_dimensions’ dimensions using t-SNE, and then plots the clusters found on a scatter plot.
- Parameters:
n_dimensions – (int) Number of dimensions to be used in the t-SNE plot.
method – (str) String to be used as title in the plot.
figsize – (tuple) Tuple describing the size of the plot.
- Returns:
(Axes) Axes object.
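The t-SNE reduction that precedes the plot can be approximated as below. This is only a sketch using scikit-learn's `TSNE`: the synthetic data, the `perplexity` value, and the column names are assumptions, and the plotting itself is left out.

```python
import numpy as np
import pandas as pd
from sklearn.manifold import TSNE

rng = np.random.default_rng(3)
# Synthetic feature vector: 40 assets, 5 PCA features each.
features = rng.normal(size=(40, 5))
# Made-up cluster labels, used only to tag the embedded points.
labels = rng.integers(0, 3, size=40)

n_dimensions = 2
# perplexity must be smaller than the number of samples.
embedding = TSNE(
    n_components=n_dimensions, perplexity=10, random_state=0
).fit_transform(features)

tsne_df = pd.DataFrame(embedding, columns=["x", "y"])
tsne_df["label"] = labels
print(tsne_df.shape)
```

A frame like `tsne_df` can then be drawn with any scatter plot routine, coloring the points by `label`.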
- plot_3d_scatter_plot(tsne_df: pandas.DataFrame, no_of_classes: int, method: str = '') → matplotlib.axes._axes.Axes
Plots the clusters found on a 3D scatter plot. This method assumes that the data being plotted has been pre-processed using t-SNE constrained to three components, to provide the best possible visualization of the dataset.
- Parameters:
tsne_df – (pd.DataFrame) Data reduced using T-SNE.
no_of_classes – (int) Number of unique clusters/classes.
method – (str) String to be used as title in the plot.
- Returns:
(Axes) Axes object.
- plot_2d_scatter_plot(fig: matplotlib.figure.Figure, tsne_df: pandas.DataFrame, no_of_classes: int, method: str = '') → matplotlib.axes._axes.Axes
Plots the clusters found on a 2D scatter plot.
- Parameters:
fig – (Figure) Figure object, needed for the styling of the spline.
tsne_df – (pd.DataFrame) Data reduced using T-SNE.
no_of_classes – (int) Number of unique clusters/classes.
method – (str) String to be used as title in the plot.
- Returns:
(Axes) Axes object.
- plot_knee_plot() → matplotlib.axes._axes.Axes
Plots the k-distance graph, ordered from the largest to the smallest value.
The point where this plot shows an “elbow” gives the user a reference value for the ε parameter of the DBSCAN clustering method.
- Returns:
(Axes) Axes object.
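A k-distance curve of the kind described above can be computed as in the following sketch. It uses scikit-learn's `NearestNeighbors`; the value of `k` and the synthetic data are assumptions, and the library may compute the curve differently.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(4)
features = rng.normal(size=(40, 3))  # synthetic feature vector

k = 5  # illustrative choice, often matched to DBSCAN's min_samples
# Ask for k + 1 neighbours because each point is returned as its own
# nearest neighbour (distance 0) when querying the training data.
nn = NearestNeighbors(n_neighbors=k + 1).fit(features)
distances, _ = nn.kneighbors(features)

# Distance to the k-th neighbour, sorted from largest to smallest.
k_distances = np.sort(distances[:, -1])[::-1]
print(k_distances[:3])
```

Plotting `k_distances` against its index gives the curve. The distance at which the curve bends sharply is a candidate for ε.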
- static get_pairs_by_sector(sectoral_info: pandas.DataFrame) → list
Loops through all the tickers tagged by sector and generates the pairwise combinations of the assets within each sector.
- Parameters:
sectoral_info – (pd.DataFrame) List of asset name pairs to be analyzed tagged with respective sector.
- Returns:
(list) List of asset name pairs.
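The per-sector pairing can be illustrated with `itertools.combinations`. The layout of `sectoral_info` is not specified on this page, so the `ticker` and `sector` column names, the helper name `pairs_by_sector`, and the sample tickers are all hypothetical.

```python
from itertools import combinations

import pandas as pd


def pairs_by_sector(sectoral_info: pd.DataFrame,
                    ticker_col: str = "ticker",
                    sector_col: str = "sector") -> list:
    """Hypothetical helper: all within-sector ticker pairs."""
    pairs = []
    # sort=False keeps sectors in order of first appearance.
    for _, group in sectoral_info.groupby(sector_col, sort=False):
        # Every unordered pair of tickers inside this sector.
        pairs.extend(combinations(group[ticker_col], 2))
    return pairs


sectoral_info = pd.DataFrame({
    "ticker": ["AAA", "BBB", "CCC", "DDD", "EEE"],
    "sector": ["Tech", "Tech", "Tech", "Energy", "Energy"],
})
pairs = pairs_by_sector(sectoral_info)
print(pairs)
```

A sector with n tickers contributes n(n-1)/2 pairs, so the three Tech tickers yield three pairs and the two Energy tickers yield one.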