Note

The following documentation closely follows a book by Simão Moraes Sarmento, and Nuno Horta: “A Machine Learning based Pairs Trading Investment Strategy”.

Quantile Time Series Strategy

The authors propose a model based on the predicted future data - price spread. Originally, three approaches were proposed and the most suitable, based on the empirical testing, was chosen.

The first approach is based on modeling the spread directly from the prices of two legs composing a pair, later if a predicted return is higher than a predefined threshold, a trade is made. The second approach is to look at the trends of the spread series and use a momentum strategy to trade the spread. However, the authors find that these two approaches performed poorly in practice. The approach that worked the best in empirical testing consisted of first forecasting the future spread values and then generating the trading signals based on the difference between the forecasted and actual values. The expectation in this method is that the investor can benefit from an abrupt movement of the spread value.

Auto ARIMA

This strategy needs a forecasting algorithm to work. In general, the algorithms for this purpose can be subdivided into two categories - parametric and non-parametric.

The former supposes that the underlying process has a particular structure that can be described with a small number of parameters. However, the models in this approach have their limitations. The latter make no structural assumptions about the underlying structure of the process. These can be Artificial Neural Networks.

Currently, in this module, the ARIMA model with automatically fit parameters is used to forecast the spread values. As the input price series to this strategy are required to be cointegrated. This implies that the spread is stationary (during the formation period) and therefore the process can be described with a simpler model.

The ARIMA model describes a stochastic process as a composition of polynomials. The first polynomial called the autoregression (AR) is regressing the variable at time \(t\) on its own lagged values. The second polynomial is the moving average (MA) and it’s modeling the prediction error as a linear combination of lagged error terms and a time series expected value. The integrated (I) part of the model denotes the differences of series for stationarity. For our purposes, the input series should already be stationary.

The ARIMA(p,d,q) model can be represented as:

\[x_{t} = c + \epsilon_{t} + \sum_{i=1}^{p} \phi_{i} x_{t-1} + \sum_{i=1}^{q} \Theta_{i} \epsilon_{t-i}\]

Where the \(c\) included a constant and a mean value of the \(x_{t}\) series. The \(\epsilon_{t}, \epsilon_{t-1}, ..., \epsilon_{t-q}\) are random variables corresponding to white noise error terms in the corresponding time instances. The \(\phi_{t}, ..., \phi_{p}, \Theta_{1}, ..., \Theta_{q}\) are the model parameters.

If the input series are cointegrated of order zero, then \(x_{t} = X_{t}\), where \(X_{t}\) is the input time series. If the cointegration order is one, then \(x_{t} = X_{t} - X_{t-1}\) and so on.

The best fitting ARIMA model is chosen using the Akaike Information Criterion (AIC):

\[AIC = 2k - 2ln(L)\]

Where \(k\) is the number of moel parameters and \(L\) is the likelihood function.

../_images/auto_arima_prediction.png

An example showing predicted spread values using the Auto ARIMA approach and real spread values.

Implementation

class AutoARIMAForecast(start_p: int = 0, start_q: int = 0, max_p: int = 5, max_q: int = 5)

Auto ARIMA forecast generator function.

__init__(start_p: int = 0, start_q: int = 0, max_p: int = 5, max_q: int = 5)

Init AutoARIMA (p, i, q) prediction class.

Parameters:
  • start_p – (int) Starting value of p (number of time lags) to search in auto ARIMA procedure.

  • start_q – (int) Starting value of q (moving average window) to search in auto ARIMA procedure.

  • max_p – (int) Maximum possible value of p.

  • max_q – (int) Maximum possible value of q.

get_best_arima_model(y_train: Series, verbose: bool = False, silence_warnings: bool = True)

Using the AIC approach from pmdarima library, choose the best fit ARIMA(d, p, q) parameters.

Parameters:
  • y_train – (pd.Series) Training series.

  • verbose – (bool) Flag to print model fit logs.

  • silence_warnings – (bool) Flag to silence warnings from the Auto ARIMA model - convergence warnings etc.

predict(y: Series, retrain_freq: int = 1, train_window: int | None = None, silence_warnings: bool = True) Series

Predict out-of-sample series using already fit ARIMA model. The algorithm retrains the model with retrain_freq either by appending new observations to train data (train_window = None) or by using the latest train_window observations + latest out-of-sample observations y.

Parameters:
  • y – (pd.Series) Out-of-sample series (used to generate rolling forecast).

  • retrain_freq – (int) Model retraining frequency. Model is fit on every train_freq step.

  • train_window – (int) Number of data points from train dataset used in model retrain. If None, use all train set.

  • silence_warnings – (bool) Flag to silence warnings from the Auto ARIMA model - convergence warnings etc.

Returns:

(pd.Series) Series of forecasted values.

Quantile Time Series Strategy

In this section, the method is described in more detail. The \(S_{t}\) and \(S^{*}_{t}\) are defined as the true and predicted values of the spread at time \(t\). The signals generation is based on the predicted change:

\[\delta_{t+1} = \frac{S^{*}_{t+1} - S_{t}}{S_{t}} * 100\]

Market entry conditions are based on the thresholds (\(\alpha_{L}, \alpha_{S}\)):

\[\begin{split}Position = \begin{cases} \begin{split} Long &\text{, if } \Delta_{t+1} \ge \alpha_{L} \\ Short &\text{, if } \Delta_{t+1} \le \alpha_{S} \\ No Position &\text{, otherwise.} \end{split} \end{cases}\end{split}\]

One approach that can be taken is to solve a maximization problem to find the entry and exit conditions that would result in a maximum profit. However, this would result in data-snooping and adding extra complexity, as mentioned by the authors.

The approach that is used in this strategy is based on picking the quantiles of the percentage change distribution during a formation period. First, the spread percentages at any time \(t\) are calculated as:

\[x_{t} = \frac{S_{t} - S_{t-1}}{S_{t-1}} * 100\]

Next, for \(f(x)\) - the distribution of percentage changes, the negative and the positive changes are considered separately to get the needed quantile values:

../_images/quantile_thresholds.png

Spread percentage change distributions. An example from the book by Simão Moraes Sarmento, and Nuno Horta “A Machine Learning based Pairs Trading Investment Strategy”.

Choosing quantiles as thresholds fits the idea of targeting the abrupt changes that occur frequently enough. The authors recommend picking either 10% or 20% quantiles for thresholds. The illustration of the quantile time series strategy is provided below.

../_images/trading_example.png

Proposed forecasting-based strategy. An example from the book by Simão Moraes Sarmento, and Nuno Horta “A Machine Learning based Pairs Trading Investment Strategy”.

On the above image, the triangles represent the times when one of the thresholds has been triggered and the squares provide the information regarding the predicted direction.

../_images/model_diagram.png

Model diagram as presented in the book by Simão Moraes Sarmento, and Nuno Horta “A Machine Learning based Pairs Trading Investment Strategy”.

Implementation

class QuantileTimeSeriesTradingStrategy(long_quantile: float = 0.9, short_quantile: float = 0.1)

The strategy which implements a quantile-based time series approach in mean-reversion trading. First, we define top quantile of positive spread (y) differences and bottom quantile of negative spread differences. Secondly, we use time series spread prediction y_hat (it can be user-specified prediction, ARIMA, ANN, RNN, etc.) We enter a position if y_hat - y <= bottom quantile or y_hat - y >= top quantile.

This strategy is described in “A Machine Learning based Pairs Trading Investment Strategy” (pages 37-43) by Simão Moraes Sarmento and Nuno Horta.

__init__(long_quantile: float = 0.9, short_quantile: float = 0.1)

Class constructor.

Parameters:
  • long_quantile – (float) Positive spread differences quantile used as long entry threshold.

  • short_quantile – (float) Negative spread differences quantile used as short entry threshold.

fit_thresholds(spread_series: Series)

Define quantile-based long/short difference thresholds from spread series.

Parameters:

spread_series – (pd.Series) Spread series used to fit thresholds.

get_allocation(predicted_difference: float, exit_threshold: float = 0) int

Get target allocation (-1, 0, 1) based on current spread value, predicted value, and exit threshold. -1/1 means either to open a new short/long position or stay in a long/short trade (if the position has been already opened). 0 means exit the position.

Parameters:
  • predicted_difference – (float) Spread predicted value - current spread value

  • exit_threshold – (float) Difference between predicted and current value threshold to close the trade.

Returns:

(int) Trade signal: -1 (short), 0 (exit current position/stay in cash), 1(long).

plot_thresholds()

Plot KDE-plots of positive and negative differences vs long/short thresholds.

Returns:

(plt.axes) The KDE plot.

Examples

Code Example

>>> import pandas as pd
>>> from arbitragelab.time_series_approach.arima_predict import AutoARIMAForecast
>>> from arbitragelab.time_series_approach.quantile_time_series import (
...     QuantileTimeSeriesTradingStrategy,
... )
>>> # Load Non-negative Crude and Gasoline futures data.
>>> wti_contract_url = "https://raw.githubusercontent.com/hudson-and-thames/example-data/main/arbitrage_lab_data/NonNegative_CL_forward_roll.csv"
>>> rbob_contract_url = "https://raw.githubusercontent.com/hudson-and-thames/example-data/main/arbitrage_lab_data/NonNegative_nRB_forward_roll.csv"
>>> wti_contract_df = pd.read_csv(wti_contract_url, parse_dates=["Dates"]).set_index(
...     "Dates"
... )
>>> rbob_contract_df = pd.read_csv(rbob_contract_url, parse_dates=["Dates"]).set_index(
...     "Dates"
... )
>>> working_df = pd.concat([wti_contract_df, rbob_contract_df], axis=1)
>>> # working_df.index = pd.to_datetime(working_df.index)
>>> working_df.columns = ["wti", "gasoline"]
>>> working_df.dropna(inplace=True)
>>> # Calculate naive spread between gasoline and wti.
>>> spread = working_df["gasoline"] - working_df["wti"]
>>> # Split the dataset into two parts -- the first one for model training...
>>> data_model_fitting = spread.loc[:"2019-01-01"]
>>> # ...and the second one for signals generation
>>> data_signals_generation = spread.loc["2019-01-01":]
>>> # Create the ARIMA model
>>> arima_model = AutoARIMAForecast(start_p=1, start_q=1, max_p=10, max_q=10)
>>> # Finding the best fitting model
>>> arima_model.get_best_arima_model(y_train=data_model_fitting)
>>> # Getting the thresholds at 20% and 80% quantiles
>>> time_series_trading = QuantileTimeSeriesTradingStrategy(
...     long_quantile=0.8, short_quantile=0.2
... )
>>> # Calculate the thresholds for the data
>>> time_series_trading.fit_thresholds(data_model_fitting)
>>> # Plot thresholds used for trading
>>> fig = time_series_trading.plot_thresholds()
>>> fig  
array(...)
>>> # Generate out-of-sample ARIMA prediction
>>> oos_prediction = arima_model.predict(
...     y=data_signals_generation, silence_warnings=True
... )  
Progress...
>>> # Use the difference between prediction and actual value to trade the spread
>>> allocations = []
>>> for prediction, actual in zip(oos_prediction, data_signals_generation):
...     allocation = time_series_trading.get_allocation(
...         predicted_difference=prediction - actual, exit_threshold=0
...     )
...     allocations.append(allocation)
...
>>> allocations[:3]
[0, 0, 0]
>>> # Get the trading signals created using quantile time series strategy
>>> positions = pd.Series(
...     index=data_signals_generation.index, data=time_series_trading.positions
... )
>>> positions[:3]  
Dates
2019-01-01    0
2019-01-02    0
2019-01-03    0
dtype: int64

Research Notebooks

The following research notebook can be used to better understand the time series approach described above.

References