arbitragelab.util.generate_dataset

This module generates a synthetic classification dataset of INFORMED, REDUNDANT, and NOISE explanatory variables, based on the book Machine Learning for Asset Managers (code snippet 6.1).

Module Contents

Functions

get_classification_data([n_features, n_informative, ...])

A function to generate synthetic classification data sets.

get_classification_data(n_features=100, n_informative=25, n_redundant=25, n_samples=10000, random_state=0, sigma=0.0)

A function to generate synthetic classification data sets.

Parameters:
  • n_features – (int) Total number of features to be generated (i.e. informative + redundant + noisy).

  • n_informative – (int) Number of informative features.

  • n_redundant – (int) Number of redundant features.

  • n_samples – (int) Number of samples (rows) to generate.

  • random_state – (int) Random seed.

  • sigma – (float) Standard deviation of the Gaussian noise added to the redundant features, which controls the substitution effect in the dataset. The lower the value of sigma, the closer each redundant feature is to its source feature and the greater the substitution effect.

Returns:

(pd.DataFrame, pd.Series) X and y as features and labels respectively.
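
The effect of sigma can be illustrated with a minimal standard-library sketch. It is not the module's implementation: it assumes, going by the parameter description above, that a redundant feature is a copy of an informative feature plus Gaussian noise scaled by sigma. The names make_redundant and pearson are hypothetical helpers introduced only for this illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def make_redundant(informative, sigma, rng):
    """Hypothetical helper: copy a column and add Gaussian noise scaled by sigma.

    Assumption: this mirrors how the sigma parameter is described, not the
    library's actual code.
    """
    return [x + rng.gauss(0.0, 1.0) * sigma for x in informative]


rng = random.Random(0)  # fixed seed, playing the role of random_state
informative = [rng.gauss(0.0, 1.0) for _ in range(10_000)]

# Correlation between the informative column and its redundant copy
# for increasing noise levels.
correlations = {}
for sigma in (0.0, 0.5, 2.0):
    redundant = make_redundant(informative, sigma, rng)
    correlations[sigma] = pearson(informative, redundant)
    print(f"sigma={sigma}: correlation={correlations[sigma]:.3f}")
```

With sigma=0.0 the redundant column is an exact copy, so the two features are perfect substitutes. As sigma grows, the correlation falls (for a unit-variance source feature it is roughly 1/sqrt(1 + sigma^2)) and the substitution effect weakens.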