Batch

This module automatically splits a DataFrame into smaller DataFrames (by clusters of columns) and runs model training and text generation on each sub-DataFrame independently.

Each sub-DataFrame can then be concatenated back into one final synthetic dataset.

For example usage, please see our Jupyter Notebook.

class gretel_synthetics.batch.Batch(checkpoint_dir: str, input_data_path: str, headers: List[str], config: gretel_synthetics.config.LocalConfig, gen_data_count: int = 0)

A representation of a synthetic data workflow. It should not be used directly. This object is created automatically by the primary batch handler, such as DataFrameBatch. This class holds all of the necessary information for training, data generation and DataFrame re-assembly.

add_valid_data(data: gretel_synthetics.generator.gen_text)

Take a gen_text object and add the generated line to the generated data stream.

get_validator()

If a custom validator is set, we return that. Otherwise, we return the built-in validator, which simply checks if a generated line has the right number of values based on the number of headers for this batch.

This at least ensures that the resulting DataFrame has the right shape.
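The built-in check described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the helper name `make_field_count_validator` is hypothetical, and the comma delimiter and naive `split` are assumptions.

```python
from typing import Callable, List


def make_field_count_validator(headers: List[str], delimiter: str = ",") -> Callable[[str], bool]:
    """Hypothetical helper: build a check that a line has one value per header."""
    expected = len(headers)

    def validate(line: str) -> bool:
        # Naive split (assumption): quoted delimiters inside values are not handled.
        return len(line.split(delimiter)) == expected

    return validate


validate = make_field_count_validator(["age", "city", "income"])
ok_line = validate("34,Lisbon,52000")  # three values for three headers
short_line = validate("34,Lisbon")     # one value missing
```

A line that passes this check is only guaranteed to have the right number of fields; the values themselves are not inspected.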

load_validator_from_file()

Load a saved validation object if it exists.

reset_gen_data()

Reset all objects that accumulate or track synthetic data generation.

set_validator(fn: Callable, save=True)

Assign a validation callable to this batch. If save is True, the validator is also pickled and saved so that it can be loaded later.

property synthetic_df

Get a DataFrame constructed from the generated lines.

class gretel_synthetics.batch.DataFrameBatch(*, df: pandas.core.frame.DataFrame = None, batch_size: int = 15, batch_headers: List[List[str]] = None, config: dict = None, mode: str = 'write', checkpoint_dir: str = None)

Create a multi-batch trainer / generator. On creation, the directory structure for storing models and training data is created automatically under the “checkpoint_dir” location provided in the config template. There is one directory per batch, named “batch_N”, where N is the batch number, starting from 0.

Training and generation can run for a single batch at a time, or across all batches.

Example

When creating this object, you must explicitly create the training data from the input DataFrame before training models:

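from gretel_synthetics.batch import DataFrameBatch

# my_df is the source DataFrame; my_config is the template config dict
my_batch = DataFrameBatch(df=my_df, config=my_config)
my_batch.create_training_data()
my_batch.train_all_batches()

Parameters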
  • df – The input, source DataFrame

  • batch_size – If batch_headers is not provided, the columns of the source DataFrame are automatically split into batches of N columns, where N is batch_size.

  • batch_headers – A list of lists of strings that controls the number of batches. The number of inner lists is the number of batches, and each inner list names the columns that belong to that batch.

  • config – A template training config to use. It is passed as kwargs to each Batch’s synthetic configuration.

Note

When providing a config, the source of training data is not necessary, only the checkpoint_dir is needed. Each batch will control its input training data path after it creates the training dataset.
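To illustrate how batch_size relates to batch_headers and to the per-batch directories, here is a small sketch. The helper `chunk_headers` is hypothetical, and grouping columns in their original order is an assumption; the library may group them differently.

```python
from typing import List, Sequence


def chunk_headers(columns: Sequence[str], batch_size: int) -> List[List[str]]:
    """Hypothetical helper: split column names into consecutive groups of batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [list(columns[i:i + batch_size]) for i in range(0, len(columns), batch_size)]


columns = ["id", "age", "city", "income", "score"]
batch_headers = chunk_headers(columns, batch_size=2)

# One directory per batch, numbered from 0, as described for checkpoint_dir.
batch_dirs = [f"batch_{n}" for n in range(len(batch_headers))]
```

With five columns and a batch size of two, the last batch holds the single leftover column. Passing an explicit batch_headers list instead gives full control over which columns are modeled together.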

batch_to_df(batch_idx: int) → pandas.core.frame.DataFrame

Extract a synthetic data DataFrame from a single batch.

Parameters

batch_idx – The batch number

Returns

A DataFrame with synthetic data

batches: Dict[int, Batch] = None

A mapping of batch number to Batch object. The batch number (key) starts at 0 and increases by one for each batch in use.

batches_to_df() → pandas.core.frame.DataFrame

Convert all batches to a single synthetic data DataFrame.

Returns

A single DataFrame that is the concatenation of all the batch DataFrames.
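Conceptually, the re-assembly places the columns of each batch side by side. The sketch below shows the idea on plain lists of dicts rather than DataFrames; `concat_columns` is a hypothetical helper, and requiring every batch to have the same number of rows is an assumption of this sketch.

```python
from typing import Dict, List

Row = Dict[str, str]


def concat_columns(batch_tables: List[List[Row]]) -> List[Row]:
    """Hypothetical helper: merge row i of every batch into one wide row."""
    lengths = {len(table) for table in batch_tables}
    if len(lengths) > 1:
        raise ValueError("all batches must have the same number of rows")
    merged = []
    for parts in zip(*batch_tables):
        wide_row: Row = {}
        for part in parts:
            wide_row.update(part)  # add this batch's columns to the wide row
        merged.append(wide_row)
    return merged


batch_0 = [{"id": "1", "age": "34"}, {"id": "2", "age": "51"}]
batch_1 = [{"city": "Lisbon"}, {"city": "Osaka"}]
full_rows = concat_columns([batch_0, batch_1])
```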

create_training_data()

Split the original DataFrame into N smaller DataFrames. Each smaller DataFrame will have the same number of rows, but a subset of the columns from the original DataFrame.

This method iterates over each Batch object and assigns a smaller training DataFrame to the training_df attribute of the object.

Finally, a training CSV is written to disk in the specific batch directory.
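The column-wise split can be pictured with the standard library alone. In this sketch the function names are hypothetical, rows are plain dicts instead of a DataFrame, and the CSV is written to an in-memory buffer with a header row; whether the real training CSV includes a header is not stated here and is an assumption.

```python
import csv
import io
from typing import Dict, List

Row = Dict[str, str]


def split_rows_by_columns(rows: List[Row], batch_headers: List[List[str]]) -> List[List[Row]]:
    """Hypothetical helper: one sub-table per batch, same rows, fewer columns."""
    return [[{name: row[name] for name in headers} for row in rows] for headers in batch_headers]


def to_csv_text(rows: List[Row], headers: List[str]) -> str:
    """Serialize one sub-table to CSV text (header row included by assumption)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=headers, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"id": "1", "age": "34", "city": "Lisbon"},
    {"id": "2", "age": "51", "city": "Osaka"},
]
batch_headers = [["id", "age"], ["city"]]
sub_tables = split_rows_by_columns(rows, batch_headers)
csv_texts = [to_csv_text(table, headers) for table, headers in zip(sub_tables, batch_headers)]
```

Every sub-table keeps all of the original rows, which is what allows the batches to be lined up again after generation.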

generate_all_batch_lines(max_invalid=1000, raise_on_failed_batch: bool = False, num_lines: int = None, parallelism: int = 0) → dict

Generate synthetic lines for all batches. Lines for each batch are added to the individual Batch objects. Once generation is done, you may re-assemble the dataset into a DataFrame.

Example:

my_batch.generate_all_batch_lines()
# Wait for all generation to complete
synthetic_df = my_batch.batches_to_df()

Parameters
  • max_invalid – The maximum number of invalid lines allowed per batch. If this number is exceeded for any batch, generation will stop.

  • raise_on_failed_batch – If True, then an exception will be raised if any single batch fails to generate the requested number of lines. If False, then the failed batch will be set to False in the result dictionary from this method.

  • num_lines – The number of lines to create from each batch. If None then the value from the config template will be used.

  • parallelism – The number of concurrent workers to use. A value of 1 disables parallelization, while a non-positive value means “number of CPUs + x” (for example, 0, the default in the signature above, uses as many workers as there are CPUs). A floating-point value is interpreted as a fraction of the available CPUs, rounded down.

Returns

A dictionary of batch number to a bool value that shows if each batch was able to generate the full number of requested lines:

{
    0: True,
    1: True
}
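When raise_on_failed_batch is False, the returned dictionary is the only signal that a batch fell short, so it is worth checking before re-assembly. A minimal sketch, using a hand-written result dictionary and a hypothetical helper name:

```python
from typing import Dict, List


def failed_batches(result: Dict[int, bool]) -> List[int]:
    """Hypothetical helper: batch numbers that did not produce all requested lines."""
    return sorted(idx for idx, ok in result.items() if not ok)


# Example result shaped like the dictionary above; batch 1 is marked as failed.
result = {0: True, 1: False, 2: True}
failed = failed_batches(result)
all_succeeded = not failed
```

The batch numbers collected this way could then be retried one at a time with generate_batch_lines.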

generate_batch_lines(batch_idx: int, max_invalid=1000, raise_on_exceed_invalid: bool = False, num_lines: int = None, parallelism: int = 0) → bool

Generate lines for a single batch. Lines generated are added to the underlying Batch object for each batch. The lines can be accessed after generation and re-assembled into a DataFrame.

Parameters
  • batch_idx – The batch number

  • max_invalid – The maximum number of invalid lines that can be generated. If this is exceeded, generation will stop.

  • raise_on_exceed_invalid – If True and the number of invalid lines generated exceeds the max_invalid amount, the error thrown by the generation module is re-raised, which will interrupt the running process. Otherwise, the caught exception is not raised and the method returns False, indicating that the batch failed to generate all lines.

  • num_lines – The number of lines to generate, if None, then we use the number from the batch’s config

  • parallelism – The number of concurrent workers to use. A value of 1 disables parallelization, while a non-positive value means “number of CPUs + x” (for example, 0, the default in the signature above, uses as many workers as there are CPUs). A floating-point value is interpreted as a fraction of the available CPUs, rounded down.
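One way to read the parallelism rules is shown below. This is an interpretation of the description, not the library's code: `resolve_workers` is a hypothetical name, the CPU count is passed in explicitly to keep the example deterministic, and clamping the result to at least one worker is an assumption.

```python
import math
from typing import Union


def resolve_workers(parallelism: Union[int, float], cpu_count: int) -> int:
    """Hypothetical helper: turn a parallelism setting into a worker count."""
    if parallelism <= 0:
        # Non-positive x means "number of CPUs + x".
        workers = cpu_count + parallelism
    elif isinstance(parallelism, float):
        # A float is a fraction of the available CPUs, rounded down.
        workers = math.floor(cpu_count * parallelism)
    else:
        # A positive integer is used as-is; 1 means no parallelization.
        workers = parallelism
    # Assumption: never return fewer than one worker.
    return max(1, int(workers))


all_cpus = resolve_workers(0, cpu_count=8)
leave_two_free = resolve_workers(-2, cpu_count=8)
half_of_cpus = resolve_workers(0.5, cpu_count=8)
serial = resolve_workers(1, cpu_count=8)
```

In real use the CPU count would typically come from `os.cpu_count()`.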

set_batch_validator(batch_idx: int, validator: Callable)

Set a validator for a specific batch. If a validator is configured for a batch, each generated record from that batch will be sent to the validator.

Parameters
  • batch_idx – The batch number.

  • validator – A callable that should take exactly one argument, which will be the raw line generated from the generate_text function.
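A custom validator is an ordinary callable that receives the raw generated line. The sketch below assumes a three-column batch (age, city, income) with comma-separated values. How a validator should signal rejection is not stated in this section; raising an exception for a bad line and returning True for a good one is an assumption made here.

```python
def validate_record(line: str) -> bool:
    """Hypothetical validator for a batch with columns: age, city, income."""
    fields = line.split(",")
    if len(fields) != 3:
        raise ValueError("expected 3 fields, got %d" % len(fields))
    age = int(fields[0])  # raises ValueError if age is not an integer
    if not 0 <= age <= 120:
        raise ValueError("age out of range: %d" % age)
    return True


accepted = validate_record("34,Lisbon,52000")

# Registering it for batch 0 would then look like this
# (my_batch is a DataFrameBatch, as in the example above):
# my_batch.set_batch_validator(0, validate_record)
```

Defining the validator as a module-level function rather than a lambda keeps it compatible with pickling, which matters when validators are saved for later loading.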

train_all_batches()

Train a model for each batch.

train_batch(batch_idx: int)

Train a model for a single batch. All model information will be written into that batch’s directory.

Parameters

batch_idx – The index of the batch, from the batches dictionary