Batch

This module automatically splits a DataFrame into smaller DataFrames (by clusters of columns) and performs model training and text generation on each sub-DF independently.

The generated sub-DFs can then be concatenated back into one final synthetic dataset.

For example usage, please see our Jupyter Notebook.
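As a rough end-to-end sketch of that workflow (the input path and variable names are illustrative, and PATH_HOLDER is assumed to be importable from gretel_synthetics.batch):

import pandas as pd

from gretel_synthetics.config import TensorFlowConfig
from gretel_synthetics.batch import DataFrameBatch, PATH_HOLDER

source_df = pd.read_csv("my-training-data.csv")  # illustrative path

config = TensorFlowConfig(
    input_data_path=PATH_HOLDER,  # each batch sets its own input path
    checkpoint_dir="./checkpoints",
)

batcher = DataFrameBatch(df=source_df, config=config)
batcher.create_training_data()        # split source_df into per-batch CSVs
batcher.train_all_batches()           # train one model per batch
batcher.generate_all_batch_lines()    # generate lines for every batch
synthetic_df = batcher.batches_to_df()  # concatenate back into one DataFrame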

class gretel_synthetics.batch.Batch(checkpoint_dir: str, input_data_path: str, headers: List[str], config: gretel_synthetics.config.TensorFlowConfig, gen_data_count: int = 0)

A representation of a synthetic data workflow. It should not be used directly. This object is created automatically by the primary batch handler, such as DataFrameBatch. This class holds all of the necessary information for training, data generation and DataFrame re-assembly.

add_valid_data(data: gretel_synthetics.generate.GenText)

Take a GenText object and add the generated line to the generated data stream.

get_validator()

If a custom validator is set, we return that. Otherwise, we return the built-in validator, which simply checks if a generated line has the right number of values based on the number of headers for this batch.

This at least ensures that the resulting DataFrame will have the right shape.
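The built-in check is roughly equivalent to this sketch (illustrative only; the real validator uses the batch's configured field delimiter rather than a hard-coded comma):

def default_validator(line: str, headers: list) -> bool:
    # a line is valid when it splits into exactly as many
    # values as the batch has headers
    return len(line.split(",")) == len(headers)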

load_validator_from_file()

Load a saved validation object if it exists.

reset_gen_data()

Reset all objects that accumulate or track synthetic data generation.

set_validator(fn: Callable, save=True)

Assign a validation callable to this batch, optionally pickling and saving the validator so it can be loaded later.

property synthetic_df

Get a DataFrame constructed from the generated lines.

class gretel_synthetics.batch.DataFrameBatch(*, df: pandas.core.frame.DataFrame = None, batch_size: int = 15, batch_headers: List[List[str]] = None, config: Union[dict, gretel_synthetics.config.BaseConfig] = None, tokenizer: gretel_synthetics.tokenizers.BaseTokenizerTrainer = None, mode: str = 'write', checkpoint_dir: str = None)

Create a multi-batch trainer / generator. When created, the directory structure to store models and training data will automatically be created. The directory structure will be created under the “checkpoint_dir” location provided in the config template. There will be one directory per batch, where each directory will be called “batch_N” where N is the batch number, starting from 0.

Training and generation can happen per batch, or all batches can be looped over to run both the training and generation functions.

Example

When creating this object, you must explicitly create the training data from the input DataFrame before training models:

my_batch = DataFrameBatch(df=my_df, config=my_config)
my_batch.create_training_data()
my_batch.train_all_batches()
Parameters
  • df – The input, source DataFrame

  • batch_size – If batch_headers is not provided, the columns of the source DataFrame are automatically broken up into batches of this size.

  • batch_headers – A list of lists of strings can be provided, which will control the number of batches. The number of inner lists is the number of batches, and each inner list represents the columns that belong to that batch (see the sketch after the note below).

  • config – A template training config to use; this will be used as kwargs for each Batch’s synthetic configuration. This may also be a subclass of BaseConfig. If a config instance is used, you can set the input_data_path param to the constant PATH_HOLDER, since each batch sets its own input path after creating its training data.

  • tokenizer – An optional BaseTokenizerTrainer subclass. If not provided, the default tokenizer for the underlying ML engine will be used.

Note

When providing a config, the source of training data is not necessary, only the checkpoint_dir is needed. Each batch will control its input training data path after it creates the training dataset.
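For example, a sketch of controlling batching explicitly with batch_headers (the column names here are illustrative):

batch_headers = [
    ["first_name", "last_name"],     # batch 0
    ["age", "income", "zip_code"],   # batch 1
]

batcher = DataFrameBatch(
    df=my_df,
    batch_headers=batch_headers,
    config=my_config,
)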

batch_size: int = None

The max number of columns allowed for a single DF batch.

batch_to_df(batch_idx: int) → pandas.core.frame.DataFrame

Extract a synthetic data DataFrame from a single batch.

Parameters

batch_idx – The batch number

Returns

A DataFrame with synthetic data

batches: Dict[int, Batch] = None

A mapping of batch number to Batch objects. The batch number (key) increments from 0 to N-1, where N is the number of batches being used.

batches_to_df() → pandas.core.frame.DataFrame

Convert all batches to a single synthetic data DataFrame.

Returns

A single DataFrame that is the concatenation of all the batch DataFrames.

config: Union[dict, BaseConfig] = None

The template config that will be used for all batches. If a dict is provided, we default to a TensorFlowConfig.

create_training_data()

Split the original DataFrame into N smaller DataFrames. Each smaller DataFrame will have the same number of rows, but a subset of the columns from the original DataFrame.

This method iterates over each Batch object and assigns a smaller training DataFrame to the training_df attribute of the object.

Finally, a training CSV is written to disk in that batch’s directory.
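The column chunking behaves roughly like this sketch (illustrative only; the library performs this internally):

headers = list(source_df.columns)
batch_headers = [
    headers[i : i + batch_size]
    for i in range(0, len(headers), batch_size)
]
# e.g. 45 columns with batch_size=15 -> 3 batches of 15 columns each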

generate_all_batch_lines(max_invalid=1000, raise_on_failed_batch: bool = False, num_lines: int = None, seed_fields: Union[dict, List[dict]] = None, parallelism: int = 0) → dict

Generate synthetic lines for all batches. Lines for each batch are added to the individual Batch objects. Once generation is done, you may re-assemble the dataset into a DataFrame.

Example:

my_batch.generate_all_batch_lines()
# Wait for all generation to complete
synthetic_df = my_batch.batches_to_df()
Parameters
  • max_invalid – The maximum number of invalid lines allowed per batch. If this number is exceeded for any batch, generation will stop.

  • raise_on_failed_batch – If True, then an exception will be raised if any single batch fails to generate the requested number of lines. If False, then the failed batch will be set to False in the result dictionary from this method.

  • num_lines

    The number of lines to create from each batch. If None then the value from the config template will be used.

    Note

Will be overridden / ignored if seed_fields is a list, in which case it is set to the length of that list.

  • seed_fields

    A dictionary that maps field/column names to initial seed values for those columns. This seed will only apply to the first batch that gets trained and generated. Additionally, the fields provided in the mapping MUST exist at the front of the first batch.

    Note

This param may also be a list of dicts. If this is the case, then num_lines will automatically be set to the list length downstream, and a 1:1 ratio will be used for generating valid lines for each prefix (see the usage sketch below).

  • parallelism – The number of concurrent workers to use. The default of 0 uses as many workers as there are CPUs; 1 disables parallelization; other non-positive values mean “number of CPUs + x”. A floating-point value is interpreted as a fraction of the available CPUs, rounded down.

Returns

A dictionary mapping batch number to a bool that indicates whether each batch was able to generate the full number of requested lines:

{
    0: True,
    1: True
}
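A usage sketch combining seed_fields with the returned status dictionary (the seed column name is illustrative and must be at the front of the first batch):

statuses = my_batch.generate_all_batch_lines(
    max_invalid=5000,
    num_lines=1000,
    seed_fields={"payment_type": "visa"},  # illustrative seed column
)

failed = [idx for idx, ok in statuses.items() if not ok]
if failed:
    raise RuntimeError(f"Batches failed to fully generate: {failed}")

synthetic_df = my_batch.batches_to_df()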

generate_batch_lines(batch_idx: int, max_invalid=1000, raise_on_exceed_invalid: bool = False, num_lines: int = None, seed_fields: Union[dict, List[dict]] = None, parallelism: int = 0) → bool

Generate lines for a single batch. Lines generated are added to the underlying Batch object for each batch. The lines can be accessed after generation and re-assembled into a DataFrame.

Parameters
  • batch_idx – The batch number

  • max_invalid – The max number of invalid lines that can be generated, if this is exceeded, generation will stop

  • raise_on_exceed_invalid – If True and the number of invalid lines generated exceeds max_invalid, we will re-raise the error thrown by the generation module, interrupting the running process. Otherwise, the caught exception is suppressed and False is returned, indicating that the batch failed to generate all requested lines.

  • num_lines – The number of lines to generate; if None, the number from the batch’s config is used.

  • seed_fields

    A dictionary that maps field/column names to initial seed values for those columns. This seed will only apply to the first batch that gets trained and generated. Additionally, the fields provided in the mapping MUST exist at the front of the first batch.

    Note

    This param may also be a list of dicts. If this is the case, then num_lines will automatically be set to the list length downstream, and a 1:1 ratio will be used for generating valid lines for each prefix.

  • parallelism – The number of concurrent workers to use. The default of 0 uses as many workers as there are CPUs; 1 disables parallelization; other non-positive values mean “number of CPUs + x”. A floating-point value is interpreted as a fraction of the available CPUs, rounded down.

master_header_list: List[str] = None

During training, this is the original column order. When reading from disk, we concatenate all headers from all batches together. This list is not guaranteed to preserve the original header order.

original_headers: List[str] = None

Stores the original header list / order from the original training data that was used. This is written out to the model directory during training and loaded back in when using read-only mode.
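Because master_header_list may not preserve ordering, one option (a sketch, assuming all original columns survive re-assembly) is to re-index the final DataFrame with original_headers:

synthetic_df = my_batch.batches_to_df()
if my_batch.original_headers:
    # restore the source column order
    synthetic_df = synthetic_df[my_batch.original_headers]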

set_batch_validator(batch_idx: int, validator: Callable)

Set a validator for a specific batch. If a validator is configured for a batch, each generated record from that batch will be sent to the validator.

Parameters
  • batch_idx – The batch number.

  • validator – A callable that should take exactly one argument, which will be the raw line generated from the generate_text function.
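For example, a minimal validator sketch (the field count and batch index are illustrative):

def validate_record(line: str):
    # raise (or return False) to mark a generated line invalid
    if len(line.split(",")) != 5:
        raise ValueError("record does not have 5 fields")

my_batch.set_batch_validator(0, validate_record)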

train_all_batches()

Train a model for each batch.

train_batch(batch_idx: int)

Train a model for a single batch. All model information will be written into that batch’s directory.

Parameters

batch_idx – The index of the batch, from the batches dictionary
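Per-batch training and generation can also be driven manually, as in this sketch:

import pandas as pd

frames = []
for idx in my_batch.batches:
    my_batch.train_batch(idx)
    if my_batch.generate_batch_lines(idx):
        frames.append(my_batch.batch_to_df(idx))

# column-wise concatenation mirrors how the batches partition the columns
synthetic_df = pd.concat(frames, axis=1)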