Batch

This module allows automatic splitting of a DataFrame into smaller DataFrames (by clusters of columns) and training a model and generating text on each sub-DF independently. Each sub-DF can then be concatenated back into one final synthetic dataset.

For example usage, please see our Jupyter Notebook.
class gretel_synthetics.batch.Batch(checkpoint_dir: str, input_data_path: str, headers: List[str], config: gretel_synthetics.config.TensorFlowConfig, gen_data_count: int = 0)

    A representation of a synthetic data workflow. It should not be used directly. This object is created automatically by the primary batch handler, such as DataFrameBatch. This class holds all of the necessary information for training, data generation, and DataFrame re-assembly.
    add_valid_data(data: gretel_synthetics.generate.GenText)
        Take a gen_text object and add the generated line to the generated data stream.
    get_validator()
        If a custom validator is set, we return that. Otherwise, we return the built-in validator, which simply checks whether a generated line has the right number of values based on the number of headers for this batch. This at least makes sure the resulting DataFrame will be the right shape.
    load_validator_from_file()
        Load a saved validation object if it exists.
    reset_gen_data()
        Reset all objects that accumulate or track synthetic data generation.
    set_validator(fn: Callable, save=True)
        Assign a validation callable to this batch, optionally pickling and saving the validator for later loading.
    property synthetic_df
        Get a DataFrame constructed from the generated lines.
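    Batch objects are normally reached through the batches mapping of a DataFrameBatch rather than constructed by hand. A minimal sketch, assuming my_batch is an existing DataFrameBatch whose batches have already generated data:

        first_batch = my_batch.batches[0]   # the Batch covering the first column cluster
        df0 = first_batch.synthetic_df      # DataFrame built from its generated lines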
class gretel_synthetics.batch.DataFrameBatch(*, df: pandas.core.frame.DataFrame = None, batch_size: int = 15, batch_headers: List[List[str]] = None, config: Union[dict, gretel_synthetics.config.BaseConfig] = None, tokenizer: gretel_synthetics.tokenizers.BaseTokenizerTrainer = None, mode: str = 'write', checkpoint_dir: str = None)

    Create a multi-batch trainer / generator. When created, the directory structure to store models and training data will automatically be created. The directory structure will be created under the "checkpoint_dir" location provided in the config template. There will be one directory per batch, where each directory will be called "batch_N", with N being the batch number, starting from 0.

    Training and generating can happen per-batch, or we can loop over all batches to do both the train and generation functions.

    Example

    When creating this object, you must explicitly create the training data from the input DataFrame before training models:

        my_batch = DataFrameBatch(df=my_df, config=my_config)
        my_batch.create_training_data()
        my_batch.train_all_batches()
    Parameters
        df – The input, source DataFrame

        batch_size – If batch_headers is not provided, we automatically break up the number of columns in the source DataFrame into batches of N columns.

        batch_headers – A list of lists of strings may be provided, which will control the number of batches. The number of inner lists is the number of batches, and each inner list represents the columns that belong to that batch.

        config – A template training config to use; this will be used as kwargs for each Batch's synthetic configuration. This may also be a subclass of BaseConfig. If this is used, you can set the input_data_path param to the constant PATH_HOLDER, as it does not really matter.

        tokenizer_class – An optional BaseTokenizerTrainer subclass. If not provided, the default tokenizer will be used for the underlying ML engine.

    Note

    When providing a config, the source of training data is not necessary; only the checkpoint_dir is needed. Each batch will control its input training data path after it creates the training dataset.
    batch_size: int = None
        The max number of columns allowed for a single DF batch.
    batch_to_df(batch_idx: int) → pandas.core.frame.DataFrame
        Extract a synthetic data DataFrame from a single batch.

        Parameters
            batch_idx – The batch number

        Returns
            A DataFrame with synthetic data
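        For example (a small sketch, assuming my_batch is the DataFrameBatch from the example above and batch 0 has finished generating):

            df_batch_0 = my_batch.batch_to_df(0)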
    batches: Dict[int, Batch] = None
        A mapping of batch number to Batch object. The batch number (key) increments from 0..N, where N is the number of batches being used.
    batches_to_df() → pandas.core.frame.DataFrame
        Convert all batches to a single synthetic data DataFrame.

        Returns
            A single DataFrame that is the concatenation of all the batch DataFrames.
    config: Union[dict, BaseConfig] = None
        The template config that will be used for all batches. If a dict is provided, we default to a TensorFlowConfig.
    create_training_data()
        Split the original DataFrame into N smaller DataFrames. Each smaller DataFrame will have the same number of rows, but a subset of the columns from the original DataFrame.

        This method iterates over each Batch object and assigns a smaller training DataFrame to the training_df attribute of the object.

        Finally, a training CSV is written to disk in the specific batch directory.
    generate_all_batch_lines(max_invalid=1000, raise_on_failed_batch: bool = False, num_lines: int = None, seed_fields: Union[dict, List[dict]] = None, parallelism: int = 0) → dict
        Generate synthetic lines for all batches. Lines for each batch are added to the individual Batch objects. Once generation is done, you may re-assemble the dataset into a DataFrame.

        Example:

            my_batch.generate_all_batch_lines()  # wait for all generation to complete
            synthetic_df = my_batch.batches_to_df()
        Parameters
            max_invalid – The number of invalid lines allowed per batch. If this number is exceeded for any batch, generation will stop.

            raise_on_failed_batch – If True, an exception will be raised if any single batch fails to generate the requested number of lines. If False, the failed batch will be set to False in the result dictionary returned from this method.

            num_lines – The number of lines to create from each batch. If None, the value from the config template will be used.

                Note: This will be overridden / ignored if seed_fields is a list; it will be set to the length of that list.

            seed_fields – A dictionary that maps field/column names to initial seed values for those columns. This seed will only apply to the first batch that gets trained and generated. Additionally, the fields provided in the mapping MUST exist at the front of the first batch. A short example using seed_fields follows the Returns section below.

                Note: This param may also be a list of dicts. If this is the case, then num_lines will automatically be set to the list length downstream, and a 1:1 ratio will be used for generating valid lines for each prefix.

            parallelism – The number of concurrent workers to use. 1 (the default) disables parallelization, while a non-positive value means "number of CPUs + x" (i.e., use 0 to run as many workers as there are CPUs). A floating-point value is interpreted as a fraction of the available CPUs, rounded down.
        Returns
            A dictionary mapping batch number to a bool value that shows whether each batch was able to generate the full number of requested lines:

                {
                    0: True,
                    1: True
                }
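        For example, a hedged sketch that seeds generation and then checks the per-batch statuses (the "customer_state" column name is hypothetical and must match one of the leading columns of the first batch):

            statuses = my_batch.generate_all_batch_lines(
                num_lines=500,
                max_invalid=1000,
                seed_fields={"customer_state": "CA"},
                parallelism=0,  # use as many workers as there are CPUs
            )
            if all(statuses.values()):
                synthetic_df = my_batch.batches_to_df()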
    generate_batch_lines(batch_idx: int, max_invalid=1000, raise_on_exceed_invalid: bool = False, num_lines: int = None, seed_fields: Union[dict, List[dict]] = None, parallelism: int = 0) → bool
        Generate lines for a single batch. Lines generated are added to the underlying Batch object for each batch. The lines can be accessed after generation and re-assembled into a DataFrame.

        Parameters
            batch_idx – The batch number

            max_invalid – The max number of invalid lines that can be generated; if this is exceeded, generation will stop.

            raise_on_exceed_invalid – If True and the number of invalid lines generated exceeds the max_invalid amount, we will re-raise the error thrown by the generation module, which will interrupt the running process. Otherwise, we will not raise the caught exception and will simply return False, indicating that the batch failed to generate all lines.

            num_lines – The number of lines to generate; if None, we use the number from the batch's config.

            seed_fields – A dictionary that maps field/column names to initial seed values for those columns. This seed will only apply to the first batch that gets trained and generated. Additionally, the fields provided in the mapping MUST exist at the front of the first batch.

                Note: This param may also be a list of dicts. If this is the case, then num_lines will automatically be set to the list length downstream, and a 1:1 ratio will be used for generating valid lines for each prefix.

            parallelism – The number of concurrent workers to use. 1 (the default) disables parallelization, while a non-positive value means "number of CPUs + x" (i.e., use 0 to run as many workers as there are CPUs). A floating-point value is interpreted as a fraction of the available CPUs, rounded down.
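        A small sketch of generating batch by batch, assuming my_batch has already been trained; iterating over the batches dictionary yields each batch number:

            for idx in my_batch.batches:
                completed = my_batch.generate_batch_lines(idx, max_invalid=500)
                if not completed:
                    print(f"batch {idx} did not generate all requested lines")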
    master_header_list: List[str] = None
        During training, this is the original column order. When reading from disk, we concatenate all headers from all batches together. This list is not guaranteed to preserve the original header order.
    original_headers: List[str] = None
        Stores the original header list / order from the original training data that was used. This is written out to the model directory during training and loaded back in when using read-only mode.
    set_batch_validator(batch_idx: int, validator: Callable)
        Set a validator for a specific batch. If a validator is configured for a batch, each generated record from that batch will be sent to the validator.

        Parameters
            batch_idx – The batch number

            validator – A callable that should take exactly one argument, which will be the raw line generated from the generate_text function.
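        For example, a sketch of a simple validator for batch 0. It assumes the raw line is a comma-delimited string and that an invalid record is signaled by raising an exception (the expected field count of 5 is hypothetical):

            def validate_record(line: str) -> None:
                # hypothetical check: batch 0 is expected to have 5 columns
                if len(line.split(",")) != 5:
                    raise ValueError("record does not have the expected number of fields")

            my_batch.set_batch_validator(0, validate_record)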
    train_all_batches()
        Train a model for each batch.
    train_batch(batch_idx: int)
        Train a model for a single batch. All model information will be written into that batch's directory.

        Parameters
            batch_idx – The index of the batch, from the batches dictionary
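        For example, a sketch of training each batch individually instead of calling train_all_batches(), assuming the training data has already been created:

            for idx in my_batch.batches:
                my_batch.train_batch(idx)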