Config

This module provides a set of dataclasses that can be used to hold all necessary configuration parameters for training a model and generating data.

For example usage, please see our Jupyter Notebooks.

class gretel_synthetics.config.BaseConfig(max_lines: int = 0, epochs: int = 100, early_stopping: bool = True, early_stopping_patience: int = 5, best_model_metric: str = 'loss', batch_size: int = 64, buffer_size: int = 10000, seq_length: int = 100, embedding_dim: int = 256, rnn_units: int = 256, dropout_rate: float = 0.2, rnn_initializer: str = 'glorot_uniform', field_delimiter: Optional[str] = None, field_delimiter_token: str = '<d>', vocab_size: int = 20000, character_coverage: float = 1.0, pretrain_sentence_count: int = 1000000, max_line_len: int = 2048, dp: bool = False, dp_learning_rate: float = 0.001, dp_noise_multiplier: float = 1.1, dp_l2_norm_clip: float = 1.0, dp_microbatches: int = 256, gen_temp: float = 1.0, gen_chars: int = 0, gen_lines: int = 1000, predict_batch_size: int = 64, save_all_checkpoints: bool = False, save_best_model: bool = True, overwrite: bool = False)

Base dataclass that contains all of the main parameters for training a model and generating data. This base config generally should not be used directly. Instead, you should use one of the subclasses, which are specific to model and checkpoint storage.

Parameters
  • max_lines (optional) – Number of rows of file to read. Useful for training on a subset of large files. If unspecified, max_lines will default to 0 (process all lines).

  • max_line_len (optional) – Maximum line length for input training data. Any lines longer than this length will be ignored. Default is 2048.

  • epochs (optional) – Number of epochs to train the model. An epoch is an iteration over the entire training set provided. For production use cases, 15-50 epochs are recommended. The default is 100 and is intentionally set extra high. By default, early_stopping is also enabled and will stop training epochs once the model is no longer improving.

  • early_stopping (optional) – If enabled, automatically deduce when the model is no longer improving and terminate training, regardless of the number of epochs. Default is True.

  • early_stopping_patience (optional) – Number of epochs to wait when there is no improvement in the model. After this number of epochs, training will terminate. Default is 5.

  • best_model_metric (optional) – The metric used to track when the model is no longer improving. Defaults to the loss value ("loss"); an alternative option is "accuracy". An error will be raised if any other value is used.

  • batch_size (optional) – Number of samples per gradient update. Using larger batch sizes can help make more efficient use of CPU/GPU parallelization, at the cost of memory. If unspecified, batch_size will default to 64.

  • buffer_size (optional) – Buffer size which is used to shuffle elements during training. Default size is 10000.

  • seq_length (optional) – The maximum length sentence we want for a single training input in characters. Note that this setting is different from max_line_len, as seq_length simply affects the length of the training examples passed to the neural network to predict the next token. Default size is 100.

  • embedding_dim (optional) – Vector size for the lookup table used in the neural network Embedding layer, which maps each character's number to a dense vector. Default size is 256.

  • rnn_units (optional) – Positive integer, dimensionality of the output space for LSTM layers. Default size is 256.

  • dropout_rate (optional) – Float between 0 and 1. Fraction of the units to drop for the linear transformation of the inputs. Using a dropout can help to prevent overfitting by ignoring randomly selected neurons during training. 0.2 (20%) is often used as a good compromise between retaining model accuracy and preventing overfitting. Default is 0.2.

  • rnn_initializer (optional) – Initializer for the kernel weights matrix, used for the linear transformation of the inputs. Default is glorot_uniform.

  • optimizer (optional) – Optimizer used by the neural network to minimize loss during training. Currently supported optimizers: Adam, SGD, and Adagrad. Default is Adam.

  • field_delimiter (optional) – Delimiter to use for training on structured data. When specified, the delimiter is passed as a user-specified token to the tokenizer, which can improve synthetic data quality. For unstructured text, leave as None. For structured text such as comma or tab separated values, specify "," or "\t" respectively (see the configuration sketch after this list). Default is None.

  • field_delimiter_token (optional) – User specified token to replace field_delimiter with while annotating data for training the model. Default is <d>.

  • vocab_size (optional) – Pre-determined maximum vocabulary size prior to neural model training, based on subword units such as byte-pair encoding (BPE) and the unigram language model, trained directly from raw sentences. We generally recommend using a large vocabulary size of 20,000 to 50,000. Default is 20000.

  • character_coverage (optional) – The fraction of characters covered by the model. Unknown characters will be replaced with the <unk> tag. Good defaults are 0.995 for languages with rich character sets like Japanese or Chinese, and 1.0 for other languages or machine data. Default is 1.0.

  • pretrain_sentence_count (optional) – The number of lines spm_train first loads. Remaining lines are simply discarded. Since spm_train loads the entire corpus into memory, this size will depend on the memory available on the machine. It also affects training time. Default is 1000000.

  • dp (optional) – If True, train model with differential privacy enabled. This setting provides assurances that the models will encode general patterns in data rather than facts about specific training examples. These additional guarantees can usefully strengthen the protections offered for sensitive data and content, at a small loss in model accuracy and synthetic data quality. The differential privacy epsilon and delta values will be printed when training completes. Default is False.

  • dp_learning_rate (optional) – The higher the learning rate, the more each update during training matters. If the updates are noisy (such as when the additive noise is large compared to the clipping threshold), a low learning rate may help with training. Default is 0.001.

  • dp_noise_multiplier (optional) – The amount of noise sampled and added to gradients during training. Generally, more noise results in better privacy, at the expense of model accuracy. Default is 1.1.

  • dp_l2_norm_clip (optional) – The maximum Euclidean (L2) norm of each gradient that is applied to update model parameters. This hyperparameter bounds the optimizer's sensitivity to individual training points. Default is 1.0.

  • dp_microbatches (optional) – Each batch of data is split into smaller units called micro-batches. Computational overhead can be reduced by increasing the size of micro-batches to include more than one training example. The number of micro-batches should divide evenly into the overall batch_size. Default is 256.

  • gen_temp (optional) – Controls the randomness of predictions by scaling the logits before applying softmax. Low temperatures result in more predictable text. Higher temperatures result in more surprising text. Experiment to find the best setting. Default is 1.0.

  • gen_chars (optional) – Maximum number of characters to generate per line. Default is 0 (no limit).

  • gen_lines (optional) – Maximum number of text lines to generate. This setting is used by generate_text and the optional line_validator to make sure that all lines created by the model pass validation. Default is 1000.

  • predict_batch_size (optional) – How many words to generate in parallel. Higher values may result in increased throughput. The default of 64 should provide reasonable performance for most users.

  • save_all_checkpoints (optional) – Set to True to save a model checkpoint after each epoch, which can be useful for optimal model selection. Set to False to save only the latest checkpoint. Default is False.

  • save_best_model (optional) – Track the best version of the model (checkpoint). If save_all_checkpoints is disabled, the saved model will be overwritten by newer ones only if they are better. Default is True.

  • overwrite (optional) – If False, the trainer will generate an error if checkpoints exist in the model directory. Default is False.
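
As a concrete illustration, here is a minimal sketch that overrides a handful of the defaults described above for comma-separated training data with differential privacy enabled. It uses the LocalConfig subclass documented below, since BaseConfig should not be instantiated directly; the file paths are placeholders.

    from gretel_synthetics.config import LocalConfig

    # A minimal sketch: CSV training data with differential privacy enabled.
    # The paths below are placeholders; point them at your own files.
    config = LocalConfig(
        checkpoint_dir="/tmp/checkpoints",    # where checkpoints and support files go
        input_data_path="/tmp/training.csv",  # source training data
        field_delimiter=",",                  # structured (comma-separated) input
        epochs=30,                            # 15-50 recommended for production
        early_stopping=True,                  # stop once the model stops improving
        dp=True,                              # train with differential privacy
        dp_noise_multiplier=1.1,
        dp_l2_norm_clip=1.0,
        dp_microbatches=64,                   # divides evenly into batch_size (64)
    )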

class gretel_synthetics.config.LocalConfig(paths: gretel_synthetics.config._PathSettings = <factory>, max_lines: int = 0, epochs: int = 100, early_stopping: bool = True, early_stopping_patience: int = 5, best_model_metric: str = 'loss', batch_size: int = 64, buffer_size: int = 10000, seq_length: int = 100, embedding_dim: int = 256, rnn_units: int = 256, dropout_rate: float = 0.2, rnn_initializer: str = 'glorot_uniform', field_delimiter: Optional[str] = None, field_delimiter_token: str = '<d>', vocab_size: int = 20000, character_coverage: float = 1.0, pretrain_sentence_count: int = 1000000, max_line_len: int = 2048, dp: bool = False, dp_learning_rate: float = 0.001, dp_noise_multiplier: float = 1.1, dp_l2_norm_clip: float = 1.0, dp_microbatches: int = 256, gen_temp: float = 1.0, gen_chars: int = 0, gen_lines: int = 1000, predict_batch_size: int = 64, save_all_checkpoints: bool = False, save_best_model: bool = True, overwrite: bool = False, checkpoint_dir: str = None, input_data_path: str = None)

This configuration will use the local file system to store all models, training data, and checkpoints.

Parameters
  • checkpoint_dir – The local directory where all checkpoints and additional support files for training and generation will be stored.

  • input_data_path – A path to a file that will be used as initial training input. This file will be opened, annotated, and then written out to a path that is generated based on the checkpoint_dir.
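
A minimal end-to-end flow might then look like the sketch below. The train_rnn and generate_text entry points are assumed to come from the accompanying gretel_synthetics.train and gretel_synthetics.generate modules; exact names can vary between versions, so see the Jupyter Notebooks referenced above for authoritative usage.

    from gretel_synthetics.config import LocalConfig
    from gretel_synthetics.generate import generate_text
    from gretel_synthetics.train import train_rnn

    config = LocalConfig(
        checkpoint_dir="/tmp/checkpoints",
        input_data_path="/tmp/training.csv",  # placeholder path
        field_delimiter=",",
    )

    # Annotates the input data, trains the model, and saves checkpoints.
    train_rnn(config)

    # Generate synthetic lines; gen_lines in the config caps the total count.
    for line in generate_text(config):
        print(line.text)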