Config

This module provides a set of dataclasses that can be used to hold all necessary configuration parameters for training a model and generating data.

For example usage, please see our Jupyter Notebooks.

class gretel_synthetics.config.BaseConfig(max_lines: int = 0, epochs: int = 100, early_stopping: bool = True, early_stopping_patience: int = 5, best_model_metric: str = 'loss', batch_size: int = 64, buffer_size: int = 10000, seq_length: int = 100, embedding_dim: int = 256, rnn_units: int = 256, dropout_rate: float = 0.2, rnn_initializer: str = 'glorot_uniform', field_delimiter: Optional[str] = None, field_delimiter_token: str = '<d>', vocab_size: int = 20000, character_coverage: float = 1.0, pretrain_sentence_count: int = 1000000, max_line_len: int = 2048, dp: bool = False, dp_learning_rate: float = 0.001, dp_noise_multiplier: float = 1.1, dp_l2_norm_clip: float = 1.0, dp_microbatches: int = 256, gen_temp: float = 1.0, gen_chars: int = 0, gen_lines: int = 1000, predict_batch_size: int = 64, save_all_checkpoints: bool = False, save_best_model: bool = True, overwrite: bool = False)

Base dataclass that contains all of the main parameters for training a model and generating data. This base config generally should not be used directly. Instead, you should use one of the subclasses, which are specific to model and checkpoint storage.

Parameters
  • max_lines (optional) – Number of rows of file to read. Useful for training on a subset of large files. If unspecified, max_lines will default to 0 (process all lines).

  • max_line_len (optional) – Maximum line length for input training data. Any lines longer than this length will be ignored. Default is 2048.

  • epochs (optional) – Number of epochs to train the model. An epoch is an iteration over the entire training set provided. For production use cases, 15-50 epochs are recommended. The default is 100 and is intentionally set extra high. By default, early_stopping is also enabled and will stop training epochs once the model is no longer improving.

  • early_stopping (optional) – If enabled, automatically deduce when the model is no longer improving and terminate training, regardless of the number of epochs. Default is True.

  • early_stopping_patience (optional) – Number of epochs to wait when there is no improvement in the model. After this number of epochs, training will terminate. Default is 5.

  • best_model_metric (optional) – The metric used to track when the model is no longer improving. Defaults to the loss value ("loss"); an alternative option is "accuracy". An error will be raised if any other value is used.

  • batch_size (optional) – Number of samples per gradient update. Using larger batch sizes can help make more efficient use of CPU/GPU parallelization, at the cost of memory. If unspecified, batch_size will default to 64.

  • buffer_size (optional) – Buffer size which is used to shuffle elements during training. Default size is 10000.

  • seq_length (optional) – The maximum length sentence we want for a single training input in characters. Note that this setting is different from max_line_len, as seq_length simply affects the length of the training examples passed to the neural network to predict the next token. Default size is 100.

  • embedding_dim (optional) – Vector size for the lookup table used in the neural network Embedding layer, which maps each character's number to a dense vector. Default size is 256.

  • rnn_units (optional) – Positive integer, dimensionality of the output space for LSTM layers. Default size is 256.

  • dropout_rate (optional) – Float between 0 and 1. Fraction of the units to drop for the linear transformation of the inputs. Using a dropout can help to prevent overfitting by ignoring randomly selected neurons during training. 0.2 (20%) is often used as a good compromise between retaining model accuracy and preventing overfitting. Default is 0.2.

  • rnn_initializer (optional) – Initializer for the kernel weights matrix, used for the linear transformation of the inputs. Default is glorot_uniform.

  • optimizer (optional) – Optimizer used by the neural network to minimize loss during training. Currently supported optimizers: Adam, SGD, and Adagrad. Default is Adam.

  • field_delimiter (optional) – Delimiter to use for training on structured data. When specified, the delimiter is passed as a user-specified token to the tokenizer, which can improve synthetic data quality. For unstructured text, leave as None. For structured text such as comma or tab separated values, specify "," or "\t" respectively (see the configuration sketch after this list). Default is None.

  • field_delimiter_token (optional) – User specified token to replace field_delimiter with while annotating data for training the model. Default is <d>.

  • vocab_size (optional) – Pre-determined maximum vocabulary size prior to neural model training, based on subword units such as byte-pair encoding (BPE) and the unigram language model, trained directly from raw sentences. We generally recommend using a large vocabulary size of 20,000 to 50,000. Default is 20000.

  • character_coverage (optional) – The fraction of characters covered by the model. Unknown characters will be replaced with the <unk> tag. Good defaults are 0.995 for languages with rich character sets like Japanese or Chinese, and 1.0 for other languages or machine data. Default is 1.0.

  • pretrain_sentence_count (optional) – The number of lines spm_train first loads. Remaining lines are simply discarded. Since spm_train loads the entire corpus into memory, this size will depend on the memory available on the machine. It also affects training time. Default is 1000000.

  • dp (optional) – If True, train model with differential privacy enabled. This setting provides assurances that the models will encode general patterns in data rather than facts about specific training examples. These additional guarantees can usefully strengthen the protections offered for sensitive data and content, at a small loss in model accuracy and synthetic data quality. The differential privacy epsilon and delta values will be printed when training completes. Default is False.

  • dp_learning_rate (optional) – The higher the learning rate, the more each update during training matters. If the updates are noisy (such as when the additive noise is large compared to the clipping threshold), a low learning rate may help with training. Default is 0.001.

  • dp_noise_multiplier (optional) – The amount of noise sampled and added to gradients during training. Generally, more noise results in better privacy, at the expense of model accuracy. Default is 1.1.

  • dp_l2_norm_clip (optional) – The maximum Euclidean (L2) norm of each gradient that is applied to update model parameters. This hyperparameter bounds the optimizer's sensitivity to individual training points. Default is 1.0.

  • dp_microbatches (optional) – Each batch of data is split into smaller units called micro-batches. Computational overhead can be reduced by increasing the size of micro-batches to include more than one training example. The number of micro-batches should divide evenly into the overall batch_size. Default is 256.

  • gen_temp (optional) – Controls the randomness of predictions by scaling the logits before applying softmax. Low temperatures result in more predictable text. Higher temperatures result in more surprising text. Experiment to find the best setting. Default is 1.0.

  • gen_chars (optional) – Maximum number of characters to generate per line. Default is 0 (no limit).

  • gen_lines (optional) – Maximum number of text lines to generate. This setting is used by generate_text and the optional line_validator to make sure that all lines created by the model pass validation. Default is 1000.

  • predict_batch_size (optional) – How many words to generate in parallel. Higher values may result in increased throughput. The default of 64 should provide reasonable performance for most users.

  • save_all_checkpoints (optional) – Set to True to save a model checkpoint after each epoch, which can be useful for optimal model selection. Set to False to save only the latest checkpoint. Default is False.

  • save_best_model (optional) – Track the best version of the model (checkpoint). If save_all_checkpoints is disabled, the saved model will be overwritten by newer ones only if they are better. Default is True.

  • overwrite (optional) – If False, the trainer will generate an error if checkpoints exist in the model directory. Default is False.
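
As a concrete illustration, here is a minimal sketch that overrides a handful of the defaults described above for comma-separated training data with differential privacy enabled. It uses the LocalConfig subclass documented below, since BaseConfig should not be instantiated directly; the file paths are placeholders.

    from gretel_synthetics.config import LocalConfig

    # A minimal sketch: CSV training data with differential privacy enabled.
    # The paths below are placeholders; point them at your own files.
    config = LocalConfig(
        checkpoint_dir="/tmp/checkpoints",    # where checkpoints and support files go
        input_data_path="/tmp/training.csv",  # source training data
        field_delimiter=",",                  # structured (comma-separated) input
        epochs=30,                            # 15-50 recommended for production
        early_stopping=True,                  # stop once the model stops improving
        dp=True,                              # train with differential privacy
        dp_noise_multiplier=1.1,
        dp_l2_norm_clip=1.0,
        dp_microbatches=64,                   # divides evenly into batch_size (64)
    )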

class gretel_synthetics.config.LocalConfig(paths: gretel_synthetics.config._PathSettings = <factory>, max_lines: int = 0, epochs: int = 100, early_stopping: bool = True, early_stopping_patience: int = 5, best_model_metric: str = 'loss', batch_size: int = 64, buffer_size: int = 10000, seq_length: int = 100, embedding_dim: int = 256, rnn_units: int = 256, dropout_rate: float = 0.2, rnn_initializer: str = 'glorot_uniform', field_delimiter: Optional[str] = None, field_delimiter_token: str = '<d>', vocab_size: int = 20000, character_coverage: float = 1.0, pretrain_sentence_count: int = 1000000, max_line_len: int = 2048, dp: bool = False, dp_learning_rate: float = 0.001, dp_noise_multiplier: float = 1.1, dp_l2_norm_clip: float = 1.0, dp_microbatches: int = 256, gen_temp: float = 1.0, gen_chars: int = 0, gen_lines: int = 1000, predict_batch_size: int = 64, save_all_checkpoints: bool = False, save_best_model: bool = True, overwrite: bool = False, checkpoint_dir: str = None, input_data_path: str = None)

This configuration will use the local file system to store all models, training data, and checkpoints.

Parameters
  • checkpoint_dir – The local directory where all checkpoints and additional support files for training and generation will be stored.

  • input_data_path – A path to a file that will be used as initial training input. This file will be opened, annotated, and then written out to a path that is generated based on the checkpoint_dir.
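
A minimal end-to-end flow might then look like the sketch below. The train_rnn and generate_text entry points are assumed to come from the accompanying gretel_synthetics.train and gretel_synthetics.generate modules; exact names can vary between versions, so see the Jupyter Notebooks referenced above for authoritative usage.

    from gretel_synthetics.config import LocalConfig
    from gretel_synthetics.generate import generate_text
    from gretel_synthetics.train import train_rnn

    config = LocalConfig(
        checkpoint_dir="/tmp/checkpoints",
        input_data_path="/tmp/training.csv",  # placeholder path
        field_delimiter=",",
    )

    # Annotates the input data, trains the model, and saves checkpoints.
    train_rnn(config)

    # Generate synthetic lines; gen_lines in the config caps the total count.
    for line in generate_text(config):
        print(line.text)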