Tokenizers

Interface definitions for tokenizers. The classes in this module fall into two abstract types: Trainers and Tokenizers. They are kept separate because the parameters used to train a tokenizer are not necessarily loaded back in and used by a trained tokenizer. Using two types of classes is more explicit, and it also removes any ambiguity about which methods are available for training and which for tokenizing.

Trainers require a specific configuration to be provided. Based on the configuration received, the tokenizer trainer creates the actual training data file that will be used by the downstream training process. For this reason, using at least one of these tokenizer trainers is required for training, since it is the tokenizer trainer's responsibility to create the final training data.

The general process that is followed when using these tokenizers is:

1. Create a trainer instance with the desired parameters, including the config, which is a required parameter.

2. Call the annotate_data method of your tokenizer trainer. Note that this method iterates over the input data line by line, applies any special processing, and then writes a new data file that will be used for the actual training. This new data file is written to the model directory.

3. Call the train method, which creates your tokenization model and saves it to the model directory.

4. Use the load() class method of an actual tokenizer class to load the trained model, which you can then use on input data.
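The four steps above can be sketched with a toy trainer and tokenizer pair. This is a minimal illustration of the workflow only, not the library's implementation: ToyCharTrainer, ToyCharTokenizer, the file names, and the input_path / model_dir arguments (which stand in for the required config) are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


class ToyCharTrainer:
    """Hypothetical stand-in for a tokenizer trainer (not the library class)."""

    def __init__(self, *, input_path: Path, model_dir: Path):
        # In the library these locations come from the config object.
        self.input_path = input_path
        self.model_dir = model_dir
        self.training_path = model_dir / "training_data.txt"  # assumed name

    def annotate_data(self) -> None:
        # Step 2: rewrite the raw input, line by line, into the model directory.
        with open(self.input_path) as src, open(self.training_path, "w") as dst:
            for line in src:
                dst.write(line)

    def train(self) -> None:
        # Step 3: learn a char -> ID mapping and save it in the model directory.
        text = self.training_path.read_text()
        vocab = {ch: idx for idx, ch in enumerate(sorted(set(text)))}
        (self.model_dir / "toy_vocab.json").write_text(json.dumps(vocab))


class ToyCharTokenizer:
    """Hypothetical stand-in for a trained tokenizer (not the library class)."""

    def __init__(self, vocab: dict):
        self.vocab = vocab
        self.inverse = {idx: ch for ch, idx in vocab.items()}

    @classmethod
    def load(cls, model_dir: Path) -> "ToyCharTokenizer":
        # Step 4: restore the trained mapping from the model directory.
        return cls(json.loads((model_dir / "toy_vocab.json").read_text()))

    def encode_to_ids(self, data: str) -> list:
        return [self.vocab[ch] for ch in data]

    def decode_from_ids(self, ids: list) -> str:
        return "".join(self.inverse[i] for i in ids)


with tempfile.TemporaryDirectory() as tmp:
    model_dir = Path(tmp)
    raw = model_dir / "raw.txt"
    raw.write_text("hello\nworld\n")

    trainer = ToyCharTrainer(input_path=raw, model_dir=model_dir)  # step 1
    trainer.annotate_data()                                        # step 2
    trainer.train()                                                # step 3
    tokenizer = ToyCharTokenizer.load(model_dir)                   # step 4

    ids = tokenizer.encode_to_ids("hello")
    round_trip = tokenizer.decode_from_ids(ids)
```

The trainer and the tokenizer share nothing except the files in the model directory, which is the separation the two abstract types are meant to enforce.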

class gretel_synthetics.tokenizers.Base: High level base class for shared class attrs and validation. Should not be used directly.

class gretel_synthetics.tokenizers.BaseTokenizer(model_data: Any, model_dir: str)

Base class for loading a tokenizer from disk. Should not be used directly.

decode_from_ids(ids: List[int]) → str: Given a list of token IDs, convert it back into a single string, reconstructing the original text.

Note

We automatically call a method that can optionally restore any special reserved tokens to their original values (such as field delimiter values).

encode_to_ids(data: str) → List[int]: Given an input string, convert it to a list of token IDs
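As a rough illustration of the encode / decode round trip and of restoring a reserved token on decode, here is a minimal sketch. The delimiter, the reserved token string "<d>", the toy vocabulary, and the restore helper are assumptions made for the example and are not taken from the library.

```python
from typing import List

FIELD_DELIM = ","          # assumed delimiter in the raw data
FIELD_DELIM_TOKEN = "<d>"  # assumed reserved token standing in for it

# Toy vocabulary: the reserved token is one entry, every other entry is a char.
VOCAB = [FIELD_DELIM_TOKEN] + sorted("abc123")
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def encode_to_ids(data: str) -> List[int]:
    # The annotated text already contains the reserved token, so it is
    # split out and mapped to a single ID rather than char by char.
    ids: List[int] = []
    for n, piece in enumerate(data.split(FIELD_DELIM_TOKEN)):
        if n > 0:
            ids.append(TOKEN_TO_ID[FIELD_DELIM_TOKEN])
        ids.extend(TOKEN_TO_ID[ch] for ch in piece)
    return ids


def restore(text: str) -> str:
    # Swap the reserved token back to the original delimiter value.
    return text.replace(FIELD_DELIM_TOKEN, FIELD_DELIM)


def decode_from_ids(ids: List[int]) -> str:
    return restore("".join(VOCAB[i] for i in ids))


annotated = "a1<d>b2<d>c3"
ids = encode_to_ids(annotated)
decoded = decode_from_ids(ids)  # "a1,b2,c3"
```

In this sketch the decoded string differs from the annotated input on purpose: decoding returns the text with the original delimiter, not the reserved token.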

abstract classmethod load(model_dir: str): Given the path to a model directory, load the specific tokenizer model into an instance. Subclasses should implement the logic specific to how they need to load a model back in.

abstract property total_vocab_size: Return the total count of unique tokens in the vocab, specific to the underlying tokenizer to be used.

class gretel_synthetics.tokenizers.BaseTokenizerTrainer(*, config: None, vocab_size: int | None = None)

Base class for training tokenizers. Should not be used directly.

annotate_data() → Iterator[str]

This should be called _before_ training, because the annotated training data must already exist in the model directory.

Read in the configuration's raw input data path and create a file I/O pipeline in which each line of the input data can optionally be routed through an annotation function. Each line is then written out to a training data file as specified by the config.

config: None: A subclass instance of BaseConfig. This will be used to find the input data for tokenization.
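The file I/O pipeline described here could look roughly like the generator below. This is a sketch under assumptions, not the library's code: annotate_lines, its parameters, and the file names are hypothetical, and the paths that the library reads from the config are passed in directly.

```python
import tempfile
from pathlib import Path
from typing import Callable, Iterator, Optional


def annotate_lines(
    input_path: Path,
    training_path: Path,
    annotate: Optional[Callable[[str], str]] = None,
) -> Iterator[str]:
    """Stream input lines through an optional annotation function and
    write each resulting line to the training data file."""
    with open(input_path) as src, open(training_path, "w") as dst:
        for line in src:
            out = annotate(line) if annotate is not None else line
            dst.write(out)
            yield out


with tempfile.TemporaryDirectory() as tmp:
    raw_path = Path(tmp) / "raw.csv"
    train_path = Path(tmp) / "training_data.txt"
    raw_path.write_text("a,b\nc,d\n")

    # Hypothetical annotation: replace the field delimiter with a reserved token.
    lines = list(
        annotate_lines(raw_path, train_path, lambda s: s.replace(",", "<d>"))
    )
    written = train_path.read_text()
```

Because this sketch is a generator, nothing is written until it is iterated; wrapping the call in list() drives the whole pipeline and closes both files.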

data_iterator() → Iterator[str]: Create a generator that will iterate each line of the training data that was created during the annotation step. Synthetic model trainers will most likely need to iterate this to process each line of the annotated training data.

num_lines: int = 0: The number of lines that were processed after annotate_data is called.

train(): Train a tokenizer and save the tokenizer settings to a file located in the model directory specified by the config object

vocab_size: int: The max size of the vocab (tokens) to be extracted from the input dataset.

class gretel_synthetics.tokenizers.CharTokenizer(model_data: Any, model_dir: str)

Load a simple character tokenizer from disk to conduct encoding and decoding operations.

classmethod load(model_dir: str)

Create an instance of this tokenizer.

Parameters:: model_dir – The path to the model directory

property total_vocab_size: Get the number of unique characters (tokens)

class gretel_synthetics.tokenizers.CharTokenizerTrainer(*, config: None, vocab_size: int | None = None)

Train a simple tokenizer that maps every single character to a unique ID. If vocab_size is not specified, the learned vocab size will be the number of unique characters in the training dataset.

Parameters:: vocab_size – Max number of characters to map to tokens.

class gretel_synthetics.tokenizers.SentencePieceColumnTokenizer(sp: SentencePieceProcessor, model_dir: str)

class gretel_synthetics.tokenizers.SentencePieceColumnTokenizerTrainer(col_pattern: str = '<col{}>', **kwargs)

class gretel_synthetics.tokenizers.SentencePieceTokenizer(model_data: Any, model_dir: str)

Load a SentencePiece tokenizer from disk so encoding / decoding can be done

classmethod load(model_dir: str)

Load a SentencePiece tokenizer from a model directory.

Parameters:: model_dir – The model directory.

property total_vocab_size: The number of unique tokens in the model

class gretel_synthetics.tokenizers.SentencePieceTokenizerTrainer(*, character_coverage: float = 1.0, pretrain_sentence_count: int = 1000000, max_line_len: int = 2048, **kwargs)

Train a tokenizer using Google SentencePiece.

character_coverage: float: The amount of characters covered by the model. Unknown characters will be replaced with the <unk> tag. Good defaults are 0.995 for languages with rich character sets like Japanese or Chinese, and 1.0 for other languages or machine data. Default is 1.0.

max_line_line: int: Maximum line length for input training data. Any lines longer than this length will be ignored. Default is 2048.

pretrain_sentence_count: int: The number of lines spm_train first loads. Remaining lines are simply discarded. Since spm_train loads entire corpus into memory, this size will depend on the memory size of the machine. It also affects training time. Default is 1000000.

vocab_size: int: Pre-determined maximum vocabulary size prior to neural model training, based on subword units including byte-pair-encoding (BPE) and unigram language model, with the extension of direct training from raw sentences. We generally recommend using a large vocabulary size of 20,000 to 50,000. Default is 20000.

exception gretel_synthetics.tokenizers.TokenizerError

exception gretel_synthetics.tokenizers.VocabSizeTooSmall: Error that is raised when the vocab_size is too small for the given data. This happens when the vocab_size is set to a value that is smaller than the number of required characters.
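The condition behind this error can be illustrated with a simple check. The sketch below is an assumption about how such a check might look, not the library's code: VocabSizeTooSmallSketch and build_char_vocab are hypothetical names, and "required characters" is taken here to mean the unique characters in the data.

```python
from typing import Dict, Optional


class VocabSizeTooSmallSketch(Exception):
    """Hypothetical stand-in for the library's VocabSizeTooSmall exception."""


def build_char_vocab(
    text: str, vocab_size: Optional[int] = None
) -> Dict[str, int]:
    required = sorted(set(text))
    # The requested size cannot be smaller than what the data requires.
    if vocab_size is not None and vocab_size < len(required):
        raise VocabSizeTooSmallSketch(
            f"vocab_size={vocab_size} but the data needs {len(required)} characters"
        )
    return {ch: idx for idx, ch in enumerate(required)}


vocab = build_char_vocab("abca")  # no limit: one ID per unique character

try:
    build_char_vocab("abca", vocab_size=2)
    error_message = None
except VocabSizeTooSmallSketch as err:
    error_message = str(err)
```

Callers that choose vocab_size programmatically may want to catch the real exception and retry with a larger value.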

gretel_synthetics.tokenizers.tokenizer_from_model_dir(model_dir: str) → BaseTokenizer

A factory function that will return a tokenizer instance that can be used for encoding / decoding data. It will try to automatically infer what type of class to use based on the stored tokenizer params in the provided model directory.

If no specific tokenizer type is found, we assume that we are restoring a SentencePiece tokenizer, because the model is from a version <= 0.14.x.

Parameters:: model_dir – A directory that holds synthetic model data.
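The lookup-with-fallback behavior described for this factory can be sketched as follows. Everything here is hypothetical and only illustrates the pattern: the params file name, the "tokenizer_type" key, the registry, and the toy classes are assumptions, and the library's own storage format may differ.

```python
import json
import tempfile
from pathlib import Path

PARAMS_FILE = "tokenizer_params.json"  # hypothetical file name


class ToyCharTokenizer:
    def __init__(self, model_dir: Path):
        self.model_dir = model_dir


class ToySentencePieceTokenizer:
    def __init__(self, model_dir: Path):
        self.model_dir = model_dir


# Map stored type names to the classes that can load them.
REGISTRY = {
    cls.__name__: cls for cls in (ToyCharTokenizer, ToySentencePieceTokenizer)
}


def toy_tokenizer_from_model_dir(model_dir: Path):
    """Pick a tokenizer class from stored params, defaulting to SentencePiece."""
    tokenizer_type = None
    params_path = model_dir / PARAMS_FILE
    if params_path.is_file():
        tokenizer_type = json.loads(params_path.read_text()).get("tokenizer_type")
    if tokenizer_type is None:
        # No stored type: treat the directory as an older SentencePiece model.
        return ToySentencePieceTokenizer(model_dir)
    return REGISTRY[tokenizer_type](model_dir)


with tempfile.TemporaryDirectory() as tmp:
    new_dir = Path(tmp) / "new_model"
    old_dir = Path(tmp) / "old_model"
    new_dir.mkdir()
    old_dir.mkdir()
    (new_dir / PARAMS_FILE).write_text(
        json.dumps({"tokenizer_type": "ToyCharTokenizer"})
    )

    new_tok = toy_tokenizer_from_model_dir(new_dir)  # type read from params
    old_tok = toy_tokenizer_from_model_dir(old_dir)  # falls back to default
```

Keeping the fallback inside the factory means callers do not need to know which version produced a model directory.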