This module provides the functionality to generate synthetic records.
Before using this module you must have already:
Created a config
Trained a model
generate_text(config: None, start_string: str = '<n>', line_validator: Callable = None, max_invalid: int = 1000, num_lines: int = None, parallelism: int = 0)
A generator that will load a model and start creating records.
config – A configuration object, which you must have created previously
start_string – A prefix string that is used to seed the record generation. By default we use a newline, but you may substitute any initial value here, which will influence what the generator predicts. If you are working with a field delimiter and you want to seed more than one column value, then you MUST use the field delimiter specified in your config. An example would be “foo,bar,baz,”. Also, if using a field delimiter, the string MUST end with the delimiter value.
line_validator – An optional callback validator function that takes the raw string value from the generator as its single argument. This validator can execute arbitrary code with the raw string value. The validator function may return a bool to indicate line validity. This boolean value will be set on the yielded gen_text object. Additionally, if the validator throws an exception, the gen_text object will be marked as having failed validation. If the validator returns None, we will assume successful validation.
max_invalid – If using a line_validator, this is the maximum number of invalid lines to generate. If the number of invalid lines exceeds this value, a RuntimeError will be raised.
num_lines – If not None, this will override the gen_lines value that is provided in the config.
parallelism – The number of concurrent workers to use. 1 (the default) disables parallelization, while a non-positive value means “number of CPUs + x” (i.e., use 0 to run as many workers as there are CPUs). A floating-point value is interpreted as a fraction of the available CPUs, rounded down.
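The parallelism rules above can be sketched as a small helper. This is only an illustration of the described behavior, not the library's own code: the name resolve_workers, the cpu_count argument, and the floor of one worker are assumptions made for this example.

```python
import math
import os
from typing import Optional, Union


def resolve_workers(parallelism: Union[int, float],
                    cpu_count: Optional[int] = None) -> int:
    """Hypothetical helper: turn a parallelism setting into a worker count."""
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1

    if isinstance(parallelism, float):
        # A float is a fraction of the available CPUs, rounded down.
        workers = math.floor(parallelism * cpu_count)
    elif parallelism <= 0:
        # Non-positive x means "number of CPUs + x" (0 uses every CPU).
        workers = cpu_count + parallelism
    else:
        # A positive integer is taken as the worker count itself.
        workers = parallelism

    # Assumption for this sketch: never drop below one worker.
    return max(1, workers)
```

On a machine with 8 CPUs, this sketch maps 0 to 8 workers, -2 to 6 workers, and 0.5 to 4 workers.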
Simple validator example:
    def my_validator(raw_line: str):
        parts = raw_line.split(',')
        if len(parts) != 5:
            raise Exception('record does not have 5 fields')
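To make the validator contract concrete, the following sketch shows one way the three outcomes (a bool, None, or an exception) could be reduced to a single validity flag. The helper check_line is hypothetical and exists only for this illustration; the library's internal handling may differ.

```python
from typing import Callable, Optional


def my_validator(raw_line: str) -> None:
    # Same validator as in the example: raise unless there are 5 fields.
    parts = raw_line.split(',')
    if len(parts) != 5:
        raise Exception('record does not have 5 fields')


def check_line(raw_line: str,
               line_validator: Callable[[str], Optional[bool]]) -> bool:
    """Hypothetical helper: interpret a validator's outcome as valid/invalid."""
    try:
        result = line_validator(raw_line)
    except Exception:
        # An exception from the validator counts as a failed validation.
        return False
    if result is None:
        # Returning None is treated as successful validation.
        return True
    return bool(result)
```

Because my_validator returns None on success and raises on failure, check_line("a,b,c,d,e", my_validator) is True and check_line("a,b", my_validator) is False.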
The gen_lines value in your config is important for this function. If a line validator is not provided, every line counts towards the total number of generated lines, and we stop when that total is >= gen_lines. If a line validator is provided, only valid lines count towards the total, and we stop when the number of valid lines generated is >= gen_lines.
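The counting rule can be sketched with a stand-in source of raw lines in place of a trained model. The function take_lines and its (line, valid) tuples are hypothetical names for this illustration; the sketch also assumes that invalid lines are still yielded with their flag set to False, and that the error is raised as soon as the invalid count exceeds max_invalid.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


def take_lines(raw_lines: Iterable[str],
               gen_lines: int,
               line_validator: Optional[Callable[[str], Optional[bool]]] = None,
               max_invalid: int = 1000) -> Iterator[Tuple[str, Optional[bool]]]:
    """Hypothetical sketch of the stop condition described above."""
    counted = 0   # lines that count towards gen_lines
    invalid = 0   # lines rejected by the validator
    for raw in raw_lines:
        if counted >= gen_lines:
            return
        if line_validator is None:
            # Without a validator, every line counts.
            counted += 1
            yield raw, None
            continue
        try:
            result = line_validator(raw)
            valid = True if result is None else bool(result)
        except Exception:
            valid = False
        if valid:
            counted += 1
        else:
            invalid += 1
            if invalid > max_invalid:
                raise RuntimeError('maximum number of invalid lines exceeded')
        yield raw, valid
```

With a validator that accepts only two-field lines and gen_lines set to 2, the input ["a,b", "bad", "c,d", "e,f"] yields three items: two valid lines and one invalid line that did not count towards the total.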
The gen_chars value controls the maximum number of characters a single generated line can have. If a newline character has not been generated before reaching this number, then the line will be returned as it stands. For example, if gen_chars is 180 and a newline has not been generated, the line will be returned once 180 characters have been created, no matter what. Note that if this value is 0, each line will keep generating until a newline is observed.
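The gen_chars cutoff can be sketched with a plain character stream standing in for the model's predictions. The function read_line is a hypothetical name for this illustration and says nothing about how the library actually produces characters.

```python
from typing import Iterable


def read_line(chars: Iterable[str], gen_chars: int = 0) -> str:
    """Hypothetical sketch: collect characters until a newline or the cap."""
    buf = []
    for ch in chars:
        if ch == "\n":
            # A newline always ends the line.
            break
        buf.append(ch)
        if gen_chars > 0 and len(buf) >= gen_chars:
            # The cap was reached before any newline: return what we have.
            break
    return "".join(buf)
```

With gen_chars set to 3, the stream "abcdefgh" is cut to "abc"; with gen_chars set to 0, the stream "hello\nworld" produces "hello" because only the newline ends the line.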
Yields a gen_text object for each record that is generated. The generator will stop after the maximum number of lines is reached (based on your config).
Raises a RuntimeError if the number of invalid lines generated exceeds max_invalid.