This module provides the functionality to generate synthetic records.
Before using this module you must have already:
Created a config
Trained a model
generate_text(config: None, start_string: str = '<n>', line_validator: Callable = None, max_invalid: int = 1000, num_lines: int = None, parallelism: int = 0)
A generator that will load a model and start creating records.
config – A configuration object, which you must have created previously
start_string – A prefix string that is used to seed the record generation. By default we use a newline, but you may substitute any initial value here, which will influence what the generator predicts next.
line_validator – An optional callback validator function that takes the raw string value from the generator as its single argument. The validator can execute arbitrary code with the raw string value. It may return a bool to indicate line validity, and this boolean value will be set on the yielded gen_text object. If the validator throws an exception, the gen_text object will be marked as having failed validation. If the validator returns None, we assume successful validation.
max_invalid – If using a line_validator, this is the maximum number of invalid lines to generate. If the number of invalid lines exceeds this value, a RuntimeError will be raised.
num_lines – If not None, this will override the gen_lines value that is provided in the config.
parallelism – The number of concurrent workers to use. A value of 1 disables parallelization, while a non-positive value means "number of CPUs + x" (for example, 0, the default shown in the signature, uses as many workers as there are CPUs). A floating-point value is interpreted as a fraction of the available CPUs, rounded down.
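The parallelism rules can be illustrated with a small sketch. This is not the library's implementation: resolve_workers and its cpu_count argument are hypothetical names, and clamping the result to at least one worker is an assumption made here.

```python
import math
import os
from typing import Optional, Union


def resolve_workers(parallelism: Union[int, float],
                    cpu_count: Optional[int] = None) -> int:
    """Hypothetical helper: map a parallelism value to a worker count."""
    cpus = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    if isinstance(parallelism, float):
        # A float is read as a fraction of the available CPUs, rounded down.
        workers = math.floor(parallelism * cpus)
    elif parallelism <= 0:
        # A non-positive integer means "number of CPUs + x".
        workers = cpus + parallelism
    else:
        # A positive integer is taken as the worker count itself.
        workers = parallelism
    # Assumption: never go below a single worker.
    return max(1, workers)
```

On a machine with 8 CPUs, this sketch maps 0 to 8 workers, -2 to 6 workers, and 0.5 to 4 workers.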
Simple validator example:
    def my_validator(raw_line: str):
        parts = raw_line.split(',')
        if len(parts) != 5:
            raise Exception('record does not have 5 fields')
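As a rough illustration of how a validator's outcome maps to validity under the rules described for line_validator, the sketch below treats an exception as a failed validation and a None return as success. The helper line_is_valid is hypothetical and is not part of this module.

```python
from typing import Callable, Optional


def my_validator(raw_line: str):
    # Same validator as in the example: raises when the field count is wrong.
    parts = raw_line.split(',')
    if len(parts) != 5:
        raise Exception('record does not have 5 fields')


def line_is_valid(raw_line: str, validator: Optional[Callable] = None) -> bool:
    """Hypothetical helper mirroring the documented validation rules."""
    if validator is None:
        # No validator supplied: every line is accepted.
        return True
    try:
        result = validator(raw_line)
    except Exception:
        # An exception from the validator counts as a failed validation.
        return False
    # None means success; otherwise use the returned boolean.
    return True if result is None else bool(result)
```

Here my_validator returns None for a five-field line, so that line counts as valid, while a two-field line raises and is marked invalid.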
The gen_lines value in the config is important for this function. If a line validator is not provided, each line will count towards the total number of generated lines. When the total number of lines generated is >= gen_lines, we stop. If a line validator is provided, only valid lines will count towards the total number of lines generated. When the total number of valid lines generated is >= gen_lines, we stop.
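The counting rules above, together with max_invalid, can be sketched as a loop over candidate lines. This is a simplified model under stated assumptions, not the module's code: take_lines is a hypothetical name, candidate lines come from a plain iterable instead of a trained model, and each result is a (line, valid) tuple instead of a gen_text object.

```python
from typing import Callable, Iterable, Iterator, Optional, Tuple


def take_lines(candidates: Iterable[str],
               gen_lines: int,
               line_validator: Optional[Callable] = None,
               max_invalid: int = 1000) -> Iterator[Tuple[str, bool]]:
    """Hypothetical sketch of the stopping logic."""
    if gen_lines <= 0:
        return
    counted = 0   # lines that count towards gen_lines
    invalid = 0   # lines that failed validation
    for line in candidates:
        valid = True
        if line_validator is not None:
            try:
                result = line_validator(line)
                valid = True if result is None else bool(result)
            except Exception:
                valid = False
        if valid:
            counted += 1
        else:
            invalid += 1
            if invalid > max_invalid:
                raise RuntimeError('maximum number of invalid lines exceeded')
        yield line, valid
        if counted >= gen_lines:
            return
```

Without a validator every line counts, so the loop stops after exactly gen_lines lines. With a validator, invalid lines are still yielded but do not advance the count.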
The gen_chars value controls the maximum number of characters a single generated line can have. If a newline character has not been generated before reaching this number, the line will be returned as it stands. For example, if gen_chars is 180 and a newline has not been generated, the line will be returned once 180 characters have been created. If this value is 0, each line will keep generating until a newline is observed.
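The effect of gen_chars can be shown with a small character-level sketch. It assumes characters arrive one at a time from an iterable standing in for the model; read_line is a hypothetical helper, not part of this module.

```python
from typing import Iterable


def read_line(chars: Iterable[str], gen_chars: int = 0, newline: str = '\n') -> str:
    """Hypothetical sketch: collect characters until a newline or the gen_chars cap."""
    out = []
    for ch in chars:
        if ch == newline:
            # A newline ends the line regardless of the cap.
            break
        out.append(ch)
        if gen_chars > 0 and len(out) >= gen_chars:
            # Cap reached before a newline: return what we have.
            break
    return ''.join(out)
```

With gen_chars set to 5, the input 'abcdefgh' is cut to 'abcde', while with gen_chars set to 0 the same input is read until a newline or the end of the stream.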
Yields a gen_text object for each record that is generated. The generator will stop after the maximum number of lines is reached (based on your config).
Raises a RuntimeError if more than max_invalid invalid lines are generated.