Tokenizer Factory
The Tokenizer Factory is responsible for generating the tokenizer used in
training. The tokenizer is in turn used by the DatasetFactory to
tokenize the input data. A Tokenizer Factory can be created by inheriting from
the TokenizerFactory class and implementing the
create_tokenizer() method.
- class arctic_training.tokenizer.factory.TokenizerFactory(trainer, tokenizer_config=None)
Bases: ABC, CallbackMixin
Base class for all tokenizer factories.
- Parameters:
trainer (Trainer)
tokenizer_config (TokenizerConfig | None)
- name: str
The name of the tokenizer factory. This is used to identify the tokenizer factory in the registry.
- config: TokenizerConfig
The configuration class for the tokenizer factory. This is used to validate the configuration passed to the factory.
- property device: str
- property world_size: int
- property global_rank: int
Attributes
Similar to other Factory classes in ArcticTraining, the TokenizerFactory class
must have a name attribute that is used to identify
the factory when registering it with ArcticTraining and a
config attribute type hint that is used to validate
the config object passed to the factory.
Properties
A TokenizerFactory exposes several properties that give access to the
trainer and the distributed state at runtime, including
device, trainer,
world_size, and global_rank.
Methods
A TokenizerFactory has just one method that must be defined:
create_tokenizer(). This method should return the tokenizer
object created from the tokenizer config passed to the factory.
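The subclassing pattern described above can be sketched in plain Python. This snippet does not import ArcticTraining: the TokenizerFactory and TokenizerConfig classes below are simplified stand-ins that mirror only the constructor signature and the name, config, and create_tokenizer() members documented here. WhitespaceTokenizer, WhitespaceTokenizerFactory, the name_or_path field, and the fallback to a default config are hypothetical and exist only for illustration.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, List, Optional


@dataclass
class TokenizerConfig:
    """Stand-in for the real TokenizerConfig; the field is an assumption."""
    name_or_path: str = "whitespace"


class TokenizerFactory(ABC):
    """Simplified stand-in for arctic_training's TokenizerFactory."""

    name: str                # identifies the factory in the registry
    config: TokenizerConfig  # type hint used to validate the config

    def __init__(self, trainer: Any, tokenizer_config: Optional[TokenizerConfig] = None):
        self._trainer = trainer
        # Assumption: fall back to a default config when none is passed.
        self.config = tokenizer_config if tokenizer_config is not None else TokenizerConfig()

    @property
    def trainer(self) -> Any:
        return self._trainer

    @abstractmethod
    def create_tokenizer(self) -> Any:
        """Return the tokenizer built from self.config."""


class WhitespaceTokenizer:
    """Toy tokenizer (hypothetical) that splits text on whitespace."""

    def __call__(self, text: str) -> List[str]:
        return text.split()


class WhitespaceTokenizerFactory(TokenizerFactory):
    name = "whitespace"
    config: TokenizerConfig

    def create_tokenizer(self) -> WhitespaceTokenizer:
        # A real factory would typically read settings from self.config here.
        return WhitespaceTokenizer()


# No real trainer is needed for this sketch, so None is passed instead.
factory = WhitespaceTokenizerFactory(trainer=None)
tokenizer = factory.create_tokenizer()
tokens = tokenizer("hello tokenizer factory")
```

Because create_tokenizer() is declared abstract in the stand-in base class, instantiating the base class directly raises a TypeError, which matches the requirement that every factory define this method.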
HuggingFace Tokenizer Factory
A custom tokenizer factory can be created from the TokenizerFactory
building block, but ArcticTraining also comes with a TokenizerFactory
implementation that can be used out of the box: HFTokenizerFactory.
The HFTokenizerFactory loads a tokenizer given either the tokenizer name
on the HuggingFace Hub or a path to a local repo.