Tokenizer Factory

The Tokenizer Factory is responsible for generating the tokenizer used in training. The tokenizer is in turn used by the DatasetFactory to tokenize the input data. A Tokenizer Factory can be created by inheriting from the TokenizerFactory class and implementing the create_tokenizer() method.

class arctic_training.tokenizer.factory.TokenizerFactory(trainer, tokenizer_config=None)

Bases: ABC, CallbackMixin

Base class for all tokenizer factories.

Parameters:
name: str

The name of the tokenizer factory. This is used to identify the tokenizer factory in the registry.

config: TokenizerConfig

The configuration class for the tokenizer factory. This is used to validate the configuration passed to the factory.

property trainer: Trainer
property device: str
property world_size: int
property global_rank: int
abstract create_tokenizer()

Creates the tokenizer.

Return type:

PreTrainedTokenizer

Attributes

Similar to other Factory classes in ArcticTraining, the TokenizerFactory class must define two attributes. The name attribute identifies the factory when it is registered with ArcticTraining. The config attribute type hint is used to validate the config object passed to the factory.

Properties

A TokenizerFactory has several properties that can be used to access information about the trainer and distributed state at runtime: trainer, device, world_size, and global_rank.

Methods

A TokenizerFactory subclass must define just one method: create_tokenizer(). It should return the tokenizer object (a PreTrainedTokenizer) built from the tokenizer config passed to the factory.
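The pattern described above (a name attribute, a config type hint, and a single create_tokenizer() method) can be illustrated with a small self-contained sketch. Everything in it is a stand-in chosen for illustration: TokenizerFactoryBase, WhitespaceTokenizerConfig, and WhitespaceTokenizer are hypothetical and are not part of ArcticTraining. In real use, the subclass would presumably inherit from arctic_training.tokenizer.factory.TokenizerFactory and return a PreTrainedTokenizer. The fallback to a default config when tokenizer_config is None is also an assumption of this sketch, not documented library behavior.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class WhitespaceTokenizerConfig:
    """Hypothetical config; stands in for a TokenizerConfig subclass."""

    lowercase: bool = True


class WhitespaceTokenizer:
    """Toy tokenizer; a real factory would return a PreTrainedTokenizer."""

    def __init__(self, lowercase: bool) -> None:
        self.lowercase = lowercase
        self.vocab: Dict[str, int] = {}

    def encode(self, text: str) -> List[int]:
        if self.lowercase:
            text = text.lower()
        # Each unseen token gets the next free id; seen tokens reuse theirs.
        return [self.vocab.setdefault(tok, len(self.vocab)) for tok in text.split()]


class TokenizerFactoryBase(ABC):
    """Stand-in for the documented base class (NOT the real library class)."""

    name: str
    config: WhitespaceTokenizerConfig

    def __init__(self, trainer, tokenizer_config: Optional[WhitespaceTokenizerConfig] = None) -> None:
        self._trainer = trainer
        # Assumption of this sketch: fall back to a default config.
        self.config = WhitespaceTokenizerConfig() if tokenizer_config is None else tokenizer_config

    @property
    def trainer(self):
        return self._trainer

    @abstractmethod
    def create_tokenizer(self):
        """Return the tokenizer built from self.config."""


class WhitespaceTokenizerFactory(TokenizerFactoryBase):
    # Identifies the factory (in ArcticTraining, the registry would use this).
    name = "whitespace"
    # Type hint naming the config class this factory expects.
    config: WhitespaceTokenizerConfig

    def create_tokenizer(self) -> WhitespaceTokenizer:
        return WhitespaceTokenizer(lowercase=self.config.lowercase)


factory = WhitespaceTokenizerFactory(trainer=None)
tokenizer = factory.create_tokenizer()
ids = tokenizer.encode("Hello world hello")
print(ids)
```

Because create_tokenizer() is abstract, instantiating the base class directly raises a TypeError, which is how the sketch enforces that every factory supplies its own implementation.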

HuggingFace Tokenizer Factory

A custom tokenizer factory can be created from the TokenizerFactory building block, but ArcticTraining also comes with a TokenizerFactory implementation that can be used out of the box: HFTokenizerFactory. The HFTokenizerFactory loads a HuggingFace tokenizer given either the tokenizer name on the HuggingFace Hub or a path to a local repo.
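The internals of HFTokenizerFactory are not shown here. As a rough illustration of the "name or local path" behavior, the sketch below assumes the loading step resembles a call to AutoTokenizer.from_pretrained from the transformers library, which accepts either a Hub model id or a local directory. The helper names describe_source and load_hf_tokenizer are hypothetical and exist only for this example.

```python
import os


def describe_source(name_or_path: str) -> str:
    """Hypothetical helper: report how the string would be interpreted."""
    # An existing directory is treated as a local repo; anything else is
    # assumed to be a model id on the HuggingFace Hub.
    return "local" if os.path.isdir(name_or_path) else "hub"


def load_hf_tokenizer(name_or_path: str):
    """Hypothetical helper: load a tokenizer by Hub name or local path."""
    # Imported lazily so this module loads even without transformers installed.
    from transformers import AutoTokenizer

    # from_pretrained accepts either a local directory or a Hub model id.
    return AutoTokenizer.from_pretrained(name_or_path)
```

Calling load_hf_tokenizer with a Hub model id requires network access (or a populated local cache), while a local directory path is read from disk.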