Configuration
The main input to the ArcticTraining CLI is a YAML configuration file that
defines the fields of the TrainerConfig
class. This is a Pydantic configuration model that also contains the
sub-configurations for data, model, etc.
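As a rough illustration of the shape such a file takes, the sketch below writes a configuration as a Python dictionary that mirrors the YAML structure and checks only the keys that the JSON schema marks as required (`model` and `data` at the top level, `name_or_path` inside `model`, `sources` inside `data`). The model and dataset names are placeholders, the string shorthand for a data source is an assumption based on the `DataSourceConfig.type` description, and `missing_required` is a hypothetical helper, not part of ArcticTraining.

```python
from __future__ import annotations

# Required keys, taken from the "required" lists in the TrainerConfig JSON schema.
# The empty string stands for the top level of the config.
REQUIRED = {
    "": ["model", "data"],
    "model": ["name_or_path"],
    "data": ["sources"],
}

# Python mirror of a minimal YAML config; the names are placeholders.
minimal_config = {
    "type": "sft",
    "model": {"name_or_path": "my-org/my-model"},
    "data": {"sources": ["my-org/my-dataset"]},  # string shorthand is an assumption
}


def missing_required(config: dict) -> list[str]:
    """Return dotted paths of required keys that are absent (hypothetical helper)."""
    missing = []
    for section, keys in REQUIRED.items():
        scope = config if section == "" else config.get(section, {})
        for key in keys:
            if key not in scope:
                missing.append(f"{section}.{key}" if section else key)
    return missing


print(missing_required(minimal_config))
```

The real validation is done by Pydantic when the YAML file is loaded; this sketch only shows which keys cannot be omitted.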
- pydantic model arctic_training.config.trainer.TrainerConfig
Bases:
BaseConfig
Base Trainer Configuration.
JSON schema:
{ "title": "TrainerConfig", "description": "Base Trainer Configuration.", "type": "object", "properties": { "local_rank": { "title": "Local Rank", "type": "integer" }, "global_rank": { "title": "Global Rank", "type": "integer" }, "world_size": { "title": "World Size", "type": "integer" }, "type": { "default": "sft", "description": "Trainer type. ", "title": "Type", "type": "string" }, "code": { "default": "train.py", "description": "Path to the python script containing custom trainer implementation. ", "format": "path", "title": "Code", "type": "string" }, "skip_validation": { "default": false, "description": "Skips validation of types for subconfigs and registered classes. ", "title": "Skip Validation", "type": "boolean" }, "model": { "$ref": "#/$defs/ModelConfig", "description": "Model configuration. " }, "tokenizer": { "$ref": "#/$defs/TokenizerConfig", "description": "Tokenizer configuration. " }, "data": { "$ref": "#/$defs/DataConfig", "description": "Train and eval data configuration. " }, "logger": { "$ref": "#/$defs/LoggerConfig", "description": "Logger configuration. " }, "wandb": { "$ref": "#/$defs/WandBConfig", "description": "Weights and Biases configuration. " }, "scheduler": { "$ref": "#/$defs/SchedulerConfig", "description": "Scheduler configuration. " }, "optimizer": { "$ref": "#/$defs/OptimizerConfig", "description": "Optimizer configuration. " }, "deepspeed": { "additionalProperties": true, "default": {}, "description": "DeepSpeed config dict. Will be automatically filled if not provided by the user. ", "title": "Deepspeed", "type": "object" }, "epochs": { "default": 1, "description": "Number of epochs to train. ", "minimum": 0, "title": "Epochs", "type": "integer" }, "loss_log_interval": { "default": 1, "description": "Number of steps between logging loss. ", "minimum": 0, "title": "Loss Log Interval", "type": "integer" }, "train_log_iter_interval": { "default": 1, "description": "Iters between training metric log outputs. 
`0` is off, only intervals of `1` currently supported. ", "enum": [ 0, 1 ], "title": "Train Log Iter Interval", "type": "integer" }, "train_log_metrics_path": { "default": "train-log-metrics.jsonl", "description": ".jsonl path to log precise metrics according to the `train_log_iter_interval` schedule. Defaults to `./train-log-metrics.jsonl` ", "format": "path", "title": "Train Log Metrics Path", "type": "string" }, "gradient_accumulation_steps": { "default": 1, "description": "Number of gradient accumulation steps. ", "minimum": 1, "title": "Gradient Accumulation Steps", "type": "integer" }, "micro_batch_size": { "default": 1, "description": "Micro batch size per GPU. ", "minimum": 1, "title": "Micro Batch Size", "type": "integer" }, "sequence_parallel_size": { "default": 1, "description": "Sequence Parallelism Degree. Disabled if set to 1 ", "minimum": 1, "title": "Sequence Parallel Size", "type": "integer" }, "activation_checkpoint_cpu_offload": { "default": false, "description": "Offload activation checkpoint tensors to cpu. Enables a much longer sequence length. It is not very beneficial if sequence length is <64k ", "title": "Activation Checkpoint Cpu Offload", "type": "boolean" }, "tiled_mlp_compute": { "default": false, "description": "Tile the MLP computation to save GPU memory. Currently only limited architectures supported, but can be expanded to more. ", "title": "Tiled Mlp Compute", "type": "boolean" }, "seed": { "default": 42, "description": "Random seed value for numpy, python.random, torch, and transformers. ", "minimum": 0, "title": "Seed", "type": "integer" }, "checkpoint": { "default": [], "description": "Checkpoint configurations. Multiple checkpoint engines may be used together. ", "items": { "$ref": "#/$defs/CheckpointConfig" }, "title": "Checkpoint", "type": "array" }, "train_iters": { "default": 0, "description": "Maximum number of training iterations. 
", "minimum": 0, "title": "Train Iters", "type": "integer" }, "eval_frequency": { "default": 0, "minimum": 0, "title": "Eval Frequency", "type": "integer" }, "exit_iteration": { "default": 0, "description": "Force exit of training after specified iteration count (useful for debugging). ", "minimum": 0, "title": "Exit Iteration", "type": "integer" }, "min_iterations": { "default": 0, "description": "When >0, the training dataset will be replicated until there is enough data to run this many iterations. ", "minimum": 0, "title": "Min Iterations", "type": "integer" }, "overfit_first_batch": { "default": false, "description": "Train only on repetitions of the first training batch. Useful for development. ", "title": "Overfit First Batch", "type": "boolean" }, "mem_profiler": { "default": null, "description": "Enable memory profiling. ", "enum": [ null, "step", "e2e" ], "title": "Mem Profiler" }, "mem_profiler_dir": { "description": "Path to save memory profiling results. Defaults to `logger.output_dir/mem-prof`. ", "format": "path", "title": "Mem Profiler Dir", "type": "string" }, "mem_profiler_max_entries": { "default": 100000, "description": "Maximum number of entries to store in the memory profiler. ", "minimum": 1, "title": "Mem Profiler Max Entries", "type": "integer" }, "kill_switch_path": { "default": "/tmp/at_kill_switch", "description": "Path to a file that can be used to trigger a graceful shutdown mid-training (sets early exit to True). ", "format": "path", "title": "Kill Switch Path", "type": "string" } }, "$defs": { "CheckpointConfig": { "additionalProperties": false, "properties": { "local_rank": { "title": "Local Rank", "type": "integer" }, "global_rank": { "title": "Global Rank", "type": "integer" }, "world_size": { "title": "World Size", "type": "integer" }, "type": { "default": "", "description": "Checkpoint engine type. ", "title": "Type", "type": "string" }, "output_dir": { "description": "Checkpoint output directory. 
If directory does not exist, it will be created. ", "format": "path", "title": "Output Dir", "type": "string" }, "enabled": { "default": true, "description": "Enable this checkpoint engine. ", "title": "Enabled", "type": "boolean" }, "auto_resume": { "default": false, "description": "If a checkpoint is found in the output directory, resume training from that checkpoint. ", "title": "Auto Resume", "type": "boolean" }, "save_every_n_steps": { "default": 0, "description": "How often to trigger a checkpoint save by training global step count. ", "minimum": 0, "title": "Save Every N Steps", "type": "integer" }, "save_every_n_epochs": { "default": 0, "description": "How often to trigger a checkpoint save by training epoch count. ", "minimum": 0, "title": "Save Every N Epochs", "type": "integer" }, "save_end_of_training": { "default": false, "description": "Whether to save a checkpoint at the end of training. ", "title": "Save End Of Training", "type": "boolean" } }, "required": [ "output_dir" ], "title": "CheckpointConfig", "type": "object" }, "DataConfig": { "additionalProperties": false, "properties": { "local_rank": { "title": "Local Rank", "type": "integer" }, "global_rank": { "title": "Global Rank", "type": "integer" }, "world_size": { "title": "World Size", "type": "integer" }, "type": { "default": "", "description": "Data factory type. Defaults to the `data_factory_type` in the trainer. ", "title": "Type", "type": "string" }, "sources": { "description": "List of data sources to use for training. These must be registered `DataSource`. ", "items": { "$ref": "#/$defs/DataSourceConfig" }, "title": "Sources", "type": "array" }, "eval_sources": { "default": [], "description": "list of data sources to use for evaluation. These must be registered `DataSource`. ", "items": { "$ref": "#/$defs/DataSourceConfig" }, "title": "Eval Sources", "type": "array" }, "train_eval_split": { "default": [ 1.0, 0.0 ], "description": "How much of the training data to use for evaluation. 
", "maxItems": 2, "minItems": 2, "prefixItems": [ { "type": "number" }, { "type": "number" } ], "title": "Train Eval Split", "type": "array" }, "max_length": { "default": 8192, "description": "Maximum length of the input sequence. ", "title": "Max Length", "type": "integer" }, "num_proc": { "default": 16, "description": "Number of processes to use for data loading. ", "title": "Num Proc", "type": "integer" }, "dl_num_workers": { "default": 2, "description": "Number of DL workers per gpu. ", "title": "Dl Num Workers", "type": "integer" }, "seed": { "default": 42, "description": "Seed for data loading. ", "title": "Seed", "type": "integer" }, "use_data_cache": { "anyOf": [ { "type": "boolean" }, { "type": "null" } ], "default": null, "description": "Whether to cache loaded data. ", "title": "Use Data Cache" }, "cache_processed_data": { "anyOf": [ { "type": "boolean" }, { "type": "null" } ], "default": null, "description": "Deprecated, please use \"use_data_cache\". ", "title": "Cache Processed Data" }, "cache_dir": { "default": "/tmp", "description": "Directory to store cached data. ", "format": "path", "title": "Cache Dir", "type": "string" }, "cache_fs_type": { "default": "auto", "enum": [ "auto", "local", "shared" ], "title": "Cache Fs Type", "type": "string" } }, "required": [ "sources" ], "title": "DataConfig", "type": "object" }, "DataSourceConfig": { "additionalProperties": false, "description": "Base DataSource configuration.", "properties": { "local_rank": { "title": "Local Rank", "type": "integer" }, "global_rank": { "title": "Global Rank", "type": "integer" }, "world_size": { "title": "World Size", "type": "integer" }, "type": { "default": "", "description": "Data source type. Defaults to 'huggingface' if only a dataset name or path is provided.", "title": "Type", "type": "string" }, "split": { "default": "", "description": "Which split the data source is used for. 
This will be automatically set to either \"train\" or \"eval\" if no value is passed.\n\nFor HFDataSource, this can be any value supported by Dataset slice splits:\nhttps://huggingface.co/docs/datasets/en/loading#slice-splits.", "title": "Split", "type": "string" }, "sample_ratio": { "anyOf": [ { "type": "number" }, { "type": "null" } ], "default": null, "description": "Ratio of the dataset to randomly sample. If None, all examples are used.", "title": "Sample Ratio" }, "sample_count": { "anyOf": [ { "type": "integer" }, { "type": "null" } ], "default": null, "description": "Number of examples to randomly sample. If None, all examples are used.", "title": "Sample Count" }, "sample_seed": { "default": 42, "description": "Seed for random sampling. Used only if `sample_ratio` or `sample_count` is set.", "title": "Sample Seed", "type": "integer" }, "process": { "default": true, "description": "Whether to process the data with the data factory `process` function (e.g., tokenization for SFTDataFactory). ", "title": "Process", "type": "boolean" } }, "title": "DataSourceConfig", "type": "object" }, "LoggerConfig": { "additionalProperties": false, "properties": { "local_rank": { "title": "Local Rank", "type": "integer" }, "global_rank": { "title": "Global Rank", "type": "integer" }, "world_size": { "title": "World Size", "type": "integer" }, "output_dir": { "default": "/dev/null", "description": "Output directory for log files. ", "format": "path", "title": "Output Dir", "type": "string" }, "level": { "default": "WARNING", "description": "Log level for the logger. ", "title": "Level", "type": "string" }, "print_output_ranks": { "anyOf": [ { "const": "*", "type": "string" }, { "items": { "type": "integer" }, "type": "array" } ], "default": [ 0 ], "description": "Which ranks will print logs. Either a list of ranks or \"*\" for all ranks. 
", "title": "Print Output Ranks" }, "file_output_ranks": { "anyOf": [ { "const": "*", "type": "string" }, { "items": { "type": "integer" }, "type": "array" } ], "default": "*", "description": "Which ranks will output logs to a file. Either a list of ranks or \"*\" for all ranks. ", "title": "File Output Ranks" } }, "title": "LoggerConfig", "type": "object" }, "ModelConfig": { "additionalProperties": false, "properties": { "local_rank": { "title": "Local Rank", "type": "integer" }, "global_rank": { "title": "Global Rank", "type": "integer" }, "world_size": { "title": "World Size", "type": "integer" }, "type": { "default": "", "description": "Model factory type. ", "title": "Type", "type": "string" }, "name_or_path": { "anyOf": [ { "type": "string" }, { "format": "path", "type": "string" } ], "description": "Model name (as described in Hugging Face model hub) or local path to model checkpoint. ", "title": "Name Or Path" }, "dtype": { "default": "torch.bfloat16", "description": "Data type for model weights. ", "examples": [ "float32", "bfloat16" ], "title": "Torch Dtype", "type": "string" }, "save_name": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Name to use when saving the model. ", "title": "Save Name" }, "attn_implementation": { "default": "sdpa", "description": "Attention implementation to use. ", "title": "Attn Implementation", "type": "string" }, "disable_activation_checkpoint": { "default": false, "description": "Disable the use of activation checkpointing. ", "title": "Disable Activation Checkpoint", "type": "boolean" }, "peft_config": { "anyOf": [ { "additionalProperties": true, "type": "object" }, { "type": "null" } ], "default": null, "description": "Configuration for Parameter Efficient Fine Tuning. 
", "title": "Peft Config" } }, "required": [ "name_or_path" ], "title": "ModelConfig", "type": "object" }, "OptimizerConfig": { "additionalProperties": false, "properties": { "local_rank": { "title": "Local Rank", "type": "integer" }, "global_rank": { "title": "Global Rank", "type": "integer" }, "world_size": { "title": "World Size", "type": "integer" }, "type": { "default": "", "description": "Optimizer factory type. Defaults to the `optimizer_factory_type` of the trainer. ", "title": "Type", "type": "string" }, "weight_decay": { "default": 0.1, "description": "Coefficient for L2 regularization applied to the optimizer's weights. ", "minimum": 0.0, "title": "Weight Decay", "type": "number" }, "betas": { "default": [ 0.9, 0.999 ], "description": "Tuple of coefficients used for computing running averages of gradient and its square (e.g., (beta1, beta2) for Adam). ", "maxItems": 2, "minItems": 2, "prefixItems": [ { "type": "number" }, { "type": "number" } ], "title": "Betas", "type": "array" }, "lr": { "default": 0.0005, "description": "The initial learning rate. ", "minimum": 0.0, "title": "Lr", "type": "number" } }, "title": "OptimizerConfig", "type": "object" }, "SchedulerConfig": { "additionalProperties": false, "properties": { "local_rank": { "title": "Local Rank", "type": "integer" }, "global_rank": { "title": "Global Rank", "type": "integer" }, "world_size": { "title": "World Size", "type": "integer" }, "type": { "default": "", "description": "Scheduler factory type. Defaults to the `scheduler_factory_type` of the trainer. ", "title": "Type", "type": "string" }, "lr": { "anyOf": [ { "type": "number" }, { "type": "null" } ], "default": null, "description": "The initial learning rate. Deprecated in favor of `optimizer.learning_rate`. 
", "title": "Lr" } }, "title": "SchedulerConfig", "type": "object" }, "TokenizerConfig": { "additionalProperties": false, "properties": { "local_rank": { "title": "Local Rank", "type": "integer" }, "global_rank": { "title": "Global Rank", "type": "integer" }, "world_size": { "title": "World Size", "type": "integer" }, "type": { "default": "", "description": "Tokenizer factory type. Defaults to the `tokenizer_factory_type` of the trainer. ", "title": "Type", "type": "string" }, "name_or_path": { "anyOf": [ { "type": "string" }, { "format": "path", "type": "string" }, { "type": "null" } ], "default": "", "description": "Tokenizer name (as described in Hugging Face model hub) or local path directory containing tokenizer. ", "title": "Name Or Path" } }, "title": "TokenizerConfig", "type": "object" }, "WandBConfig": { "additionalProperties": false, "properties": { "local_rank": { "title": "Local Rank", "type": "integer" }, "global_rank": { "title": "Global Rank", "type": "integer" }, "world_size": { "title": "World Size", "type": "integer" }, "enable": { "default": false, "description": "Whether to enable Weights and Biases logging. ", "title": "Enable", "type": "boolean" }, "entity": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Weights and Biases entity name. ", "title": "Entity" }, "project": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": "arctic-training", "description": "Weights and Biases project name. ", "title": "Project" }, "name": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Weights and Biases run name. ", "title": "Name" } }, "title": "WandBConfig", "type": "object" } }, "additionalProperties": false, "required": [ "model", "data" ] }
- Config:
extra: str = forbid
use_enum_values: bool = True
validate_default: bool = True
use_attribute_docstrings: bool = True
populate_by_name: bool = True
validate_by_alias: bool = True
validate_by_name: bool = True
- Fields:
- Validators:
build_deepspeed_config » all fields
coerce_deepspeed_human_friendly_values » deepspeed
init_checkpoint_configs » checkpoint
init_data_config » data
init_dist » all fields
init_model_config » model
init_optimizer_config » optimizer
init_scheduler_config » scheduler
init_tokenizer_config » tokenizer
initialize_logger » logger
mem_profiler_mkdir » all fields
set_tokenizer » all fields
train_log_metrics_path_prep » all fields
validate_eval_frequency » all fields
validate_single_checkpoint_resume » all fields
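Because the model's `extra` setting is `forbid`, a key that is not a declared field is rejected rather than silently ignored. The following is a minimal sketch of that behaviour in plain Python, using the top-level property names from the JSON schema; `unknown_keys` is a hypothetical helper, and the real check is performed by Pydantic during validation.

```python
from __future__ import annotations

# Top-level property names copied from the TrainerConfig JSON schema.
KNOWN_FIELDS = {
    "local_rank", "global_rank", "world_size", "type", "code",
    "skip_validation", "model", "tokenizer", "data", "logger", "wandb",
    "scheduler", "optimizer", "deepspeed", "epochs", "loss_log_interval",
    "train_log_iter_interval", "train_log_metrics_path",
    "gradient_accumulation_steps", "micro_batch_size",
    "sequence_parallel_size", "activation_checkpoint_cpu_offload",
    "tiled_mlp_compute", "seed", "checkpoint", "train_iters",
    "eval_frequency", "exit_iteration", "min_iterations",
    "overfit_first_batch", "mem_profiler", "mem_profiler_dir",
    "mem_profiler_max_entries", "kill_switch_path",
}


def unknown_keys(config: dict) -> list[str]:
    """Return top-level keys that are not declared fields (hypothetical helper)."""
    return sorted(set(config) - KNOWN_FIELDS)


# "epoch" is a misspelling of "epochs" and would be rejected under extra=forbid.
print(unknown_keys({"model": {}, "data": {}, "epoch": 3}))
```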
-
field type: str = 'sft'
Trainer type.
- Validated by:
build_deepspeed_config, init_dist, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
-
field code: Path = PosixPath('train.py')
Path to the Python script containing the custom trainer implementation.
- Validated by:
build_deepspeed_config, init_dist, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
-
field skip_validation: bool = False
Skips validation of types for subconfigs and registered classes.
- Validated by:
build_deepspeed_config, init_dist, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
-
field model: ModelConfig [Required]
Model configuration.
- Validated by:
build_deepspeed_config, init_dist, init_model_config, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
-
field tokenizer: TokenizerConfig [Optional]
Tokenizer configuration.
- Validated by:
build_deepspeed_config, init_dist, init_tokenizer_config, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
-
field data: DataConfig [Required]
Train and eval data configuration.
- Validated by:
build_deepspeed_config, init_data_config, init_dist, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
-
field logger: LoggerConfig [Optional]
Logger configuration.
- Validated by:
build_deepspeed_config, init_dist, initialize_logger, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
-
field wandb: WandBConfig [Optional]
Weights and Biases configuration.
- Validated by:
build_deepspeed_config, init_dist, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
-
field scheduler: SchedulerConfig [Optional]
Scheduler configuration.
- Validated by:
build_deepspeed_config, init_dist, init_scheduler_config, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
-
field optimizer: OptimizerConfig [Optional]
Optimizer configuration.
- Validated by:
build_deepspeed_config, init_dist, init_optimizer_config, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
-
field deepspeed: Dict[str, Any] = {}
DeepSpeed config dict. Will be automatically filled if not provided by the user.
- Validated by:
build_deepspeed_config, coerce_deepspeed_human_friendly_values, init_dist, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
-
field epochs: int = 1
Number of epochs to train.
- Constraints:
ge = 0
- Validated by:
build_deepspeed_config, init_dist, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
-
field loss_log_interval: Annotated[int] = 1
Number of steps between logging loss.
- Constraints:
ge = 0
func = parse_human_val
json_schema_input_type = PydanticUndefined
- Validated by:
build_deepspeed_config, init_dist, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
-
field train_log_iter_interval: Literal[0, 1] = 1
Iterations between training metric log outputs. `0` turns logging off; only an interval of `1` is currently supported.
- Validated by:
build_deepspeed_config, init_dist, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
-
field train_log_metrics_path: Path = PosixPath('train-log-metrics.jsonl')
.jsonl path to log precise metrics according to the train_log_iter_interval schedule. Defaults to ./train-log-metrics.jsonl.
- Validated by:
build_deepspeed_config, init_dist, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
-
field gradient_accumulation_steps: int = 1
Number of gradient accumulation steps.
- Constraints:
ge = 1
- Validated by:
build_deepspeed_config, init_dist, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
-
field micro_batch_size: int = 1
Micro batch size per GPU.
- Constraints:
ge = 1
- Validated by:
build_deepspeed_config, init_dist, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
-
field sequence_parallel_size: int = 1
Sequence parallelism degree. Disabled if set to 1.
- Constraints:
ge = 1
- Validated by:
build_deepspeed_config, init_dist, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
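The batch-related settings (`gradient_accumulation_steps`, `micro_batch_size`, `sequence_parallel_size`) interact. Under the common convention that ranks within one sequence-parallel group work on the same samples (an assumption here, not something this page states), the number of samples consumed per optimizer step can be estimated as in the sketch below. `effective_batch_size` is a hypothetical helper, not an ArcticTraining function.

```python
def effective_batch_size(
    world_size: int,
    micro_batch_size: int = 1,
    gradient_accumulation_steps: int = 1,
    sequence_parallel_size: int = 1,
) -> int:
    """Estimate samples per optimizer step under the assumed convention."""
    if world_size % sequence_parallel_size != 0:
        raise ValueError("world_size must be divisible by sequence_parallel_size")
    # Ranks in the same sequence-parallel group share samples, so only
    # world_size // sequence_parallel_size groups see distinct data.
    data_parallel_size = world_size // sequence_parallel_size
    return micro_batch_size * gradient_accumulation_steps * data_parallel_size


# 8 GPUs, 2 samples per GPU, 4 accumulation steps, sequence-parallel degree 2.
print(effective_batch_size(8, micro_batch_size=2,
                           gradient_accumulation_steps=4,
                           sequence_parallel_size=2))
```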
-
field activation_checkpoint_cpu_offload: bool = False
Offload activation checkpoint tensors to CPU. Enables a much longer sequence length, but is not very beneficial if the sequence length is <64k.
- Validated by:
build_deepspeed_config, init_dist, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
-
field tiled_mlp_compute: bool = False
Tile the MLP computation to save GPU memory. Currently only a limited set of architectures is supported, but this can be expanded.
- Validated by:
build_deepspeed_config, init_dist, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
-
field seed: int = 42
Random seed value for numpy, python.random, torch, and transformers.
- Constraints:
ge = 0
- Validated by:
build_deepspeed_config, init_dist, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
-
field checkpoint: List[CheckpointConfig] = []
Checkpoint configurations. Multiple checkpoint engines may be used together.
- Validated by:
build_deepspeed_config, init_checkpoint_configs, init_dist, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
-
field train_iters: Annotated[int] = 0
Maximum number of training iterations.
- Constraints:
ge = 0
func = parse_human_val
json_schema_input_type = PydanticUndefined
- Validated by:
build_deepspeed_config, init_dist, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
-
field exit_iteration: int = 0
Force exit of training after specified iteration count (useful for debugging).
- Constraints:
ge = 0
- Validated by:
build_deepspeed_config, init_dist, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
-
field min_iterations: Annotated[int] = 0
When >0, the training dataset will be replicated until there is enough data to run this many iterations.
- Constraints:
ge = 0
func = parse_human_val
json_schema_input_type = PydanticUndefined
- Validated by:
build_deepspeed_config, init_dist, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
-
field overfit_first_batch: bool = False
Train only on repetitions of the first training batch. Useful for development.
- Validated by:
build_deepspeed_config, init_dist, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
-
field mem_profiler: Literal[None, 'step', 'e2e'] = None
Enable memory profiling.
- Validated by:
build_deepspeed_config, init_dist, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
-
field mem_profiler_dir: Path [Optional]
Path to save memory profiling results. Defaults to logger.output_dir/mem-prof.
- Validated by:
build_deepspeed_config, init_dist, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
-
field mem_profiler_max_entries: Annotated[int] = 100000
Maximum number of entries to store in the memory profiler.
- Constraints:
ge = 1
func = parse_human_val
json_schema_input_type = PydanticUndefined
- Validated by:
build_deepspeed_config, init_dist, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
-
field kill_switch_path: Path = PosixPath('/tmp/at_kill_switch')
Path to a file that can be used to trigger a graceful shutdown mid-training (sets early exit to True).
- Validated by:
build_deepspeed_config, init_dist, mem_profiler_mkdir, set_tokenizer, train_log_metrics_path_prep, validate_eval_frequency, validate_single_checkpoint_resume
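One way to picture how `exit_iteration` and `kill_switch_path` could end a run early is a loop that checks both conditions after each step. This is a sketch of the idea only: `run_steps` and its structure are hypothetical and do not reflect ArcticTraining's actual training loop.

```python
from pathlib import Path


def run_steps(max_steps: int, kill_switch_path: Path, exit_iteration: int = 0) -> int:
    """Run up to max_steps placeholder steps and return how many completed."""
    completed = 0
    for step in range(1, max_steps + 1):
        # ... one training step would run here ...
        completed = step
        if exit_iteration > 0 and step >= exit_iteration:
            break  # forced exit after a fixed iteration count (debugging aid)
        if kill_switch_path.exists():
            break  # the kill-switch file appeared: stop gracefully
    return completed
```

Under this reading, creating the file at `kill_switch_path` (for example with `touch /tmp/at_kill_switch`) while training is running would stop the loop at the next check.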
- pydantic model arctic_training.config.checkpoint.CheckpointConfig
Bases:
BaseConfig
JSON schema:
{ "title": "CheckpointConfig", "type": "object", "properties": { "local_rank": { "title": "Local Rank", "type": "integer" }, "global_rank": { "title": "Global Rank", "type": "integer" }, "world_size": { "title": "World Size", "type": "integer" }, "type": { "default": "", "description": "Checkpoint engine type. ", "title": "Type", "type": "string" }, "output_dir": { "description": "Checkpoint output directory. If directory does not exist, it will be created. ", "format": "path", "title": "Output Dir", "type": "string" }, "enabled": { "default": true, "description": "Enable this checkpoint engine. ", "title": "Enabled", "type": "boolean" }, "auto_resume": { "default": false, "description": "If a checkpoint is found in the output directory, resume training from that checkpoint. ", "title": "Auto Resume", "type": "boolean" }, "save_every_n_steps": { "default": 0, "description": "How often to trigger a checkpoint save by training global step count. ", "minimum": 0, "title": "Save Every N Steps", "type": "integer" }, "save_every_n_epochs": { "default": 0, "description": "How often to trigger a checkpoint save by training epoch count. ", "minimum": 0, "title": "Save Every N Epochs", "type": "integer" }, "save_end_of_training": { "default": false, "description": "Whether to save a checkpoint at the end of training. ", "title": "Save End Of Training", "type": "boolean" } }, "additionalProperties": false, "required": [ "output_dir" ] }
- Config:
extra: str = forbid
use_enum_values: bool = True
validate_default: bool = True
use_attribute_docstrings: bool = True
populate_by_name: bool = True
validate_by_alias: bool = True
validate_by_name: bool = True
- Fields:
- Validators:
resolve_output_dir » output_dir
-
field type: str = ''
Checkpoint engine type.
-
field output_dir: Path [Required]
Checkpoint output directory. If the directory does not exist, it will be created.
- Validated by:
resolve_output_dir
-
field enabled: bool = True
Enable this checkpoint engine.
-
field auto_resume: bool = False
If a checkpoint is found in the output directory, resume training from that checkpoint.
-
field save_every_n_steps: Annotated[int] = 0
How often to trigger a checkpoint save by training global step count.
- Constraints:
ge = 0
func = parse_human_val
json_schema_input_type = PydanticUndefined
-
field save_every_n_epochs: Annotated[int] = 0
How often to trigger a checkpoint save by training epoch count.
- Constraints:
ge = 0
func = parse_human_val
json_schema_input_type = PydanticUndefined
-
field save_end_of_training: bool = False
Whether to save a checkpoint at the end of training.
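The three save triggers can be combined into a single decision. The sketch below assumes that a value of `0` disables the corresponding periodic trigger (the schema allows `0` and uses it as the default, but this page does not spell out its meaning). `should_save` is a hypothetical helper, not the checkpoint engine's actual logic.

```python
from __future__ import annotations


def should_save(
    global_step: int,
    epoch_finished: int | None = None,
    training_finished: bool = False,
    save_every_n_steps: int = 0,
    save_every_n_epochs: int = 0,
    save_end_of_training: bool = False,
    enabled: bool = True,
) -> bool:
    """Decide whether to save now. epoch_finished is the 1-based count of
    epochs just completed, or None when the current step is mid-epoch."""
    if not enabled:
        return False
    if training_finished and save_end_of_training:
        return True
    # Assumption: 0 means "never save on this trigger".
    if save_every_n_steps > 0 and global_step > 0 and global_step % save_every_n_steps == 0:
        return True
    if (
        save_every_n_epochs > 0
        and epoch_finished is not None
        and epoch_finished % save_every_n_epochs == 0
    ):
        return True
    return False
```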
- pydantic model arctic_training.config.data.DataConfig
Bases:
BaseConfig
JSON schema:
{ "title": "DataConfig", "type": "object", "properties": { "local_rank": { "title": "Local Rank", "type": "integer" }, "global_rank": { "title": "Global Rank", "type": "integer" }, "world_size": { "title": "World Size", "type": "integer" }, "type": { "default": "", "description": "Data factory type. Defaults to the `data_factory_type` in the trainer. ", "title": "Type", "type": "string" }, "sources": { "description": "List of data sources to use for training. These must be registered `DataSource`. ", "items": { "$ref": "#/$defs/DataSourceConfig" }, "title": "Sources", "type": "array" }, "eval_sources": { "default": [], "description": "list of data sources to use for evaluation. These must be registered `DataSource`. ", "items": { "$ref": "#/$defs/DataSourceConfig" }, "title": "Eval Sources", "type": "array" }, "train_eval_split": { "default": [ 1.0, 0.0 ], "description": "How much of the training data to use for evaluation. ", "maxItems": 2, "minItems": 2, "prefixItems": [ { "type": "number" }, { "type": "number" } ], "title": "Train Eval Split", "type": "array" }, "max_length": { "default": 8192, "description": "Maximum length of the input sequence. ", "title": "Max Length", "type": "integer" }, "num_proc": { "default": 16, "description": "Number of processes to use for data loading. ", "title": "Num Proc", "type": "integer" }, "dl_num_workers": { "default": 2, "description": "Number of DL workers per gpu. ", "title": "Dl Num Workers", "type": "integer" }, "seed": { "default": 42, "description": "Seed for data loading. ", "title": "Seed", "type": "integer" }, "use_data_cache": { "anyOf": [ { "type": "boolean" }, { "type": "null" } ], "default": null, "description": "Whether to cache loaded data. ", "title": "Use Data Cache" }, "cache_processed_data": { "anyOf": [ { "type": "boolean" }, { "type": "null" } ], "default": null, "description": "Deprecated, please use \"use_data_cache\". 
", "title": "Cache Processed Data" }, "cache_dir": { "default": "/tmp", "description": "Directory to store cached data. ", "format": "path", "title": "Cache Dir", "type": "string" }, "cache_fs_type": { "default": "auto", "enum": [ "auto", "local", "shared" ], "title": "Cache Fs Type", "type": "string" } }, "$defs": { "DataSourceConfig": { "additionalProperties": false, "description": "Base DataSource configuration.", "properties": { "local_rank": { "title": "Local Rank", "type": "integer" }, "global_rank": { "title": "Global Rank", "type": "integer" }, "world_size": { "title": "World Size", "type": "integer" }, "type": { "default": "", "description": "Data source type. Defaults to 'huggingface' if only a dataset name or path is provided.", "title": "Type", "type": "string" }, "split": { "default": "", "description": "Which split the data source is used for. This will be automatically set to either \"train\" or \"eval\" if no value is passed.\n\nFor HFDataSource, this can be any value supported by Dataset slice splits:\nhttps://huggingface.co/docs/datasets/en/loading#slice-splits.", "title": "Split", "type": "string" }, "sample_ratio": { "anyOf": [ { "type": "number" }, { "type": "null" } ], "default": null, "description": "Ratio of the dataset to randomly sample. If None, all examples are used.", "title": "Sample Ratio" }, "sample_count": { "anyOf": [ { "type": "integer" }, { "type": "null" } ], "default": null, "description": "Number of examples to randomly sample. If None, all examples are used.", "title": "Sample Count" }, "sample_seed": { "default": 42, "description": "Seed for random sampling. Used only if `sample_ratio` or `sample_count` is set.", "title": "Sample Seed", "type": "integer" }, "process": { "default": true, "description": "Whether to process the data with the data factory `process` function (e.g., tokenization for SFTDataFactory). 
", "title": "Process", "type": "boolean" } }, "title": "DataSourceConfig", "type": "object" } }, "additionalProperties": false, "required": [ "sources" ] }
- Config:
extra: str = forbid
use_enum_values: bool = True
validate_default: bool = True
use_attribute_docstrings: bool = True
populate_by_name: bool = True
validate_by_alias: bool = True
validate_by_name: bool = True
- Fields:
cache_fs_type (Literal['auto', 'local', 'shared'])
eval_sources (List[arctic_training.config.data.DataSourceConfig])
sources (List[arctic_training.config.data.DataSourceConfig])
- Validators:
deprecate_cache_processed_data » cache_processed_data
deprecate_cache_processed_data » use_data_cache
resolve_cache_dir » cache_dir
set_cache_fs_type » all fields
validate_cache_dir » all fields
validate_train_eval_split » all fields
-
field type:
str = '' Data factory type. Defaults to the data_factory_type in the trainer.
- Validated by:
set_cache_fs_type
validate_cache_dir
validate_train_eval_split
-
field sources:
List[DataSourceConfig] [Required] List of data sources to use for training. These must be registered DataSource.
- Validated by:
set_cache_fs_type
validate_cache_dir
validate_train_eval_split
-
field eval_sources:
List[DataSourceConfig] = [] List of data sources to use for evaluation. These must be registered DataSource.
- Validated by:
set_cache_fs_type
validate_cache_dir
validate_train_eval_split
-
field train_eval_split:
Tuple[float,float] = (1.0, 0.0) How much of the training data to use for evaluation.
- Validated by:
set_cache_fs_type
validate_cache_dir
validate_train_eval_split
-
field max_length:
Annotated[int] = 8192 Maximum length of the input sequence.
- Constraints:
func = parse_human_val
json_schema_input_type = PydanticUndefined
- Validated by:
set_cache_fs_type
validate_cache_dir
validate_train_eval_split
-
field num_proc:
int = 16 Number of processes to use for data loading.
- Validated by:
set_cache_fs_type
validate_cache_dir
validate_train_eval_split
-
field dl_num_workers:
int = 2 Number of DL workers per gpu.
- Validated by:
set_cache_fs_type
validate_cache_dir
validate_train_eval_split
-
field seed:
int = 42 Seed for data loading.
- Validated by:
set_cache_fs_type
validate_cache_dir
validate_train_eval_split
-
field use_data_cache:
Optional[bool] = None Whether to cache loaded data.
- Validated by:
deprecate_cache_processed_data
set_cache_fs_type
validate_cache_dir
validate_train_eval_split
-
field cache_processed_data:
Optional[bool] = None Deprecated, please use “use_data_cache”.
- Validated by:
deprecate_cache_processed_data
set_cache_fs_type
validate_cache_dir
validate_train_eval_split
-
field cache_dir:
Path = PosixPath('/tmp') Directory to store cached data.
- Validated by:
resolve_cache_dir
set_cache_fs_type
validate_cache_dir
validate_train_eval_split
- validator init_source_configs » eval_sources, sources[source]
Convert string and dict input to correct subclass of DataSourceConfig. If a string is passed, “huggingface” is used as the DataSource type.
- Return type:
List[DataSourceConfig]
- Parameters:
v (List[str | Dict | DataSourceConfig])
info (ValidationInfo)
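The conversion described for init_source_configs can be pictured with a small stand-alone sketch. It is not the library's implementation: it works on plain dicts instead of DataSourceConfig subclasses, the function names are made up for illustration, and the `name_or_path` key is a hypothetical place to keep the dataset name. The fallback to "huggingface" for a dict without a `type` is also an assumption.

```python
from typing import Any, Dict, List, Union

SourceEntry = Union[str, Dict[str, Any]]


def normalize_source(entry: SourceEntry) -> Dict[str, Any]:
    """Return a plain dict describing one data source (illustrative only)."""
    if isinstance(entry, str):
        # A bare string is treated as a dataset name or path and gets the
        # "huggingface" type, as the validator description states.
        # "name_or_path" is a hypothetical key used only in this sketch.
        return {"type": "huggingface", "name_or_path": entry}
    if isinstance(entry, dict):
        normalized = dict(entry)  # copy so the caller's dict is untouched
        # Assumption: a dict without an explicit type also falls back
        # to "huggingface".
        normalized.setdefault("type", "huggingface")
        return normalized
    raise TypeError(f"unsupported source entry: {entry!r}")


def normalize_sources(entries: List[SourceEntry]) -> List[Dict[str, Any]]:
    """Normalize every entry of a `sources` or `eval_sources` list."""
    return [normalize_source(entry) for entry in entries]
```

With this sketch, a mixed list such as `["org/dataset", {"type": "custom", "split": "train"}]` comes back as a list of dicts that each carry an explicit `type`.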
- pydantic model arctic_training.config.logger.LoggerConfig[source]
Bases:
BaseConfig
Show JSON schema
{ "title": "LoggerConfig", "type": "object", "properties": { "local_rank": { "title": "Local Rank", "type": "integer" }, "global_rank": { "title": "Global Rank", "type": "integer" }, "world_size": { "title": "World Size", "type": "integer" }, "output_dir": { "default": "/dev/null", "description": "Output directory for log files. ", "format": "path", "title": "Output Dir", "type": "string" }, "level": { "default": "WARNING", "description": "Log level for the logger. ", "title": "Level", "type": "string" }, "print_output_ranks": { "anyOf": [ { "const": "*", "type": "string" }, { "items": { "type": "integer" }, "type": "array" } ], "default": [ 0 ], "description": "Which ranks will print logs. Either a list of ranks or \"*\" for all ranks. ", "title": "Print Output Ranks" }, "file_output_ranks": { "anyOf": [ { "const": "*", "type": "string" }, { "items": { "type": "integer" }, "type": "array" } ], "default": "*", "description": "Which ranks will output logs to a file. Either a list of ranks or \"*\" for all ranks. ", "title": "File Output Ranks" } }, "additionalProperties": false }
- Config:
extra: str = forbid
use_enum_values: bool = True
validate_default: bool = True
use_attribute_docstrings: bool = True
populate_by_name: bool = True
validate_by_alias: bool = True
validate_by_name: bool = True
- Fields:
- Validators:
fill_output_ranks » all fields
set_wandb_output_dir » all fields
-
field output_dir:
Path = PosixPath('/dev/null') Output directory for log files.
- Validated by:
fill_output_ranks
set_wandb_output_dir
-
field level:
str = 'WARNING' Log level for the logger.
- Validated by:
fill_output_ranks
set_wandb_output_dir
-
field print_output_ranks:
Union[Literal['*'],List[int]] = [0] Which ranks will print logs. Either a list of ranks or “*” for all ranks.
- Validated by:
fill_output_ranks
set_wandb_output_dir
-
field file_output_ranks:
Union[Literal['*'],List[int]] = '*' Which ranks will output logs to a file. Either a list of ranks or “*” for all ranks.
- Validated by:
fill_output_ranks
set_wandb_output_dir
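Both print_output_ranks and file_output_ranks accept either a list of ranks or "*" for all ranks. The sketch below shows one way to interpret those two shapes. The function names are hypothetical, and the actual fill_output_ranks validator may behave differently (for example in how it treats out-of-range ranks).

```python
from typing import List, Union

RankSpec = Union[str, List[int]]


def resolve_output_ranks(ranks: RankSpec, world_size: int) -> List[int]:
    """Expand a rank specification into an explicit, sorted list of ranks."""
    if ranks == "*":
        # "*" selects every rank in the job.
        return list(range(world_size))
    resolved = sorted(set(ranks))
    for rank in resolved:
        # Assumption: ranks outside [0, world_size) are rejected.
        if not 0 <= rank < world_size:
            raise ValueError(f"rank {rank} is outside 0..{world_size - 1}")
    return resolved


def should_output(rank: int, ranks: RankSpec, world_size: int) -> bool:
    """Return True if the given rank is selected by the specification."""
    return rank in resolve_output_ranks(ranks, world_size)
```

Under these assumptions, the default print_output_ranks of [0] selects only rank 0, while the default file_output_ranks of "*" selects every rank.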
- pydantic model arctic_training.config.model.ModelConfig[source]
Bases:
BaseConfig
Show JSON schema
{ "title": "ModelConfig", "type": "object", "properties": { "local_rank": { "title": "Local Rank", "type": "integer" }, "global_rank": { "title": "Global Rank", "type": "integer" }, "world_size": { "title": "World Size", "type": "integer" }, "type": { "default": "", "description": "Model factory type. ", "title": "Type", "type": "string" }, "name_or_path": { "anyOf": [ { "type": "string" }, { "format": "path", "type": "string" } ], "description": "Model name (as described in Hugging Face model hub) or local path to model checkpoint. ", "title": "Name Or Path" }, "dtype": { "default": "torch.bfloat16", "description": "Data type for model weights. ", "examples": [ "float32", "bfloat16" ], "title": "Torch Dtype", "type": "string" }, "save_name": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Name to use when saving the model. ", "title": "Save Name" }, "attn_implementation": { "default": "sdpa", "description": "Attention implementation to use. ", "title": "Attn Implementation", "type": "string" }, "disable_activation_checkpoint": { "default": false, "description": "Disable the use of activation checkpointing. ", "title": "Disable Activation Checkpoint", "type": "boolean" }, "peft_config": { "anyOf": [ { "additionalProperties": true, "type": "object" }, { "type": "null" } ], "default": null, "description": "Configuration for Parameter Efficient Fine Tuning. ", "title": "Peft Config" } }, "additionalProperties": false, "required": [ "name_or_path" ] }
- Config:
extra: str = forbid
use_enum_values: bool = True
validate_default: bool = True
use_attribute_docstrings: bool = True
populate_by_name: bool = True
validate_by_alias: bool = True
validate_by_name: bool = True
- Fields:
- Validators:
validate_attn_implementation » attn_implementation
validate_peft_config_type » peft_config
-
field type:
str = '' Model factory type.
-
field name_or_path:
Union[str,Path] [Required] Model name (as described in Hugging Face model hub) or local path to model checkpoint.
-
field dtype:
DType = DType.BF16 Data type for model weights.
-
field save_name:
Optional[str] = None Name to use when saving the model.
-
field attn_implementation:
str = 'sdpa' Attention implementation to use.
- Validated by:
validate_attn_implementation
-
field disable_activation_checkpoint:
bool = False Disable the use of activation checkpointing.
-
field peft_config:
Optional[Dict] = None Configuration for Parameter Efficient Fine Tuning.
- Validated by:
validate_peft_config_type
- pydantic model arctic_training.config.optimizer.OptimizerConfig[source]
Bases:
BaseConfig
Show JSON schema
{ "title": "OptimizerConfig", "type": "object", "properties": { "local_rank": { "title": "Local Rank", "type": "integer" }, "global_rank": { "title": "Global Rank", "type": "integer" }, "world_size": { "title": "World Size", "type": "integer" }, "type": { "default": "", "description": "Optimizer factory type. Defaults to the `optimizer_factory_type` of the trainer. ", "title": "Type", "type": "string" }, "weight_decay": { "default": 0.1, "description": "Coefficient for L2 regularization applied to the optimizer's weights. ", "minimum": 0.0, "title": "Weight Decay", "type": "number" }, "betas": { "default": [ 0.9, 0.999 ], "description": "Tuple of coefficients used for computing running averages of gradient and its square (e.g., (beta1, beta2) for Adam). ", "maxItems": 2, "minItems": 2, "prefixItems": [ { "type": "number" }, { "type": "number" } ], "title": "Betas", "type": "array" }, "lr": { "default": 0.0005, "description": "The initial learning rate. ", "minimum": 0.0, "title": "Lr", "type": "number" } }, "additionalProperties": false }
- Config:
extra: str = forbid
use_enum_values: bool = True
validate_default: bool = True
use_attribute_docstrings: bool = True
populate_by_name: bool = True
validate_by_alias: bool = True
validate_by_name: bool = True
- Fields:
-
field type:
str = '' Optimizer factory type. Defaults to the optimizer_factory_type of the trainer.
-
field weight_decay:
Annotated[float] = 0.1 Coefficient for L2 regularization applied to the optimizer’s weights.
- Constraints:
ge = 0.0
func = parse_human_val
json_schema_input_type = PydanticUndefined
-
field betas:
Tuple[float,float] = (0.9, 0.999) Tuple of coefficients used for computing running averages of gradient and its square (e.g., (beta1, beta2) for Adam).
-
field learning_rate:
Annotated[float] = 0.0005 (alias 'lr') The initial learning rate.
- Constraints:
ge = 0.0
func = parse_human_val
json_schema_input_type = PydanticUndefined
- pydantic model arctic_training.config.scheduler.SchedulerConfig[source]
Bases:
BaseConfig
Show JSON schema
{ "title": "SchedulerConfig", "type": "object", "properties": { "local_rank": { "title": "Local Rank", "type": "integer" }, "global_rank": { "title": "Global Rank", "type": "integer" }, "world_size": { "title": "World Size", "type": "integer" }, "type": { "default": "", "description": "Scheduler factory type. Defaults to the `scheduler_factory_type` of the trainer. ", "title": "Type", "type": "string" }, "lr": { "anyOf": [ { "type": "number" }, { "type": "null" } ], "default": null, "description": "The initial learning rate. Deprecated in favor of `optimizer.learning_rate`. ", "title": "Lr" } }, "additionalProperties": false }
- Config:
extra: str = forbid
use_enum_values: bool = True
validate_default: bool = True
use_attribute_docstrings: bool = True
populate_by_name: bool = True
validate_by_alias: bool = True
validate_by_name: bool = True
- Fields:
- Validators:
_deprecated_learning_rate » learning_rate
-
field type:
str = '' Scheduler factory type. Defaults to the scheduler_factory_type of the trainer.
-
field learning_rate:
Optional[float] = None (alias 'lr') The initial learning rate. Deprecated in favor of optimizer.learning_rate.
- Validated by:
_deprecated_learning_rate
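Because the scheduler learning rate is deprecated in favor of optimizer.learning_rate, older configuration files may need to be updated. The helper below is a hypothetical migration sketch that works on a plain dict loaded from YAML. It is not what the _deprecated_learning_rate validator does internally, and the rule that an existing optimizer value wins is an assumption.

```python
import warnings
from typing import Any, Dict


def migrate_scheduler_lr(config: Dict[str, Any]) -> Dict[str, Any]:
    """Move a deprecated scheduler learning rate to the optimizer section."""
    # Shallow-copy each section so the caller's dicts are not mutated.
    migrated = {k: dict(v) if isinstance(v, dict) else v for k, v in config.items()}
    scheduler = migrated.get("scheduler")
    if not isinstance(scheduler, dict):
        return migrated
    # The field is named learning_rate and also accepts the alias lr.
    for key in ("lr", "learning_rate"):
        value = scheduler.pop(key, None)
        if value is None:
            continue
        warnings.warn(
            f"scheduler.{key} is deprecated; use optimizer.learning_rate",
            DeprecationWarning,
            stacklevel=2,
        )
        optimizer = migrated.setdefault("optimizer", {})
        # Assumption: a learning rate already set on the optimizer wins.
        if "lr" not in optimizer and "learning_rate" not in optimizer:
            optimizer["learning_rate"] = value
    return migrated
```

The function returns a new dict and leaves its input untouched, so it can be applied to a loaded configuration before handing it to the CLI or to TrainerConfig.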
- pydantic model arctic_training.config.tokenizer.TokenizerConfig[source]
Bases:
BaseConfig
Show JSON schema
{ "title": "TokenizerConfig", "type": "object", "properties": { "local_rank": { "title": "Local Rank", "type": "integer" }, "global_rank": { "title": "Global Rank", "type": "integer" }, "world_size": { "title": "World Size", "type": "integer" }, "type": { "default": "", "description": "Tokenizer factory type. Defaults to the `tokenizer_factory_type` of the trainer. ", "title": "Type", "type": "string" }, "name_or_path": { "anyOf": [ { "type": "string" }, { "format": "path", "type": "string" }, { "type": "null" } ], "default": "", "description": "Tokenizer name (as described in Hugging Face model hub) or local path directory containing tokenizer. ", "title": "Name Or Path" } }, "additionalProperties": false }
- Config:
extra: str = forbid
use_enum_values: bool = True
validate_default: bool = True
use_attribute_docstrings: bool = True
populate_by_name: bool = True
validate_by_alias: bool = True
validate_by_name: bool = True
- Fields:
-
field type:
str = '' Tokenizer factory type. Defaults to the tokenizer_factory_type of the trainer.
-
field name_or_path:
Union[str,Path,None] = '' Tokenizer name (as described in Hugging Face model hub) or local path directory containing tokenizer.
- pydantic model arctic_training.config.wandb.WandBConfig[source]
Bases:
BaseConfig
Show JSON schema
{ "title": "WandBConfig", "type": "object", "properties": { "local_rank": { "title": "Local Rank", "type": "integer" }, "global_rank": { "title": "Global Rank", "type": "integer" }, "world_size": { "title": "World Size", "type": "integer" }, "enable": { "default": false, "description": "Whether to enable Weights and Biases logging. ", "title": "Enable", "type": "boolean" }, "entity": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Weights and Biases entity name. ", "title": "Entity" }, "project": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": "arctic-training", "description": "Weights and Biases project name. ", "title": "Project" }, "name": { "anyOf": [ { "type": "string" }, { "type": "null" } ], "default": null, "description": "Weights and Biases run name. ", "title": "Name" } }, "additionalProperties": false }
- Config:
extra: str = forbid
use_enum_values: bool = True
validate_default: bool = True
use_attribute_docstrings: bool = True
populate_by_name: bool = True
validate_by_alias: bool = True
validate_by_name: bool = True
- Fields:
-
field enable:
bool = False Whether to enable Weights and Biases logging.
-
field entity:
Optional[str] = None Weights and Biases entity name.
-
field project:
Optional[str] = 'arctic-training' Weights and Biases project name.
-
field name:
Optional[str] = None Weights and Biases run name.
Numerical Formatting
When specifying numerical values in the configuration file, you can use human-friendly strings to represent very large or very small numbers. The following formats are supported:
- X%: This format represents a percentage. For example, 50% is equivalent to 0.5.
- XeY: This format represents a number in scientific notation. For example, 1e-6 is equivalent to 0.000001.
- X^Y: This format represents a number raised to a power. For example, 2^20 is equivalent to 1048576.
- XK: This format represents a number in thousands (base 10). For example, 1K is equivalent to 1000. Similarly, you can use M for millions, B for billions, and T for trillions.
- XKi: This format represents a number in kibibytes (base 2). For example, 1Ki is equivalent to 1024. Similarly, you can use Mi for mebibytes, Gi for gibibytes, and Ti for tebibytes.
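The formats above can be illustrated with a small parser. This is a sketch written for this page, not the parse_human_val function used by the configuration models: the name parse_human_number_sketch is hypothetical, and details such as whitespace handling and the order in which formats are checked are assumptions.

```python
from typing import Union

Number = Union[int, float]

# Base-10 and base-2 suffixes listed in the section above.
_DECIMAL = {"K": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}
_BINARY = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def _to_number(text: str) -> Number:
    """Parse plain digits as int, anything else (e.g. 1e-6) as float."""
    try:
        return int(text)
    except ValueError:
        return float(text)


def parse_human_number_sketch(value: Union[str, Number]) -> Number:
    """Convert a human-friendly numeric string into an int or float."""
    if isinstance(value, (int, float)):
        return value  # already numeric, nothing to parse
    text = value.strip()
    if text.endswith("%"):
        return float(text[:-1]) / 100
    if "^" in text:
        base, exponent = text.split("^", 1)
        return _to_number(base) ** _to_number(exponent)
    # Check two-character binary suffixes before one-character decimal ones.
    for suffix, multiplier in _BINARY.items():
        if text.endswith(suffix):
            return _to_number(text[: -len(suffix)]) * multiplier
    for suffix, multiplier in _DECIMAL.items():
        if text.endswith(suffix):
            return _to_number(text[: -len(suffix)]) * multiplier
    return _to_number(text)
```

Each example from the list can be checked directly, for instance `parse_human_number_sketch("2^20")` and `parse_human_number_sketch("1Ki")`.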