Configuration

The main input to the ArcticTraining CLI is a YAML configuration file that defines the fields of the TrainerConfig class. TrainerConfig is a Pydantic configuration model that also contains the sub-configurations for data, model, etc.
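As a rough illustration of what such a file must contain, the sketch below writes a minimal configuration as a Python dict (the YAML file has the same nesting, shown in the comment). It then checks the keys that the JSON schema further down marks as required: top-level `model` and `data`, `model.name_or_path`, and `data.sources`. The helper `missing_required` and the model and dataset names are hypothetical placeholders. Real validation is done by `TrainerConfig` itself, not by this function.

```python
from typing import Any

# Required keys, taken from the "required" lists in the TrainerConfig JSON schema.
REQUIRED_PATHS = [
    ("model",),
    ("data",),
    ("model", "name_or_path"),
    ("data", "sources"),
]


def missing_required(config: dict[str, Any]) -> list[str]:
    """Return the dotted paths of required keys that are absent from a config dict."""
    missing = []
    for path in REQUIRED_PATHS:
        node: Any = config
        for key in path:
            if not isinstance(node, dict) or key not in node:
                missing.append(".".join(path))
                break
            node = node[key]
    return missing


# Equivalent YAML:
#   type: sft
#   model:
#     name_or_path: my-org/my-model
#   data:
#     sources:
#       - my-org/my-dataset
minimal_config = {
    "type": "sft",
    "model": {"name_or_path": "my-org/my-model"},  # hypothetical model name
    "data": {"sources": ["my-org/my-dataset"]},    # hypothetical dataset name
}

print(missing_required(minimal_config))  # []
print(missing_required({"model": {}}))   # ['data', 'model.name_or_path', 'data.sources']
```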

pydantic model arctic_training.config.trainer.TrainerConfig

Bases: BaseConfig

Base Trainer Configuration.

JSON schema:
{
   "title": "TrainerConfig",
   "description": "Base Trainer Configuration.",
   "type": "object",
   "properties": {
      "local_rank": {
         "title": "Local Rank",
         "type": "integer"
      },
      "global_rank": {
         "title": "Global Rank",
         "type": "integer"
      },
      "world_size": {
         "title": "World Size",
         "type": "integer"
      },
      "type": {
         "default": "sft",
         "description": "Trainer type. ",
         "title": "Type",
         "type": "string"
      },
      "code": {
         "default": "train.py",
         "description": "Path to the python script containing custom trainer implementation. ",
         "format": "path",
         "title": "Code",
         "type": "string"
      },
      "skip_validation": {
         "default": false,
         "description": "Skips validation of types for subconfigs and registered classes. ",
         "title": "Skip Validation",
         "type": "boolean"
      },
      "model": {
         "$ref": "#/$defs/ModelConfig",
         "description": "Model configuration. "
      },
      "tokenizer": {
         "$ref": "#/$defs/TokenizerConfig",
         "description": "Tokenizer configuration. "
      },
      "data": {
         "$ref": "#/$defs/DataConfig",
         "description": "Train and eval data configuration. "
      },
      "logger": {
         "$ref": "#/$defs/LoggerConfig",
         "description": "Logger configuration. "
      },
      "wandb": {
         "$ref": "#/$defs/WandBConfig",
         "description": "Weights and Biases configuration. "
      },
      "scheduler": {
         "$ref": "#/$defs/SchedulerConfig",
         "description": "Scheduler configuration. "
      },
      "optimizer": {
         "$ref": "#/$defs/OptimizerConfig",
         "description": "Optimizer configuration. "
      },
      "deepspeed": {
         "additionalProperties": true,
         "default": {},
         "description": "DeepSpeed config dict. Will be automatically filled if not provided by the user. ",
         "title": "Deepspeed",
         "type": "object"
      },
      "epochs": {
         "default": 1,
         "description": "Number of epochs to train. ",
         "minimum": 0,
         "title": "Epochs",
         "type": "integer"
      },
      "loss_log_interval": {
         "default": 1,
         "description": "Number of steps between logging loss. ",
         "minimum": 0,
         "title": "Loss Log Interval",
         "type": "integer"
      },
      "train_log_iter_interval": {
         "default": 1,
         "description": "Iters between training metric log outputs. `0` is off, only intervals of `1` currently supported. ",
         "enum": [
            0,
            1
         ],
         "title": "Train Log Iter Interval",
         "type": "integer"
      },
      "train_log_metrics_path": {
         "default": "train-log-metrics.jsonl",
         "description": ".jsonl path to log precise metrics according to the `train_log_iter_interval` schedule. Defaults to `./train-log-metrics.jsonl` ",
         "format": "path",
         "title": "Train Log Metrics Path",
         "type": "string"
      },
      "gradient_accumulation_steps": {
         "default": 1,
         "description": "Number of gradient accumulation steps. ",
         "minimum": 1,
         "title": "Gradient Accumulation Steps",
         "type": "integer"
      },
      "micro_batch_size": {
         "default": 1,
         "description": "Micro batch size per GPU. ",
         "minimum": 1,
         "title": "Micro Batch Size",
         "type": "integer"
      },
      "sequence_parallel_size": {
         "default": 1,
         "description": "Sequence Parallelism Degree. Disabled if set to 1 ",
         "minimum": 1,
         "title": "Sequence Parallel Size",
         "type": "integer"
      },
      "activation_checkpoint_cpu_offload": {
         "default": false,
         "description": "Offload activation checkpoint tensors to CPU. Enables much longer sequence lengths, but is not very beneficial for sequence lengths below 64k. ",
         "title": "Activation Checkpoint Cpu Offload",
         "type": "boolean"
      },
      "tiled_mlp_compute": {
         "default": false,
         "description": "Tile the MLP computation to save GPU memory. Currently only limited architectures supported, but can be expanded to more. ",
         "title": "Tiled Mlp Compute",
         "type": "boolean"
      },
      "seed": {
         "default": 42,
         "description": "Random seed value for numpy, python.random, torch, and transformers. ",
         "minimum": 0,
         "title": "Seed",
         "type": "integer"
      },
      "checkpoint": {
         "default": [],
         "description": "Checkpoint configurations. Multiple checkpoint engines may be used together. ",
         "items": {
            "$ref": "#/$defs/CheckpointConfig"
         },
         "title": "Checkpoint",
         "type": "array"
      },
      "train_iters": {
         "default": 0,
         "description": "Maximum number of training iterations. ",
         "minimum": 0,
         "title": "Train Iters",
         "type": "integer"
      },
      "eval_interval": {
         "default": 0,
         "description": "Number of iterations between evaluations. If 0, no evaluation is performed. ",
         "minimum": 0,
         "title": "Eval Interval",
         "type": "integer"
      },
      "eval_log_iter_interval": {
         "default": 1,
         "description": "Iters between eval metric log outputs. `0` is off. ",
         "minimum": 0,
         "title": "Eval Log Iter Interval",
         "type": "integer"
      },
      "exit_iteration": {
         "default": 0,
         "description": "Do not continue training after specified iteration count even if there is still data and epochs to run (useful for debugging and tests). ",
         "minimum": 0,
         "title": "Exit Iteration",
         "type": "integer"
      },
      "exit_iteration_this_run": {
         "default": 0,
         "description": "Force exit of training after the specified iteration count in this run; after a resume, training continues until `exit_iteration` is reached or the data/epochs run out (useful for debugging and tests). ",
         "minimum": 0,
         "title": "Exit Iteration This Run",
         "type": "integer"
      },
      "min_iterations": {
         "default": 0,
         "description": "When >0, the training dataset will be replicated until there is enough data to run this many iterations. ",
         "minimum": 0,
         "title": "Min Iterations",
         "type": "integer"
      },
      "overfit_first_batch": {
         "default": false,
         "description": "Train only on repetitions of the first training batch. Useful for development. ",
         "title": "Overfit First Batch",
         "type": "boolean"
      },
      "mem_profiler": {
         "default": null,
         "description": "Enable memory profiling. ",
         "enum": [
            null,
            "step",
            "e2e"
         ],
         "title": "Mem Profiler"
      },
      "mem_profiler_dir": {
         "description": "Path to save memory profiling results. Defaults to `logger.output_dir/mem-prof`. ",
         "format": "path",
         "title": "Mem Profiler Dir",
         "type": "string"
      },
      "mem_profiler_max_entries": {
         "default": 100000,
         "description": "Maximum number of entries to store in the memory profiler. ",
         "minimum": 1,
         "title": "Mem Profiler Max Entries",
         "type": "integer"
      },
      "kill_switch_path": {
         "default": "/tmp/at_kill_switch",
         "description": "Path to a file that can be used to trigger a graceful shutdown mid-training (sets early exit to True). ",
         "format": "path",
         "title": "Kill Switch Path",
         "type": "string"
      }
   },
   "$defs": {
      "CheckpointConfig": {
         "additionalProperties": false,
         "properties": {
            "local_rank": {
               "title": "Local Rank",
               "type": "integer"
            },
            "global_rank": {
               "title": "Global Rank",
               "type": "integer"
            },
            "world_size": {
               "title": "World Size",
               "type": "integer"
            },
            "type": {
               "default": "",
               "description": "Checkpoint engine type. ",
               "title": "Type",
               "type": "string"
            },
            "output_dir": {
               "description": "Checkpoint output directory. If directory does not exist, it will be created. ",
               "format": "path",
               "title": "Output Dir",
               "type": "string"
            },
            "enabled": {
               "default": true,
               "description": "Enable this checkpoint engine. ",
               "title": "Enabled",
               "type": "boolean"
            },
            "auto_resume": {
               "default": false,
               "description": "If a checkpoint is found in the output directory, resume training from that checkpoint. ",
               "title": "Auto Resume",
               "type": "boolean"
            },
            "save_every_n_steps": {
               "default": 0,
               "description": "How often to trigger a checkpoint save by training global step count. ",
               "minimum": 0,
               "title": "Save Every N Steps",
               "type": "integer"
            },
            "save_every_n_epochs": {
               "default": 0,
               "description": "How often to trigger a checkpoint save by training epoch count. ",
               "minimum": 0,
               "title": "Save Every N Epochs",
               "type": "integer"
            },
            "save_end_of_training": {
               "default": false,
               "description": "Whether to save a checkpoint at the end of training. ",
               "title": "Save End Of Training",
               "type": "boolean"
            }
         },
         "required": [
            "output_dir"
         ],
         "title": "CheckpointConfig",
         "type": "object"
      },
      "DataConfig": {
         "additionalProperties": false,
         "properties": {
            "local_rank": {
               "title": "Local Rank",
               "type": "integer"
            },
            "global_rank": {
               "title": "Global Rank",
               "type": "integer"
            },
            "world_size": {
               "title": "World Size",
               "type": "integer"
            },
            "type": {
               "default": "",
               "description": "Data factory type. Defaults to the `data_factory_type` in the trainer. ",
               "title": "Type",
               "type": "string"
            },
            "sources": {
               "description": "List of data sources to use for training. These must be registered `DataSource`. ",
               "items": {
                  "$ref": "#/$defs/DataSourceConfig"
               },
               "title": "Sources",
               "type": "array"
            },
            "eval_sources": {
               "default": [],
               "description": "List of data sources to use for evaluation. These must be registered `DataSource`. ",
               "items": {
                  "$ref": "#/$defs/DataSourceConfig"
               },
               "title": "Eval Sources",
               "type": "array"
            },
            "train_eval_split": {
               "default": [
                  1.0,
                  0.0
               ],
               "description": "How much of the training data to use for evaluation. ",
               "maxItems": 2,
               "minItems": 2,
               "prefixItems": [
                  {
                     "type": "number"
                  },
                  {
                     "type": "number"
                  }
               ],
               "title": "Train Eval Split",
               "type": "array"
            },
            "max_length": {
               "default": 8192,
               "description": "Maximum length of the input sequence. ",
               "title": "Max Length",
               "type": "integer"
            },
            "num_proc": {
               "default": 16,
               "description": "Number of processes to use for data loading. ",
               "title": "Num Proc",
               "type": "integer"
            },
            "dl_num_workers": {
               "default": 2,
               "description": "Number of DL workers per gpu. ",
               "title": "Dl Num Workers",
               "type": "integer"
            },
            "seed": {
               "default": 42,
               "description": "Seed for data loading. ",
               "title": "Seed",
               "type": "integer"
            },
            "use_data_cache": {
               "anyOf": [
                  {
                     "type": "boolean"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Whether to cache loaded data. ",
               "title": "Use Data Cache"
            },
            "cache_processed_data": {
               "anyOf": [
                  {
                     "type": "boolean"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Deprecated, please use \"use_data_cache\". ",
               "title": "Cache Processed Data"
            },
            "cache_dir": {
               "default": "/tmp",
               "description": "Directory to store cached data. ",
               "format": "path",
               "title": "Cache Dir",
               "type": "string"
            },
            "cache_fs_type": {
               "default": "auto",
               "enum": [
                  "auto",
                  "local",
                  "shared"
               ],
               "title": "Cache Fs Type",
               "type": "string"
            },
            "fail_on_missing_cache": {
               "default": false,
               "description": "Whether to fail if the cache is missing. ",
               "title": "Fail On Missing Cache",
               "type": "boolean"
            }
         },
         "required": [
            "sources"
         ],
         "title": "DataConfig",
         "type": "object"
      },
      "DataSourceConfig": {
         "additionalProperties": false,
         "description": "Base DataSource configuration.",
         "properties": {
            "local_rank": {
               "title": "Local Rank",
               "type": "integer"
            },
            "global_rank": {
               "title": "Global Rank",
               "type": "integer"
            },
            "world_size": {
               "title": "World Size",
               "type": "integer"
            },
            "type": {
               "default": "",
               "description": "Data source type. Defaults to 'huggingface' if only a dataset name or path is provided.",
               "title": "Type",
               "type": "string"
            },
            "split": {
               "default": "",
               "description": "Which split to load for a given data source. This will be automatically set to either \"train\" or \"eval\" if no value is passed.\n\nFor HFDataSource, this can be any value supported by Dataset slice splits:\nhttps://huggingface.co/docs/datasets/en/loading#slice-splits.",
               "title": "Split",
               "type": "string"
            },
            "sample_ratio": {
               "anyOf": [
                  {
                     "type": "number"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Ratio of the dataset to randomly sample. If None, all examples are used.",
               "title": "Sample Ratio"
            },
            "sample_count": {
               "anyOf": [
                  {
                     "type": "integer"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Number of examples to randomly sample. If None, all examples are used.",
               "title": "Sample Count"
            },
            "sample_seed": {
               "default": 42,
               "description": "Seed for random sampling. Used only if `sample_ratio` or `sample_count` is set.",
               "title": "Sample Seed",
               "type": "integer"
            },
            "process": {
               "default": true,
               "description": "Whether to process the data with the data factory `process` function (e.g., tokenization for SFTDataFactory). ",
               "title": "Process",
               "type": "boolean"
            }
         },
         "title": "DataSourceConfig",
         "type": "object"
      },
      "LoggerConfig": {
         "additionalProperties": false,
         "properties": {
            "local_rank": {
               "title": "Local Rank",
               "type": "integer"
            },
            "global_rank": {
               "title": "Global Rank",
               "type": "integer"
            },
            "world_size": {
               "title": "World Size",
               "type": "integer"
            },
            "output_dir": {
               "default": "/dev/null",
               "description": "Output directory for log files. ",
               "format": "path",
               "title": "Output Dir",
               "type": "string"
            },
            "level": {
               "default": "WARNING",
               "description": "Log level for the logger. ",
               "title": "Level",
               "type": "string"
            },
            "print_output_ranks": {
               "anyOf": [
                  {
                     "const": "*",
                     "type": "string"
                  },
                  {
                     "items": {
                        "type": "integer"
                     },
                     "type": "array"
                  }
               ],
               "default": [
                  0
               ],
               "description": "Which ranks will print logs. Either a list of ranks or \"*\" for all ranks. ",
               "title": "Print Output Ranks"
            },
            "file_output_ranks": {
               "anyOf": [
                  {
                     "const": "*",
                     "type": "string"
                  },
                  {
                     "items": {
                        "type": "integer"
                     },
                     "type": "array"
                  }
               ],
               "default": "*",
               "description": "Which ranks will output logs to a file. Either a list of ranks or \"*\" for all ranks. ",
               "title": "File Output Ranks"
            }
         },
         "title": "LoggerConfig",
         "type": "object"
      },
      "ModelConfig": {
         "additionalProperties": false,
         "properties": {
            "local_rank": {
               "title": "Local Rank",
               "type": "integer"
            },
            "global_rank": {
               "title": "Global Rank",
               "type": "integer"
            },
            "world_size": {
               "title": "World Size",
               "type": "integer"
            },
            "type": {
               "default": "",
               "description": "Model factory type. ",
               "title": "Type",
               "type": "string"
            },
            "name_or_path": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "format": "path",
                     "type": "string"
                  }
               ],
               "description": "Model name (as described in Hugging Face model hub) or local path to model checkpoint. ",
               "title": "Name Or Path"
            },
            "dtype": {
               "default": "torch.bfloat16",
               "description": "Data type for model weights. ",
               "examples": [
                  "float32",
                  "bfloat16"
               ],
               "title": "Torch Dtype",
               "type": "string"
            },
            "save_name": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Name to use when saving the model. ",
               "title": "Save Name"
            },
            "attn_implementation": {
               "default": "sdpa",
               "description": "Attention implementation to use. ",
               "title": "Attn Implementation",
               "type": "string"
            },
            "disable_activation_checkpoint": {
               "default": false,
               "description": "Disable the use of activation checkpointing. ",
               "title": "Disable Activation Checkpoint",
               "type": "boolean"
            },
            "peft_config": {
               "anyOf": [
                  {
                     "additionalProperties": true,
                     "type": "object"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Configuration for Parameter Efficient Fine Tuning. ",
               "title": "Peft Config"
            },
            "hf_config_kwargs": {
               "additionalProperties": true,
               "description": "Optional kwargs to override in the HF model config object created by `AutoConfig.from_pretrained(model.name_or_path)` ",
               "title": "Hf Config Kwargs",
               "type": "object"
            }
         },
         "required": [
            "name_or_path"
         ],
         "title": "ModelConfig",
         "type": "object"
      },
      "OptimizerConfig": {
         "additionalProperties": false,
         "properties": {
            "local_rank": {
               "title": "Local Rank",
               "type": "integer"
            },
            "global_rank": {
               "title": "Global Rank",
               "type": "integer"
            },
            "world_size": {
               "title": "World Size",
               "type": "integer"
            },
            "type": {
               "default": "",
               "description": "Optimizer factory type. Defaults to the `optimizer_factory_type` of the trainer. ",
               "title": "Type",
               "type": "string"
            },
            "weight_decay": {
               "default": 0.1,
               "description": "Coefficient for L2 regularization applied to the optimizer's weights. ",
               "minimum": 0.0,
               "title": "Weight Decay",
               "type": "number"
            },
            "betas": {
               "default": [
                  0.9,
                  0.999
               ],
               "description": "Tuple of coefficients used for computing running averages of gradient and its square (e.g., (beta1, beta2) for Adam). ",
               "maxItems": 2,
               "minItems": 2,
               "prefixItems": [
                  {
                     "type": "number"
                  },
                  {
                     "type": "number"
                  }
               ],
               "title": "Betas",
               "type": "array"
            },
            "lr": {
               "default": 0.0005,
               "description": "The initial learning rate. ",
               "minimum": 0.0,
               "title": "Lr",
               "type": "number"
            }
         },
         "title": "OptimizerConfig",
         "type": "object"
      },
      "SchedulerConfig": {
         "additionalProperties": false,
         "properties": {
            "local_rank": {
               "title": "Local Rank",
               "type": "integer"
            },
            "global_rank": {
               "title": "Global Rank",
               "type": "integer"
            },
            "world_size": {
               "title": "World Size",
               "type": "integer"
            },
            "type": {
               "default": "",
               "description": "Scheduler factory type. Defaults to the `scheduler_factory_type` of the trainer. ",
               "title": "Type",
               "type": "string"
            },
            "lr": {
               "anyOf": [
                  {
                     "type": "number"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "The initial learning rate. Deprecated in favor of `optimizer.lr`. ",
               "title": "Lr"
            }
         },
         "title": "SchedulerConfig",
         "type": "object"
      },
      "TokenizerConfig": {
         "additionalProperties": false,
         "properties": {
            "local_rank": {
               "title": "Local Rank",
               "type": "integer"
            },
            "global_rank": {
               "title": "Global Rank",
               "type": "integer"
            },
            "world_size": {
               "title": "World Size",
               "type": "integer"
            },
            "type": {
               "default": "",
               "description": "Tokenizer factory type. Defaults to the `tokenizer_factory_type` of the trainer. ",
               "title": "Type",
               "type": "string"
            },
            "name_or_path": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "format": "path",
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": "",
               "description": "Tokenizer name (as described in Hugging Face model hub) or local path directory containing tokenizer. ",
               "title": "Name Or Path"
            },
            "tokenize_kwargs": {
               "additionalProperties": true,
               "description": "Optional kwargs to be passed to tokenizer.tokenize in addition or to override the default values passed in the corresponding data factory's process function",
               "title": "Tokenize Kwargs",
               "type": "object"
            }
         },
         "title": "TokenizerConfig",
         "type": "object"
      },
      "WandBConfig": {
         "additionalProperties": false,
         "properties": {
            "local_rank": {
               "title": "Local Rank",
               "type": "integer"
            },
            "global_rank": {
               "title": "Global Rank",
               "type": "integer"
            },
            "world_size": {
               "title": "World Size",
               "type": "integer"
            },
            "enable": {
               "default": false,
               "description": "Whether to enable Weights and Biases logging. ",
               "title": "Enable",
               "type": "boolean"
            },
            "entity": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Weights and Biases entity name. ",
               "title": "Entity"
            },
            "project": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": "arctic-training",
               "description": "Weights and Biases project name. ",
               "title": "Project"
            },
            "name": {
               "anyOf": [
                  {
                     "type": "string"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Weights and Biases run name. ",
               "title": "Name"
            }
         },
         "title": "WandBConfig",
         "type": "object"
      }
   },
   "additionalProperties": false,
   "required": [
      "model",
      "data"
   ]
}
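The `CheckpointConfig` fields in the schema combine into a save schedule, and `checkpoint` is a list, so each engine has its own schedule. The sketch below is one reading of the field descriptions, under these assumptions:

- A save fires when the global step is a positive multiple of `save_every_n_steps`, or when the completed-epoch count is a multiple of `save_every_n_epochs`.
- An interval of `0` means "never".

`CheckpointSchedule` and `should_save` are hypothetical names. This is not ArcticTraining's checkpoint engine.

```python
from dataclasses import dataclass


@dataclass
class CheckpointSchedule:
    """Subset of CheckpointConfig fields, using the schema's defaults."""
    enabled: bool = True
    save_every_n_steps: int = 0
    save_every_n_epochs: int = 0
    save_end_of_training: bool = False


def should_save(cfg: CheckpointSchedule, global_step: int, epochs_done: int,
                end_of_epoch: bool = False, end_of_training: bool = False) -> bool:
    """Hypothetical trigger logic; an interval of 0 is assumed to mean 'never'."""
    if not cfg.enabled:
        return False
    if end_of_training and cfg.save_end_of_training:
        return True
    if (cfg.save_every_n_steps > 0 and global_step > 0
            and global_step % cfg.save_every_n_steps == 0):
        return True
    if (end_of_epoch and cfg.save_every_n_epochs > 0
            and epochs_done % cfg.save_every_n_epochs == 0):
        return True
    return False


every_100 = CheckpointSchedule(save_every_n_steps=100)
print([s for s in range(1, 301) if should_save(every_100, s, epochs_done=0)])  # [100, 200, 300]
```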

Config:
  • extra: str = forbid

  • use_enum_values: bool = True

  • validate_default: bool = True

  • use_attribute_docstrings: bool = True

  • populate_by_name: bool = True

  • validate_by_alias: bool = True

  • validate_by_name: bool = True
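With `extra` set to `forbid` (mirrored by `"additionalProperties": false` in the schema), a key that is not a declared field is rejected rather than silently ignored. This catches typos in YAML files. Pydantic reports such keys with a `ValidationError`.

The stdlib sketch below imitates that check for top-level keys only. `KNOWN_FIELDS` is copied from the schema's `properties`, and aliases accepted through `populate_by_name` are not modelled. `check_unknown_keys` is a hypothetical helper, not part of ArcticTraining.

```python
import difflib

# Top-level field names copied from the TrainerConfig schema "properties".
KNOWN_FIELDS = {
    "local_rank", "global_rank", "world_size", "type", "code", "skip_validation",
    "model", "tokenizer", "data", "logger", "wandb", "scheduler", "optimizer",
    "deepspeed", "epochs", "loss_log_interval", "train_log_iter_interval",
    "train_log_metrics_path", "gradient_accumulation_steps", "micro_batch_size",
    "sequence_parallel_size", "activation_checkpoint_cpu_offload",
    "tiled_mlp_compute", "seed", "checkpoint", "train_iters", "eval_interval",
    "eval_log_iter_interval", "exit_iteration", "exit_iteration_this_run",
    "min_iterations", "overfit_first_batch", "mem_profiler", "mem_profiler_dir",
    "mem_profiler_max_entries", "kill_switch_path",
}


def check_unknown_keys(config: dict) -> list[str]:
    """Return one error message per top-level key that is not a declared field."""
    errors = []
    for key in sorted(config):
        if key not in KNOWN_FIELDS:
            # Suggest the closest declared field name, if any is similar enough.
            hint = difflib.get_close_matches(key, sorted(KNOWN_FIELDS), n=1)
            suffix = f" (did you mean '{hint[0]}'?)" if hint else ""
            errors.append(f"{key}: extra inputs are not permitted{suffix}")
    return errors


print(check_unknown_keys({"micro_batch_sise": 4, "epochs": 2}))
```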

Fields:
Validators:
  • build_deepspeed_config » all fields

  • coerce_deepspeed_human_friendly_values » deepspeed

  • init_checkpoint_configs » checkpoint

  • init_data_config » data

  • init_dist » all fields

  • init_model_config » model

  • init_optimizer_config » optimizer

  • init_scheduler_config » scheduler

  • init_tokenizer_config » tokenizer

  • initialize_logger » logger

  • mem_profiler_mkdir » all fields

  • set_max_length » all fields

  • set_tokenizer » all fields

  • train_log_metrics_path_prep » all fields

  • validate_eval_interval » all fields

  • validate_sft_sample_packing » all fields

  • validate_single_checkpoint_resume » all fields
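The `deepspeed` field is filled in automatically when the user leaves it empty, via the `build_deepspeed_config` validator listed above. The exact dict ArcticTraining builds is not documented here.

The sketch below only illustrates the batch-size arithmetic that DeepSpeed itself enforces between its `train_batch_size`, `train_micro_batch_size_per_gpu` and `gradient_accumulation_steps` keys. Two parts are assumptions: using `world_size // sequence_parallel_size` as the data-parallel size, and letting user-provided keys take precedence. `build_deepspeed_config_sketch` is hypothetical.

```python
def build_deepspeed_config_sketch(
    deepspeed: dict,
    micro_batch_size: int = 1,
    gradient_accumulation_steps: int = 1,
    world_size: int = 1,
    sequence_parallel_size: int = 1,
) -> dict:
    """Fill the batch-size keys DeepSpeed expects, keeping any user-provided values."""
    if world_size % sequence_parallel_size != 0:
        raise ValueError("world_size must be divisible by sequence_parallel_size")
    # Assumption: ranks within one sequence-parallel group share a batch,
    # so only world_size // sequence_parallel_size ranks are data-parallel.
    dp_size = world_size // sequence_parallel_size
    ds = dict(deepspeed)  # copy so the caller's dict is not mutated
    ds.setdefault("train_micro_batch_size_per_gpu", micro_batch_size)
    ds.setdefault("gradient_accumulation_steps", gradient_accumulation_steps)
    # DeepSpeed requires: train_batch_size = micro batch * grad accum steps * DP size.
    ds.setdefault(
        "train_batch_size",
        ds["train_micro_batch_size_per_gpu"] * ds["gradient_accumulation_steps"] * dp_size,
    )
    return ds


print(build_deepspeed_config_sketch({}, micro_batch_size=4, gradient_accumulation_steps=2, world_size=8))
```

With 8 ranks, a micro batch of 4 and 2 accumulation steps, this yields a global batch of 64. Under the sequence-parallel assumption, setting `sequence_parallel_size=2` would halve it to 32.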

field type: str = 'sft'

Trainer type.

Validated by:
  • build_deepspeed_config

  • init_dist

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume

field code: Path = PosixPath('train.py')

Path to the python script containing custom trainer implementation.

Validated by:
  • build_deepspeed_config

  • init_dist

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume

field skip_validation: bool = False

Skips validation of types for subconfigs and registered classes.

Validated by:
  • build_deepspeed_config

  • init_dist

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume

field model: ModelConfig [Required]

Model configuration.

Validated by:
  • build_deepspeed_config

  • init_dist

  • init_model_config

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume

field tokenizer: TokenizerConfig [Optional]

Tokenizer configuration.

Validated by:
  • build_deepspeed_config

  • init_dist

  • init_tokenizer_config

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume

field data: DataConfig [Required]

Train and eval data configuration.

Validated by:
  • build_deepspeed_config

  • init_data_config

  • init_dist

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume

field logger: LoggerConfig [Optional]

Logger configuration.

Validated by:
  • build_deepspeed_config

  • init_dist

  • initialize_logger

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume

field wandb: WandBConfig [Optional]

Weights and Biases configuration.

Validated by:
  • build_deepspeed_config

  • init_dist

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume

field scheduler: SchedulerConfig [Optional]

Scheduler configuration.

Validated by:
  • build_deepspeed_config

  • init_dist

  • init_scheduler_config

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume

field optimizer: OptimizerConfig [Optional]

Optimizer configuration.

Validated by:
  • build_deepspeed_config

  • init_dist

  • init_optimizer_config

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume

field deepspeed: Dict[str, Any] = {}

DeepSpeed config dict. It is filled in automatically if the user does not provide one.

Validated by:
  • build_deepspeed_config

  • coerce_deepspeed_human_friendly_values

  • init_dist

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume

field epochs: int = 1

Number of epochs to train.

Constraints:
  • ge = 0

Validated by:
  • build_deepspeed_config

  • init_dist

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume

field loss_log_interval: Annotated[int] = 1

Number of steps between logging loss.

Constraints:
  • ge = 0

  • func = parse_human_val

Validated by:
  • build_deepspeed_config

  • init_dist

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume

field train_log_iter_interval: Literal[0, 1] = 1

Iterations between training metric log outputs. 0 disables logging; only an interval of 1 is currently supported.

Validated by:
  • build_deepspeed_config

  • init_dist

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume

field train_log_metrics_path: Path = PosixPath('train-log-metrics.jsonl')

Path of the .jsonl file to which precise metrics are logged according to the train_log_iter_interval schedule. Defaults to ./train-log-metrics.jsonl.

Validated by:
  • build_deepspeed_config

  • init_dist

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume
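train_log_metrics_path points at a JSON Lines file, which holds one JSON object per line. A minimal sketch of reading such a file back for analysis; read_jsonl is a hypothetical helper, and the record keys in the demo (iter, loss) are made up, since this page does not document the per-line schema.

```python
import json
import tempfile
from pathlib import Path
from typing import Any, Dict, List


def read_jsonl(path: Path) -> List[Dict[str, Any]]:
    """Parse a JSON Lines file: one JSON object per non-empty line."""
    records: List[Dict[str, Any]] = []
    with path.open("r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines, e.g. a trailing newline
                records.append(json.loads(line))
    return records


# Demo with made-up records; the real keys depend on the trainer.
with tempfile.TemporaryDirectory() as tmp:
    metrics_path = Path(tmp) / "train-log-metrics.jsonl"
    metrics_path.write_text('{"iter": 1, "loss": 2.5}\n{"iter": 2, "loss": 2.1}\n')
    print([row["loss"] for row in read_jsonl(metrics_path)])  # [2.5, 2.1]
```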

field gradient_accumulation_steps: int = 1

Number of gradient accumulation steps.

Constraints:
  • ge = 1

Validated by:
  • build_deepspeed_config

  • init_dist

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume

field micro_batch_size: int = 1

Micro batch size per GPU.

Constraints:
  • ge = 1

Validated by:
  • build_deepspeed_config

  • init_dist

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume

field sequence_parallel_size: int = 1

Sequence parallelism degree. Sequence parallelism is disabled when set to 1.

Constraints:
  • ge = 1

Validated by:
  • build_deepspeed_config

  • init_dist

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume
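Because micro_batch_size is per GPU, the number of samples consumed per optimizer step also depends on gradient_accumulation_steps and on how many ranks see distinct data. The sketch below assumes that all ranks in one sequence-parallel group share a single sample, so the data-parallel degree is world_size // sequence_parallel_size; global_batch_size is a hypothetical helper, not part of ArcticTraining. The product it computes follows the way DeepSpeed relates its train_batch_size, train_micro_batch_size_per_gpu and gradient_accumulation_steps config keys.

```python
def global_batch_size(
    micro_batch_size: int,
    gradient_accumulation_steps: int,
    world_size: int,
    sequence_parallel_size: int = 1,
) -> int:
    """Samples per optimizer step across all ranks (hypothetical helper)."""
    if world_size % sequence_parallel_size != 0:
        raise ValueError("world_size must be divisible by sequence_parallel_size")
    # Assumption: the ranks of one sequence-parallel group process a single
    # sample together, so only world_size / sp ranks see distinct data.
    data_parallel_size = world_size // sequence_parallel_size
    return micro_batch_size * gradient_accumulation_steps * data_parallel_size


# 8 GPUs, no sequence parallelism: 2 * 4 * 8 = 64 samples per step.
print(global_batch_size(2, 4, world_size=8))  # 64
# Same GPUs with sequence_parallel_size=4: only 2 data-parallel ranks.
print(global_batch_size(2, 4, world_size=8, sequence_parallel_size=4))  # 16
```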

field activation_checkpoint_cpu_offload: bool = False

Offload activation checkpoint tensors to CPU memory. This enables much longer sequence lengths, but offers little benefit for sequences shorter than 64k tokens.

Validated by:
  • build_deepspeed_config

  • init_dist

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume

field tiled_mlp_compute: bool = False

Tile the MLP computation to save GPU memory. Only a limited set of architectures is currently supported, though support can be extended to others.

Validated by:
  • build_deepspeed_config

  • init_dist

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume

field seed: int = 42

Random seed for numpy, Python's random module, torch, and transformers.

Constraints:
  • ge = 0

Validated by:
  • build_deepspeed_config

  • init_dist

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume

field checkpoint: List[CheckpointConfig] = []

Checkpoint configurations. Multiple checkpoint engines may be used together.

Validated by:
  • build_deepspeed_config

  • init_checkpoint_configs

  • init_dist

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume

field train_iters: Annotated[int] = 0

Maximum number of training iterations.

Constraints:
  • ge = 0

  • func = parse_human_val

Validated by:
  • build_deepspeed_config

  • init_dist

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume

field eval_interval: Annotated[int] = 0

Number of iterations between evaluations. If 0, no evaluation is performed.

Constraints:
  • ge = 0

  • func = parse_human_val

Validated by:
  • build_deepspeed_config

  • init_dist

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume
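A minimal sketch of how an interval setting such as eval_interval is commonly applied, assuming 1-based iteration counts and that evaluation runs whenever the iteration is a multiple of the interval; should_evaluate is hypothetical, and the exact trigger point inside ArcticTraining is not documented on this page.

```python
def should_evaluate(iteration: int, eval_interval: int) -> bool:
    """Return True if evaluation should run after this (1-based) iteration."""
    if eval_interval == 0:  # 0 disables evaluation, per the field description
        return False
    return iteration % eval_interval == 0


print([i for i in range(1, 11) if should_evaluate(i, 4)])  # [4, 8]
```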

field eval_log_iter_interval: Annotated[int] = 1

Iterations between eval metric log outputs. 0 disables eval metric logging.

Constraints:
  • ge = 0

  • func = parse_human_val

Validated by:
  • build_deepspeed_config

  • init_dist

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume

field exit_iteration: int = 0

Stop training after the specified iteration count, even if data and epochs remain (useful for debugging and tests).

Constraints:
  • ge = 0

Validated by:
  • build_deepspeed_config

  • init_dist

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume

field exit_iteration_this_run: int = 0

Force training to exit after the specified number of iterations in this run. After resuming, training continues until exit_iteration is reached or the data/epochs run out (useful for debugging and tests).

Constraints:
  • ge = 0

Validated by:
  • build_deepspeed_config

  • init_dist

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume
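A minimal sketch of the two early-exit limits, assuming (as the 0 defaults suggest) that 0 disables a limit, and that the trainer tracks both a global iteration count and a count of iterations completed in the current run. should_exit is a hypothetical helper; ArcticTraining's real checks may be ordered or counted differently.

```python
def should_exit(
    global_iteration: int,
    iterations_this_run: int,
    exit_iteration: int = 0,
    exit_iteration_this_run: int = 0,
) -> bool:
    """Hypothetical early-exit check; a limit of 0 is treated as disabled."""
    if exit_iteration > 0 and global_iteration >= exit_iteration:
        return True  # overall cap across all runs reached
    if exit_iteration_this_run > 0 and iterations_this_run >= exit_iteration_this_run:
        return True  # per-run cap reached
    return False


print(should_exit(3, 3, exit_iteration=10, exit_iteration_this_run=3))  # True
print(should_exit(5, 2, exit_iteration=10, exit_iteration_this_run=3))  # False
print(should_exit(10, 1, exit_iteration=10))  # True
```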

field min_iterations: Annotated[int] = 0

When >0, the training dataset will be replicated until there is enough data to run this many iterations.

Constraints:
  • ge = 0

  • func = parse_human_val

Validated by:
  • build_deepspeed_config

  • init_dist

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume
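min_iterations pads a short training dataset by repetition. The sketch below assumes a fixed number of samples consumed per iteration (for example, the global batch size) and rounds the number of copies up so that at least min_iterations iterations are available; replicate_to_min_iterations is hypothetical, and the real implementation may replicate a dataset object rather than a list.

```python
import math
from typing import List, TypeVar

T = TypeVar("T")


def replicate_to_min_iterations(
    samples: List[T], samples_per_iteration: int, min_iterations: int
) -> List[T]:
    """Repeat the dataset until it covers at least min_iterations iterations."""
    if min_iterations <= 0 or not samples:
        return list(samples)  # feature disabled (or nothing to repeat)
    needed = min_iterations * samples_per_iteration
    copies = math.ceil(needed / len(samples))  # round up to whole copies
    return list(samples) * copies


# 10 samples, 4 per iteration, 6 iterations -> 24 needed -> 3 copies.
print(len(replicate_to_min_iterations(list(range(10)), 4, 6)))  # 30
```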

field overfit_first_batch: bool = False

Train only on repetitions of the first training batch. Useful for development.

Validated by:
  • build_deepspeed_config

  • init_dist

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume

field mem_profiler: Literal[None, 'step', 'e2e'] = None

Enable memory profiling.

Validated by:
  • build_deepspeed_config

  • init_dist

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume

field mem_profiler_dir: Path [Optional]

Path to save memory profiling results. Defaults to logger.output_dir/mem-prof.

Validated by:
  • build_deepspeed_config

  • init_dist

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume

field mem_profiler_max_entries: Annotated[int] = 100000

Maximum number of entries to store in the memory profiler.

Constraints:
  • ge = 1

  • func = parse_human_val

Validated by:
  • build_deepspeed_config

  • init_dist

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume

field kill_switch_path: Path = PosixPath('/tmp/at_kill_switch')

Path to a file that can be used to trigger a graceful shutdown mid-training (sets early exit to True).

Validated by:
  • build_deepspeed_config

  • init_dist

  • mem_profiler_mkdir

  • set_max_length

  • set_tokenizer

  • train_log_metrics_path_prep

  • validate_eval_interval

  • validate_sft_sample_packing

  • validate_single_checkpoint_resume
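A toy sketch of a kill-switch check, assuming the trainer polls for the file's existence before each iteration and stops once it appears (an operator would presumably create it with something like touch /tmp/at_kill_switch). The train loop and its trigger_at parameter are hypothetical and exist only to simulate that; they are not ArcticTraining code.

```python
import tempfile
from pathlib import Path


def kill_switch_triggered(kill_switch_path: Path) -> bool:
    """True once a file exists at the kill-switch path."""
    return kill_switch_path.exists()


def train(num_iterations: int, kill_switch_path: Path, trigger_at: int = -1) -> int:
    """Toy loop that checks the kill switch before every iteration.

    trigger_at simulates an operator creating the file mid-run.
    Returns the number of completed iterations.
    """
    completed = 0
    for iteration in range(num_iterations):
        if iteration == trigger_at:
            kill_switch_path.touch()
        if kill_switch_triggered(kill_switch_path):
            break  # graceful shutdown: do not start a new iteration
        completed += 1
    return completed


switch = Path(tempfile.mkdtemp()) / "at_kill_switch"
print(train(10, switch, trigger_at=4))  # 4
```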

pydantic model arctic_training.config.checkpoint.CheckpointConfig

Bases: BaseConfig

JSON schema:
{
   "title": "CheckpointConfig",
   "type": "object",
   "properties": {
      "local_rank": {
         "title": "Local Rank",
         "type": "integer"
      },
      "global_rank": {
         "title": "Global Rank",
         "type": "integer"
      },
      "world_size": {
         "title": "World Size",
         "type": "integer"
      },
      "type": {
         "default": "",
         "description": "Checkpoint engine type. ",
         "title": "Type",
         "type": "string"
      },
      "output_dir": {
         "description": "Checkpoint output directory. If directory does not exist, it will be created. ",
         "format": "path",
         "title": "Output Dir",
         "type": "string"
      },
      "enabled": {
         "default": true,
         "description": "Enable this checkpoint engine. ",
         "title": "Enabled",
         "type": "boolean"
      },
      "auto_resume": {
         "default": false,
         "description": "If a checkpoint is found in the output directory, resume training from that checkpoint. ",
         "title": "Auto Resume",
         "type": "boolean"
      },
      "save_every_n_steps": {
         "default": 0,
         "description": "How often to trigger a checkpoint save by training global step count. ",
         "minimum": 0,
         "title": "Save Every N Steps",
         "type": "integer"
      },
      "save_every_n_epochs": {
         "default": 0,
         "description": "How often to trigger a checkpoint save by training epoch count. ",
         "minimum": 0,
         "title": "Save Every N Epochs",
         "type": "integer"
      },
      "save_end_of_training": {
         "default": false,
         "description": "Whether to save a checkpoint at the end of training. ",
         "title": "Save End Of Training",
         "type": "boolean"
      }
   },
   "additionalProperties": false,
   "required": [
      "output_dir"
   ]
}

Config:
  • extra: str = forbid

  • use_enum_values: bool = True

  • validate_default: bool = True

  • use_attribute_docstrings: bool = True

  • populate_by_name: bool = True

  • validate_by_alias: bool = True

  • validate_by_name: bool = True

field type: str = ''

Checkpoint engine type.

field output_dir: Path [Required]

Checkpoint output directory. If directory does not exist, it will be created.

Validated by:
  • resolve_output_dir

field enabled: bool = True

Enable this checkpoint engine.

field auto_resume: bool = False

If a checkpoint is found in the output directory, resume training from that checkpoint.

field save_every_n_steps: Annotated[int] = 0

How often to trigger a checkpoint save by training global step count.

Constraints:
  • ge = 0

  • func = parse_human_val

field save_every_n_epochs: Annotated[int] = 0

How often to trigger a checkpoint save by training epoch count.

Constraints:
  • ge = 0

  • func = parse_human_val

field save_end_of_training: bool = False

Whether to save a checkpoint at the end of training.
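The three save triggers can be pictured as independent checks. A minimal sketch, assuming 1-based step and epoch counts and that 0 disables an interval (both defaults are 0); CheckpointSchedule is a hypothetical stand-in, not the checkpoint engine's real logic.

```python
from dataclasses import dataclass


@dataclass
class CheckpointSchedule:
    """Hypothetical mirror of the three save-trigger fields above."""

    save_every_n_steps: int = 0
    save_every_n_epochs: int = 0
    save_end_of_training: bool = False

    def save_after_step(self, global_step: int) -> bool:
        # 0 means step-based saving is off.
        return self.save_every_n_steps > 0 and global_step % self.save_every_n_steps == 0

    def save_after_epoch(self, epoch: int) -> bool:
        # 0 means epoch-based saving is off.
        return self.save_every_n_epochs > 0 and epoch % self.save_every_n_epochs == 0


schedule = CheckpointSchedule(save_every_n_steps=500)
print([s for s in range(1, 1501) if schedule.save_after_step(s)])  # [500, 1000, 1500]
print(schedule.save_after_epoch(1))  # False
```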

pydantic model arctic_training.config.data.DataConfig

Bases: BaseConfig

JSON schema:
{
   "title": "DataConfig",
   "type": "object",
   "properties": {
      "local_rank": {
         "title": "Local Rank",
         "type": "integer"
      },
      "global_rank": {
         "title": "Global Rank",
         "type": "integer"
      },
      "world_size": {
         "title": "World Size",
         "type": "integer"
      },
      "type": {
         "default": "",
         "description": "Data factory type. Defaults to the `data_factory_type` in the trainer. ",
         "title": "Type",
         "type": "string"
      },
      "sources": {
         "description": "List of data sources to use for training. These must be registered `DataSource`. ",
         "items": {
            "$ref": "#/$defs/DataSourceConfig"
         },
         "title": "Sources",
         "type": "array"
      },
      "eval_sources": {
         "default": [],
         "description": "list of data sources to use for evaluation. These must be registered `DataSource`. ",
         "items": {
            "$ref": "#/$defs/DataSourceConfig"
         },
         "title": "Eval Sources",
         "type": "array"
      },
      "train_eval_split": {
         "default": [
            1.0,
            0.0
         ],
         "description": "How much of the training data to use for evaluation. ",
         "maxItems": 2,
         "minItems": 2,
         "prefixItems": [
            {
               "type": "number"
            },
            {
               "type": "number"
            }
         ],
         "title": "Train Eval Split",
         "type": "array"
      },
      "max_length": {
         "default": 8192,
         "description": "Maximum length of the input sequence. ",
         "title": "Max Length",
         "type": "integer"
      },
      "num_proc": {
         "default": 16,
         "description": "Number of processes to use for data loading. ",
         "title": "Num Proc",
         "type": "integer"
      },
      "dl_num_workers": {
         "default": 2,
         "description": "Number of DL workers per gpu. ",
         "title": "Dl Num Workers",
         "type": "integer"
      },
      "seed": {
         "default": 42,
         "description": "Seed for data loading. ",
         "title": "Seed",
         "type": "integer"
      },
      "use_data_cache": {
         "anyOf": [
            {
               "type": "boolean"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Whether to cache loaded data. ",
         "title": "Use Data Cache"
      },
      "cache_processed_data": {
         "anyOf": [
            {
               "type": "boolean"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Deprecated, please use \"use_data_cache\". ",
         "title": "Cache Processed Data"
      },
      "cache_dir": {
         "default": "/tmp",
         "description": "Directory to store cached data. ",
         "format": "path",
         "title": "Cache Dir",
         "type": "string"
      },
      "cache_fs_type": {
         "default": "auto",
         "enum": [
            "auto",
            "local",
            "shared"
         ],
         "title": "Cache Fs Type",
         "type": "string"
      },
      "fail_on_missing_cache": {
         "default": false,
         "description": "Whether to fail if the cache is missing. ",
         "title": "Fail On Missing Cache",
         "type": "boolean"
      }
   },
   "$defs": {
      "DataSourceConfig": {
         "additionalProperties": false,
         "description": "Base DataSource configuration.",
         "properties": {
            "local_rank": {
               "title": "Local Rank",
               "type": "integer"
            },
            "global_rank": {
               "title": "Global Rank",
               "type": "integer"
            },
            "world_size": {
               "title": "World Size",
               "type": "integer"
            },
            "type": {
               "default": "",
               "description": "Data source type. Defaults to 'huggingface' if only a dataset name or path is provided.",
               "title": "Type",
               "type": "string"
            },
            "split": {
               "default": "",
               "description": "Which split to load for a given data source. This will be automatically set to either \"train\" or \"eval\" if no value is passed.\n\nFor HFDataSource, this can be any value supported by Dataset slice splits:\nhttps://huggingface.co/docs/datasets/en/loading#slice-splits.",
               "title": "Split",
               "type": "string"
            },
            "sample_ratio": {
               "anyOf": [
                  {
                     "type": "number"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Ratio of the dataset to randomly sample. If None, all examples are used.",
               "title": "Sample Ratio"
            },
            "sample_count": {
               "anyOf": [
                  {
                     "type": "integer"
                  },
                  {
                     "type": "null"
                  }
               ],
               "default": null,
               "description": "Number of examples to randomly sample. If None, all examples are used.",
               "title": "Sample Count"
            },
            "sample_seed": {
               "default": 42,
               "description": "Seed for random sampling. Used only if `sample_ratio` or `sample_count` is set.",
               "title": "Sample Seed",
               "type": "integer"
            },
            "process": {
               "default": true,
               "description": "Whether to process the data with the data factory `process` function (e.g., tokenization for SFTDataFactory). ",
               "title": "Process",
               "type": "boolean"
            }
         },
         "title": "DataSourceConfig",
         "type": "object"
      }
   },
   "additionalProperties": false,
   "required": [
      "sources"
   ]
}

Config:
  • extra: str = forbid

  • use_enum_values: bool = True

  • validate_default: bool = True

  • use_attribute_docstrings: bool = True

  • populate_by_name: bool = True

  • validate_by_alias: bool = True

  • validate_by_name: bool = True

field type: str = ''

Data factory type. Defaults to the data_factory_type in the trainer.

Validated by:
  • set_cache_fs_type

  • validate_cache_dir

  • validate_train_eval_split

field sources: List[DataSourceConfig] [Required]

List of data sources to use for training. These must be registered DataSource types.

Validated by:
  • init_source_configs

field eval_sources: List[DataSourceConfig] = []

List of data sources to use for evaluation. These must be registered DataSource types.

Validated by:
  • init_source_configs

field train_eval_split: Tuple[float, float] = (1.0, 0.0)

How much of the training data to use for evaluation.

Validated by:
  • set_cache_fs_type

  • validate_cache_dir

  • validate_train_eval_split

field max_length: Annotated[int] = 8192

Maximum length of the input sequence.

Constraints:
  • func = parse_human_val

Validated by:
  • set_cache_fs_type

  • validate_cache_dir

  • validate_train_eval_split

field num_proc: int = 16

Number of processes to use for data loading.

Validated by:
  • set_cache_fs_type

  • validate_cache_dir

  • validate_train_eval_split

field dl_num_workers: int = 2

Number of DataLoader workers per GPU.

Validated by:
  • set_cache_fs_type

  • validate_cache_dir

  • validate_train_eval_split

field seed: int = 42

Seed for data loading.

Validated by:
  • set_cache_fs_type

  • validate_cache_dir

  • validate_train_eval_split

field use_data_cache: Optional[bool] = None

Whether to cache loaded data.

Validated by:
  • deprecate_cache_processed_data

  • set_cache_fs_type

  • validate_cache_dir

  • validate_train_eval_split

field cache_processed_data: Optional[bool] = None

Deprecated, please use “use_data_cache”.

Validated by:
  • deprecate_cache_processed_data

  • set_cache_fs_type

  • validate_cache_dir

  • validate_train_eval_split

field cache_dir: Path = PosixPath('/tmp')

Directory to store cached data.

Validated by:
  • resolve_cache_dir

  • set_cache_fs_type

  • validate_cache_dir

  • validate_train_eval_split

field fail_on_missing_cache: bool = False

Whether to fail if the cache is missing.

Validated by:
  • set_cache_fs_type

  • validate_cache_dir

  • validate_train_eval_split

validator init_source_configs » eval_sources, sources

Convert string and dict input to correct subclass of DataSourceConfig.

Return type:

List[DataSourceConfig]

Parameters:
  • v (List[str | Dict | DataSourceConfig])

  • info (ValidationInfo)
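A simplified sketch of the kind of normalization init_source_configs describes: bare strings become a Hugging Face source (matching the documented default for a lone name or path), dicts are dispatched to a config class by their type key, and existing config objects pass through. The classes, the SOURCE_TYPES registry, the "local" type, and the name_or_path field are all hypothetical stand-ins, not ArcticTraining's DataSourceConfig subclasses.

```python
from dataclasses import dataclass
from typing import Dict, List, Type, Union


@dataclass
class SourceSpec:
    """Hypothetical, simplified stand-in for DataSourceConfig."""

    name_or_path: str
    split: str = ""


@dataclass
class HFSourceSpec(SourceSpec):
    """Stand-in for a Hugging Face data source config."""


@dataclass
class LocalSourceSpec(SourceSpec):
    """Stand-in for some other registered source type."""


# Hypothetical registry mapping a "type" string to a config subclass.
SOURCE_TYPES: Dict[str, Type[SourceSpec]] = {
    "huggingface": HFSourceSpec,
    "local": LocalSourceSpec,
}


def normalize_sources(raw: List[Union[str, Dict, SourceSpec]]) -> List[SourceSpec]:
    specs: List[SourceSpec] = []
    for item in raw:
        if isinstance(item, SourceSpec):
            specs.append(item)  # already a config object: keep as-is
        elif isinstance(item, str):
            # A bare dataset name or path defaults to the "huggingface" type.
            specs.append(HFSourceSpec(name_or_path=item))
        elif isinstance(item, dict):
            fields = dict(item)
            type_name = fields.pop("type")  # this sketch requires an explicit type
            if type_name not in SOURCE_TYPES:
                raise ValueError(f"unregistered data source type: {type_name!r}")
            specs.append(SOURCE_TYPES[type_name](**fields))
        else:
            raise TypeError(f"unsupported data source entry: {item!r}")
    return specs


sources = normalize_sources(
    ["my-org/my-dataset", {"type": "local", "name_or_path": "data/train.jsonl", "split": "train"}]
)
print([type(s).__name__ for s in sources])  # ['HFSourceSpec', 'LocalSourceSpec']
```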

Note

If data.max_length is not set in your configuration, it is automatically set to model.config.max_position_embeddings from the Hugging Face model config, if that attribute is available. If your model config lacks this attribute, you must set max_length manually to avoid errors from sequences longer than the model was built to handle.
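The fallback rule in this note can be written as a small function. A minimal sketch, assuming the model config is available as a plain mapping rather than a transformers config object; resolve_max_length is hypothetical, and ArcticTraining's actual handling (the set_max_length validator on TrainerConfig) may differ in detail.

```python
from typing import Any, Mapping, Optional


def resolve_max_length(max_length: Optional[int], model_config: Mapping[str, Any]) -> int:
    """Pick the sequence length limit following the documented fallback."""
    if max_length is not None:
        return max_length  # an explicit data.max_length always wins
    if "max_position_embeddings" in model_config:
        return int(model_config["max_position_embeddings"])
    raise ValueError(
        "data.max_length is not set and the model config has no "
        "max_position_embeddings; set data.max_length explicitly."
    )


print(resolve_max_length(None, {"max_position_embeddings": 32768}))  # 32768
print(resolve_max_length(4096, {"max_position_embeddings": 32768}))  # 4096
```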

pydantic model arctic_training.config.logger.LoggerConfig

Bases: BaseConfig

JSON schema:
{
   "title": "LoggerConfig",
   "type": "object",
   "properties": {
      "local_rank": {
         "title": "Local Rank",
         "type": "integer"
      },
      "global_rank": {
         "title": "Global Rank",
         "type": "integer"
      },
      "world_size": {
         "title": "World Size",
         "type": "integer"
      },
      "output_dir": {
         "default": "/dev/null",
         "description": "Output directory for log files. ",
         "format": "path",
         "title": "Output Dir",
         "type": "string"
      },
      "level": {
         "default": "WARNING",
         "description": "Log level for the logger. ",
         "title": "Level",
         "type": "string"
      },
      "print_output_ranks": {
         "anyOf": [
            {
               "const": "*",
               "type": "string"
            },
            {
               "items": {
                  "type": "integer"
               },
               "type": "array"
            }
         ],
         "default": [
            0
         ],
         "description": "Which ranks will print logs. Either a list of ranks or \"*\" for all ranks. ",
         "title": "Print Output Ranks"
      },
      "file_output_ranks": {
         "anyOf": [
            {
               "const": "*",
               "type": "string"
            },
            {
               "items": {
                  "type": "integer"
               },
               "type": "array"
            }
         ],
         "default": "*",
         "description": "Which ranks will output logs to a file. Either a list of ranks or \"*\" for all ranks. ",
         "title": "File Output Ranks"
      }
   },
   "additionalProperties": false
}

Config:
  • extra: str = forbid

  • use_enum_values: bool = True

  • validate_default: bool = True

  • use_attribute_docstrings: bool = True

  • populate_by_name: bool = True

  • validate_by_alias: bool = True

  • validate_by_name: bool = True

Validators:
  • fill_output_ranks » all fields

  • set_wandb_output_dir » all fields

field output_dir: Path = PosixPath('/dev/null')

Output directory for log files.

Validated by:
  • fill_output_ranks

  • set_wandb_output_dir

field level: str = 'WARNING'

Log level for the logger.

Validated by:
  • fill_output_ranks

  • set_wandb_output_dir

field print_output_ranks: Union[Literal['*'], List[int]] = [0]

Which ranks will print logs. Either a list of ranks or “*” for all ranks.

Validated by:
  • fill_output_ranks

  • set_wandb_output_dir

field file_output_ranks: Union[Literal['*'], List[int]] = '*'

Which ranks will output logs to a file. Either a list of ranks or “*” for all ranks.

Validated by:
  • fill_output_ranks

  • set_wandb_output_dir
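Both print_output_ranks and file_output_ranks accept either "*" or a list of ranks. A minimal sketch of how such a value selects ranks; rank_enabled is a hypothetical helper, not part of the logger's API.

```python
from typing import List, Literal, Union

RankSpec = Union[Literal["*"], List[int]]


def rank_enabled(rank: int, spec: RankSpec) -> bool:
    """True if `rank` is selected by a print_output_ranks/file_output_ranks value."""
    if spec == "*":
        return True  # "*" selects every rank
    return rank in spec


world_size = 4
print([r for r in range(world_size) if rank_enabled(r, [0])])  # [0]
print([r for r in range(world_size) if rank_enabled(r, "*")])  # [0, 1, 2, 3]
```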

pydantic model arctic_training.config.model.ModelConfig

Bases: BaseConfig

JSON schema:
{
   "title": "ModelConfig",
   "type": "object",
   "properties": {
      "local_rank": {
         "title": "Local Rank",
         "type": "integer"
      },
      "global_rank": {
         "title": "Global Rank",
         "type": "integer"
      },
      "world_size": {
         "title": "World Size",
         "type": "integer"
      },
      "type": {
         "default": "",
         "description": "Model factory type. ",
         "title": "Type",
         "type": "string"
      },
      "name_or_path": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "format": "path",
               "type": "string"
            }
         ],
         "description": "Model name (as described in Hugging Face model hub) or local path to model checkpoint. ",
         "title": "Name Or Path"
      },
      "dtype": {
         "default": "torch.bfloat16",
         "description": "Data type for model weights. ",
         "examples": [
            "float32",
            "bfloat16"
         ],
         "title": "Torch Dtype",
         "type": "string"
      },
      "save_name": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Name to use when saving the model. ",
         "title": "Save Name"
      },
      "attn_implementation": {
         "default": "sdpa",
         "description": "Attention implementation to use. ",
         "title": "Attn Implementation",
         "type": "string"
      },
      "disable_activation_checkpoint": {
         "default": false,
         "description": "Disable the use of activation checkpointing. ",
         "title": "Disable Activation Checkpoint",
         "type": "boolean"
      },
      "peft_config": {
         "anyOf": [
            {
               "additionalProperties": true,
               "type": "object"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Configuration for Parameter Efficient Fine Tuning. ",
         "title": "Peft Config"
      },
      "hf_config_kwargs": {
         "additionalProperties": true,
         "description": "Optional kwargs to override in the HF model config object created by `AutoConfig.from_pretrained(model.name_or_path)` ",
         "title": "Hf Config Kwargs",
         "type": "object"
      }
   },
   "additionalProperties": false,
   "required": [
      "name_or_path"
   ]
}

Config:
  • extra: str = forbid

  • use_enum_values: bool = True

  • validate_default: bool = True

  • use_attribute_docstrings: bool = True

  • populate_by_name: bool = True

  • validate_by_alias: bool = True

  • validate_by_name: bool = True

field type: str = ''

Model factory type.

field name_or_path: Union[str, Path] [Required]

Model name (as listed on the Hugging Face Model Hub) or local path to a model checkpoint.

field dtype: DType = DType.BF16

Data type for model weights.

field save_name: Optional[str] = None

Name to use when saving the model.

field attn_implementation: str = 'sdpa'

Attention implementation to use.

Validated by:
  • validate_attn_implementation

field disable_activation_checkpoint: bool = False

Disable the use of activation checkpointing.

field peft_config: Optional[Dict] = None

Configuration for Parameter Efficient Fine Tuning.

Validated by:
  • validate_peft_config_type

field hf_config_kwargs: Dict [Optional]

Optional kwargs that override values in the Hugging Face model config object created by AutoConfig.from_pretrained(model.name_or_path).
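In Hugging Face transformers, keyword arguments passed to AutoConfig.from_pretrained that match existing config attributes override the loaded values. To stay runnable offline, the sketch below models that override step on a plain dict, loosely following the same idea: known keys are replaced and unknown keys are reported back. apply_config_overrides and the config keys in the demo are illustrative assumptions, not how ArcticTraining applies hf_config_kwargs internally.

```python
from typing import Any, Dict, Tuple


def apply_config_overrides(
    base: Dict[str, Any], overrides: Dict[str, Any]
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Override keys that already exist in `base`; report the rest as unused."""
    merged = dict(base)  # never mutate the loaded config
    unused: Dict[str, Any] = {}
    for key, value in overrides.items():
        if key in merged:
            merged[key] = value
        else:
            unused[key] = value
    return merged, unused


loaded = {"max_position_embeddings": 4096, "rope_theta": 10000.0}
print(apply_config_overrides(loaded, {"max_position_embeddings": 8192}))
# ({'max_position_embeddings': 8192, 'rope_theta': 10000.0}, {})
```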

pydantic model arctic_training.config.optimizer.OptimizerConfig

Bases: BaseConfig

JSON schema:
{
   "title": "OptimizerConfig",
   "type": "object",
   "properties": {
      "local_rank": {
         "title": "Local Rank",
         "type": "integer"
      },
      "global_rank": {
         "title": "Global Rank",
         "type": "integer"
      },
      "world_size": {
         "title": "World Size",
         "type": "integer"
      },
      "type": {
         "default": "",
         "description": "Optimizer factory type. Defaults to the `optimizer_factory_type` of the trainer. ",
         "title": "Type",
         "type": "string"
      },
      "weight_decay": {
         "default": 0.1,
         "description": "Coefficient for L2 regularization applied to the optimizer's weights. ",
         "minimum": 0.0,
         "title": "Weight Decay",
         "type": "number"
      },
      "betas": {
         "default": [
            0.9,
            0.999
         ],
         "description": "Tuple of coefficients used for computing running averages of gradient and its square (e.g., (beta1, beta2) for Adam). ",
         "maxItems": 2,
         "minItems": 2,
         "prefixItems": [
            {
               "type": "number"
            },
            {
               "type": "number"
            }
         ],
         "title": "Betas",
         "type": "array"
      },
      "lr": {
         "default": 0.0005,
         "description": "The initial learning rate. ",
         "minimum": 0.0,
         "title": "Lr",
         "type": "number"
      }
   },
   "additionalProperties": false
}

Config:
  • extra: str = forbid

  • use_enum_values: bool = True

  • validate_default: bool = True

  • use_attribute_docstrings: bool = True

  • populate_by_name: bool = True

  • validate_by_alias: bool = True

  • validate_by_name: bool = True

Fields:
field type: str = ''

Optimizer factory type. Defaults to the optimizer_factory_type of the trainer.

field weight_decay: Annotated[float] = 0.1

Coefficient for weight decay (L2-style regularization) applied to the model parameters being optimized.

Constraints:
  • ge = 0.0

  • func = parse_human_val

field betas: Tuple[float, float] = (0.9, 0.999)

Tuple of coefficients used for computing running averages of the gradient and its square (e.g., (beta1, beta2) for Adam).

field learning_rate: Annotated[float] = 0.0005 (alias 'lr')

The initial learning rate.

Constraints:
  • ge = 0.0

  • func = parse_human_val
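Because populate_by_name, validate_by_alias, and validate_by_name are all enabled, the learning rate can be written either as lr (the alias) or as learning_rate. The Python sketch below reproduces the documented rules on a plain dict: alias mapping, rejection of unknown keys, the ge = 0.0 constraints, and the two-element betas tuple. The function name normalize_optimizer_section is hypothetical, and refusing input that sets both lr and learning_rate is a design choice of this sketch, not documented library behavior. It accepts plain numbers only; human-friendly strings are described under Numerical Formatting.

```python
from typing import Any, Dict

OPTIMIZER_DEFAULTS: Dict[str, Any] = {
    "type": "",
    "weight_decay": 0.1,
    "betas": (0.9, 0.999),
    "learning_rate": 0.0005,
}


def normalize_optimizer_section(section: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical sketch of the documented OptimizerConfig rules, not the library's code."""
    section = dict(section)
    if "lr" in section:
        # 'lr' is the alias of 'learning_rate'; this sketch refuses ambiguous input.
        if "learning_rate" in section:
            raise ValueError("set either 'lr' or 'learning_rate', not both")
        section["learning_rate"] = section.pop("lr")
    unknown = set(section) - set(OPTIMIZER_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown optimizer keys: {sorted(unknown)}")
    merged = {**OPTIMIZER_DEFAULTS, **section}
    # Both numeric fields carry a ge=0.0 constraint.
    for key in ("learning_rate", "weight_decay"):
        if float(merged[key]) < 0.0:
            raise ValueError(f"{key} must be >= 0.0")
    # betas must be exactly two numbers: (beta1, beta2).
    merged["betas"] = tuple(float(b) for b in merged["betas"])
    if len(merged["betas"]) != 2:
        raise ValueError("betas must contain exactly two values")
    return merged


# Both spellings produce the same normalized section.
print(normalize_optimizer_section({"lr": 1e-5}))
print(normalize_optimizer_section({"learning_rate": 1e-5, "weight_decay": 0.0}))
```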

pydantic model arctic_training.config.scheduler.SchedulerConfig

Bases: BaseConfig

{
   "title": "SchedulerConfig",
   "type": "object",
   "properties": {
      "local_rank": {
         "title": "Local Rank",
         "type": "integer"
      },
      "global_rank": {
         "title": "Global Rank",
         "type": "integer"
      },
      "world_size": {
         "title": "World Size",
         "type": "integer"
      },
      "type": {
         "default": "",
         "description": "Scheduler factory type. Defaults to the `scheduler_factory_type` of the trainer. ",
         "title": "Type",
         "type": "string"
      },
      "lr": {
         "anyOf": [
            {
               "type": "number"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "The initial learning rate. Deprecated in favor of `optimizer.learning_rate`. ",
         "title": "Lr"
      }
   },
   "additionalProperties": false
}

Config:
  • extra: str = forbid

  • use_enum_values: bool = True

  • validate_default: bool = True

  • use_attribute_docstrings: bool = True

  • populate_by_name: bool = True

  • validate_by_alias: bool = True

  • validate_by_name: bool = True

Validators:
  • _deprecated_learning_rate (learning_rate)

Fields:
field type: str = ''

Scheduler factory type. Defaults to the scheduler_factory_type of the trainer.

field learning_rate: Optional[float] = None (alias 'lr')

The initial learning rate. Deprecated in favor of optimizer.learning_rate.

Validated by:
  • _deprecated_learning_rate
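Since scheduler.lr is deprecated in favor of optimizer.learning_rate, older config files can be updated by moving the value into the optimizer section. The Python sketch below does that on a config loaded as a plain dict. The helper name migrate_scheduler_lr is hypothetical, and it does not describe what the _deprecated_learning_rate validator itself does; raising an error when both sections set a learning rate is a choice made for this sketch.

```python
import copy
import warnings
from typing import Any, Dict


def migrate_scheduler_lr(config: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: move a deprecated scheduler learning rate to the optimizer section."""
    config = copy.deepcopy(config)  # never mutate the caller's config
    scheduler = config.get("scheduler") or {}
    # Accept either the alias ('lr') or the field name ('learning_rate').
    alias_value = scheduler.pop("lr", None)
    name_value = scheduler.pop("learning_rate", None)
    value = alias_value if alias_value is not None else name_value
    if value is None:
        return config  # nothing to migrate
    optimizer = config.setdefault("optimizer", {})
    if "lr" in optimizer or "learning_rate" in optimizer:
        raise ValueError("learning rate set in both scheduler and optimizer sections")
    warnings.warn("scheduler.lr is deprecated; use optimizer.lr", DeprecationWarning)
    optimizer["lr"] = value
    return config


old = {"scheduler": {"lr": 1e-4}, "optimizer": {"weight_decay": 0.0}}
new = migrate_scheduler_lr(old)
print(new["optimizer"])  # {'weight_decay': 0.0, 'lr': 0.0001}
```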

pydantic model arctic_training.config.tokenizer.TokenizerConfig

Bases: BaseConfig

{
   "title": "TokenizerConfig",
   "type": "object",
   "properties": {
      "local_rank": {
         "title": "Local Rank",
         "type": "integer"
      },
      "global_rank": {
         "title": "Global Rank",
         "type": "integer"
      },
      "world_size": {
         "title": "World Size",
         "type": "integer"
      },
      "type": {
         "default": "",
         "description": "Tokenizer factory type. Defaults to the `tokenizer_factory_type` of the trainer. ",
         "title": "Type",
         "type": "string"
      },
      "name_or_path": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "format": "path",
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": "",
         "description": "Tokenizer name (as described in Hugging Face model hub) or local path directory containing tokenizer. ",
         "title": "Name Or Path"
      },
      "tokenize_kwargs": {
         "additionalProperties": true,
         "description": "Optional kwargs to be passed to tokenizer.tokenize in addition or to override the default values passed in the corresponding data factory's process function",
         "title": "Tokenize Kwargs",
         "type": "object"
      }
   },
   "additionalProperties": false
}

Config:
  • extra: str = forbid

  • use_enum_values: bool = True

  • validate_default: bool = True

  • use_attribute_docstrings: bool = True

  • populate_by_name: bool = True

  • validate_by_alias: bool = True

  • validate_by_name: bool = True

Fields:
field type: str = ''

Tokenizer factory type. Defaults to the tokenizer_factory_type of the trainer.

field name_or_path: Union[str, Path, None] = ''

Tokenizer name (as listed on the Hugging Face model hub) or local path to a directory containing the tokenizer.

field tokenize_kwargs: Dict [Optional]

Optional keyword arguments passed to tokenizer.tokenize, either in addition to or overriding the default values passed by the corresponding data factory’s process function.
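The "in addition to or overriding" semantics of tokenize_kwargs behave like an ordinary dict merge in which the user-supplied keys win. The Python sketch below illustrates that reading. The helper build_tokenize_kwargs, the factory defaults, and the model id are hypothetical; the real defaults depend on the data factory, and the exact merge performed inside ArcticTraining is an assumption.

```python
from typing import Any, Dict, Optional


def build_tokenize_kwargs(
    factory_defaults: Dict[str, Any], tokenize_kwargs: Optional[Dict[str, Any]]
) -> Dict[str, Any]:
    """Hypothetical merge: user-supplied kwargs extend or override the factory defaults."""
    # In a dict merge, later keys win, so tokenize_kwargs takes precedence.
    return {**factory_defaults, **(tokenize_kwargs or {})}


# Hypothetical defaults a data factory might pass; real values depend on the factory.
defaults = {"add_special_tokens": False, "truncation": False}
tokenizer_section = {
    "name_or_path": "my-org/my-model",  # hypothetical tokenizer id
    "tokenize_kwargs": {"truncation": True, "max_length": 4096},
}

print(build_tokenize_kwargs(defaults, tokenizer_section["tokenize_kwargs"]))
# {'add_special_tokens': False, 'truncation': True, 'max_length': 4096}
```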

pydantic model arctic_training.config.wandb.WandBConfig

Bases: BaseConfig

{
   "title": "WandBConfig",
   "type": "object",
   "properties": {
      "local_rank": {
         "title": "Local Rank",
         "type": "integer"
      },
      "global_rank": {
         "title": "Global Rank",
         "type": "integer"
      },
      "world_size": {
         "title": "World Size",
         "type": "integer"
      },
      "enable": {
         "default": false,
         "description": "Whether to enable Weights and Biases logging. ",
         "title": "Enable",
         "type": "boolean"
      },
      "entity": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Weights and Biases entity name. ",
         "title": "Entity"
      },
      "project": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": "arctic-training",
         "description": "Weights and Biases project name. ",
         "title": "Project"
      },
      "name": {
         "anyOf": [
            {
               "type": "string"
            },
            {
               "type": "null"
            }
         ],
         "default": null,
         "description": "Weights and Biases run name. ",
         "title": "Name"
      }
   },
   "additionalProperties": false
}

Config:
  • extra: str = forbid

  • use_enum_values: bool = True

  • validate_default: bool = True

  • use_attribute_docstrings: bool = True

  • populate_by_name: bool = True

  • validate_by_alias: bool = True

  • validate_by_name: bool = True

Fields:
field enable: bool = False

Whether to enable Weights and Biases logging.

field entity: Optional[str] = None

Weights and Biases entity name.

field project: Optional[str] = 'arctic-training'

Weights and Biases project name.

field name: Optional[str] = None

Weights and Biases run name.
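The entity, project, and name fields correspond to parameters of the same names accepted by wandb.init. The Python sketch below turns a wandb config section into keyword arguments for that call, skipping unset values and returning None when logging is disabled. The helper wandb_init_kwargs and the run name are hypothetical; whether ArcticTraining forwards the fields in exactly this way is an assumption. The sketch does not import wandb, so it runs without it.

```python
from typing import Any, Dict, Optional

WANDB_DEFAULTS: Dict[str, Any] = {
    "enable": False,
    "entity": None,
    "project": "arctic-training",
    "name": None,
}


def wandb_init_kwargs(section: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Hypothetical helper: translate a wandb config section into wandb.init() kwargs."""
    unknown = set(section) - set(WANDB_DEFAULTS)
    if unknown:  # mirrors extra = forbid
        raise ValueError(f"unknown wandb keys: {sorted(unknown)}")
    cfg = {**WANDB_DEFAULTS, **section}
    if not cfg["enable"]:
        return None  # logging disabled, so wandb.init would not be called
    # Drop unset (None) values so wandb falls back to its own defaults.
    return {k: cfg[k] for k in ("entity", "project", "name") if cfg[k] is not None}


print(wandb_init_kwargs({}))  # None
print(wandb_init_kwargs({"enable": True, "name": "sft-run-1"}))
# {'project': 'arctic-training', 'name': 'sft-run-1'}
```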

Numerical Formatting

When specifying numerical values in the configuration file, you can use human-friendly strings to represent very large or very small numbers. The following formats are supported:

  • X%: This format represents a percentage. For example, 50% is equivalent to 0.5.

  • XeY: This format represents a number in scientific notation. For example, 1e-6 is equivalent to 0.000001.

  • X^Y: This format represents a number raised to a power. For example, 2^20 is equivalent to 1048576.

  • XK: This format represents a number in thousands (base 10). For example, 1K is equivalent to 1000. Similarly, you can use M for millions, B for billions, and T for trillions.

  • XKi: This format represents a number in units of 2^10 (binary prefix, as in kibibytes). For example, 1Ki is equivalent to 1024. Similarly, you can use Mi (2^20), Gi (2^30), and Ti (2^40).
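The formats above can be parsed with a few string checks and one regular expression. The Python sketch below is a hypothetical reimplementation, not ArcticTraining's parse_human_val: the function name parse_human_value, its case sensitivity, and the choice to return an int whenever a suffixed result is whole are assumptions made for illustration.

```python
import re
from typing import Union

# Decimal (K, M, B, T) and binary (Ki, Mi, Gi, Ti) multipliers from the list above.
_SUFFIXES = {
    "K": 10**3, "M": 10**6, "B": 10**9, "T": 10**12,
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
}
# Binary suffixes are listed first so "1Ki" is not read as "1K" followed by "i".
_SUFFIX_RE = re.compile(r"^([0-9]*\.?[0-9]+)(Ki|Mi|Gi|Ti|K|M|B|T)$")


def parse_human_value(value: Union[str, int, float]) -> Union[int, float]:
    """Hypothetical parser for human-friendly numbers; not the library's parse_human_val."""
    if not isinstance(value, str):
        return value  # already numeric
    text = value.strip()
    if text.endswith("%"):
        return float(text[:-1]) / 100  # "50%" -> 0.5
    if "^" in text:
        base, exponent = text.split("^", 1)
        return int(base) ** int(exponent)  # "2^20" -> 1048576
    match = _SUFFIX_RE.match(text)
    if match:
        number, suffix = match.groups()
        result = float(number) * _SUFFIXES[suffix]
        return int(result) if result.is_integer() else result
    # Plain numbers and scientific notation such as "1e-6" are handled by float().
    return float(text)


print(parse_human_value("50%"))   # 0.5
print(parse_human_value("1e-6"))  # 1e-06
print(parse_human_value("2^20"))  # 1048576
print(parse_human_value("1K"))    # 1000
print(parse_human_value("1Ki"))   # 1024
```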