Usage

After installation, you can use ArcticTraining to train your models using a simple YAML recipe or a Python script. Here we provide an overview of how to use each.

ArcticTraining CLI

The ArcticTraining CLI is the easiest way to train your models using ArcticTraining and supports the use of custom trainers, data, etc. to meet your specific requirements. To train a model using the ArcticTraining CLI, follow these steps:

Create a training recipe YAML file with the necessary configuration options. For example, you can create a recipe to train a model using the meta-llama/Llama-3.1-8B-Instruct model and the HuggingFaceH4/ultrachat_200k dataset:

model:
  name_or_path: meta-llama/Llama-3.1-8B-Instruct
data:
  sources:
    - HuggingFaceH4/ultrachat_200k
checkpoint:
  - type: huggingface
    save_end_of_training: true
    output_dir: ./fine-tuned-model

Optionally create a custom trainer by subclassing the Trainer or SFTTrainer classes and implementing the necessary modifications. For example, you could create a new trainer from SFTTrainer that uses a different loss function:

from arctic_training import SFTTrainer

class CustomTrainer(SFTTrainer):
    name = "my_custom_trainer"

    def loss(self, batch):
        # Placeholder: reuse the parent loss, then replace or adjust it
        # with your custom loss computation
        loss = super().loss(batch)
        return loss

This new trainer will be automatically registered with ArcticTraining when the script containing the declaration of CustomTrainer is imported. By default, ArcticTraining looks for a train.py in the directory where the YAML training recipe is located to find custom trainers. You can also specify a custom path to the trainers with the code field in your training recipe:

type: my_custom_trainer
code: path/to/custom_trainers.py
model:
  name_or_path: meta-llama/Llama-3.1-8B-Instruct
data:
  sources:
    - HuggingFaceH4/ultrachat_200k
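
The automatic registration described above can be pictured with a small, self-contained sketch. This is not ArcticTraining's actual implementation; it only illustrates one common way that declaring a subclass with a name attribute can make it discoverable by the value of a type field. The names RegisteredBase, _REGISTRY, get_registered, and BaseTrainer are hypothetical and exist only for this illustration:

```python
# Illustrative sketch only: a name-based registry filled in at class-definition time.
# RegisteredBase, _REGISTRY, get_registered and BaseTrainer are hypothetical names,
# not part of the ArcticTraining API.
from typing import Dict

_REGISTRY: Dict[str, type] = {}


class RegisteredBase:
    name: str = ""

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Register only classes that declare their own non-empty name.
        declared = cls.__dict__.get("name")
        if declared:
            if declared in _REGISTRY:
                raise ValueError(f"duplicate registration: {declared!r}")
            _REGISTRY[declared] = cls


def get_registered(name: str) -> type:
    """Look up a class by the string a recipe would put in its `type` field."""
    try:
        return _REGISTRY[name]
    except KeyError:
        raise KeyError(f"no class registered under {name!r}") from None


class BaseTrainer(RegisteredBase):
    name = "base"


class CustomTrainer(BaseTrainer):
    name = "my_custom_trainer"


# Defining (or importing) the class is enough to make it discoverable by name.
trainer_cls = get_registered("my_custom_trainer")
```

In a scheme like this, the class body runs when the module is imported, which is why the file containing the custom trainer has to be imported (or pointed to) before the recipe's type can be resolved.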

You may also wish to create a new model factory, data factory, etc. to accompany your new trainer. This can be done in the same Python script, and these classes are registered automatically as well:

from arctic_training import HFModelFactory, SFTTrainer

class CustomModelFactory(HFModelFactory):
    name = "my_custom_model_factory"

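    def create_model(self, config):
        # Placeholder: start from the parent factory's model, then apply
        # your custom modifications
        model = super().create_model(config)
        return model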

class CustomTrainer(SFTTrainer):
    name = "my_custom_trainer"
    model_factory: CustomModelFactory

    def loss(self, batch):
        # Placeholder: reuse the parent loss, then replace or adjust it
        # with your custom loss computation
        loss = super().loss(batch)
        return loss

Run the training recipe with the ArcticTraining CLI:
```
arctic_training path/to/recipe.yaml
```
Under the hood, the CLI loads the recipe, instantiates the trainer, model, etc., and starts training.

Our CLI launcher uses the DeepSpeed launcher to create a distributed training environment. You can pass any DeepSpeed arguments after the training recipe path. For example, to train on 4 GPUs, you can run:
arctic_training path/to/recipe.yaml --num_gpus 4
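
As a rough picture of how a wrapper CLI can hand extra arguments to an underlying launcher, the sketch below separates the recipe path from everything that follows it. It is an assumption-laden illustration built on the standard library's argparse, not the ArcticTraining or DeepSpeed source; the function name split_cli_args and the program name launcher_sketch are hypothetical:

```python
# Illustrative sketch only: separate the recipe path from pass-through arguments.
# split_cli_args and "launcher_sketch" are hypothetical names for this example.
import argparse
from typing import List, Tuple


def split_cli_args(argv: List[str]) -> Tuple[str, List[str]]:
    parser = argparse.ArgumentParser(prog="launcher_sketch")
    parser.add_argument("recipe", help="path to the YAML training recipe")
    # parse_known_args returns unrecognized arguments instead of raising an error,
    # so they can be forwarded unchanged to another program.
    known, passthrough = parser.parse_known_args(argv)
    return known.recipe, passthrough


recipe, extra = split_cli_args(["path/to/recipe.yaml", "--num_gpus", "4"])
```

Here recipe holds the recipe path and extra holds the remaining arguments in their original order, ready to be appended to the launcher's command line.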

Python API

ArcticTraining also provides a Python API that can be used to set up a trainer and train your model. Here we show the same example as above, this time using the Python API:

from arctic_training import HFModelFactory, SFTTrainer, get_config

class CustomModelFactory(HFModelFactory):
    name = "my_custom_model_factory"

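    def create_model(self, config):
        # Placeholder: start from the parent factory's model, then apply
        # your custom modifications
        model = super().create_model(config)
        return model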

class CustomTrainer(SFTTrainer):
    name = "my_custom_trainer"
    model_factory: CustomModelFactory

    def loss(self, batch):
        # Placeholder: reuse the parent loss, then replace or adjust it
        # with your custom loss computation
        loss = super().loss(batch)
        return loss

if __name__ == "__main__":
    config_dict = {
        "type": "my_custom_trainer",
        "model": {
            "name_or_path": "meta-llama/Llama-3.1-8B-Instruct"
        },
        "data": {
            "sources": ["HuggingFaceH4/ultrachat_200k"]
        },
        "checkpoint": [
            {
                "type": "huggingface",
                "save_end_of_training": True,
                "output_dir": "./fine-tuned-model"
            }
        ]
    }

    config = get_config(config_dict)
    trainer = CustomTrainer(config)
    trainer.train()

Datasets

This section describes how to use a dataset of your choice. Because there is no standard for how datasets are structured, it is not always easy to write a generic API that works with every dataset.

SFT Datasets

While one could write a class for each new dataset, we have designed a flexible dataset type (huggingface_instruct in the configs below) that lets you remap many existing Instruct/SFT datasets onto it.

The role_mapping dict indicates how to locate the role and content within the dataset structure. We accept two types of inputs:

{role_name} : {column_name}
{role_name} : {column_name.filter_field.filter_value}

Additionally, content_key can be used when the dataset has a deep structure with complex columns and the name of the field holding the message text needs remapping; see the last example below for such a use case.
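
To make the two mapping forms concrete, here is a minimal sketch of how such a mapping could be applied to a single dataset row. It is not the library's implementation: the function remap_record is hypothetical, content_key is passed as a separate argument purely for simplicity, and the sketch assumes column and field names contain no dots:

```python
# Illustrative sketch only: apply a role mapping to one dataset row.
# remap_record is a hypothetical helper, not part of the ArcticTraining API.
from typing import Any, Dict, List, Tuple


def remap_record(
    record: Dict[str, Any],
    role_mapping: Dict[str, str],
    content_key: str = "content",
) -> List[Dict[str, str]]:
    """Return the row as a list of {"role": ..., "content": ...} messages."""
    simple: Dict[str, str] = {}                   # role -> column_name
    nested: Dict[Tuple[str, str, str], str] = {}  # (column, field, value) -> role
    for role, spec in role_mapping.items():
        parts = spec.split(".")
        if len(parts) == 1:
            simple[role] = parts[0]
        elif len(parts) == 3:
            nested[(parts[0], parts[1], parts[2])] = role
        else:
            raise ValueError(f"unsupported mapping for {role!r}: {spec!r}")

    # Form 1: each role's text lives in its own top-level column.
    messages = [{"role": role, "content": record[col]} for role, col in simple.items()]

    # Form 2: a column holds a list of turns; keep the turns in their original order.
    for column in dict.fromkeys(col for col, _, _ in nested):
        for turn in record[column]:
            for (col, field, value), role in nested.items():
                if col == column and turn.get(field) == value:
                    messages.append({"role": role, "content": turn[content_key]})
    return messages


flat = remap_record(
    {"instruction": "What is the capital of France?",
     "demonstration": "The capital of France is Paris."},
    {"user": "instruction", "assistant": "demonstration"},
)
deep = remap_record(
    {"conversations": [{"sender": "system", "message": "Hello world"},
                       {"sender": "user", "message": "Hi there"}]},
    {"user": "conversations.sender.system", "assistant": "conversations.sender.user"},
    content_key="message",
)
```

Both calls produce a user message followed by an assistant message; in the second one the text is read from the message field rather than the default content field.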

Examples:

Dataset structure:

{"user": "What is the capital of France?", "assistant": "The capital of France is Paris."}

Config:

data:
  sources:
    - type: huggingface_instruct
      name_or_path: Josephgflowers/Finance-Instruct-500k
      split: train

See https://huggingface.co/datasets/Josephgflowers/Finance-Instruct-500k

Dataset structure:

{"instruction": "What is the capital of France?", "demonstration": "The capital of France is Paris."}

Config:

data:
  sources:
    - type: huggingface_instruct
      name_or_path: HuggingFaceH4/helpful-instructions
      split: train
      sample_count: 1000
      role_mapping:
        user: instruction
        assistant: demonstration

See https://huggingface.co/datasets/HuggingFaceH4/helpful-instructions

Dataset structure:

{"messages": [{"role": "user", "content": "Hello world"}, {"role": "assistant", "content": "Hi there"}]}

Config:

data:
  sources:
    - type: huggingface_instruct
      name_or_path: HuggingFaceH4/ultrachat_200k
      split: train_sft
      role_mapping:
        user: messages.role.user
        assistant: messages.role.assistant

See https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k

Dataset structure:

{"conversations": [{"role": "human", "content": "Hello world"}, {"role": "agent", "content": "Hi there"}]}

Config:

data:
  sources:
    - type: huggingface_instruct
      name_or_path: /path/to/data
      role_mapping:
        user: conversations.role.human
        assistant: conversations.role.agent

Dataset structure:

{"conversations": [{"sender": "system", "message": "Hello world"}, {"sender": "user", "message": "Hi there"}]}

Config:

data:
  sources:
    - type: huggingface_instruct
      name_or_path: recursal/Europarl-Translation-Instruct
      split: full
      sample_count: 1000
      role_mapping:
        user: conversations.sender.system
        assistant: conversations.sender.user
        content_key: message

See https://huggingface.co/datasets/recursal/Europarl-Translation-Instruct. This dataset additionally has the user/assistant roles reversed, which the role_mapping above corrects.