.. _usage: ===== Usage ===== After :ref:`installation `, you can use ArcticTraining to train your models using a simple YAML recipe or a Python script. Here we provide an overview of how to use each. ArcticTraining CLI ------------------ The ArcticTraining CLI is the easiest way to train your models using ArcticTraining and supports the use of custom trainers, data, etc. to meet your specific requirements. To train a model using the ArcticTraining CLI, follow these steps: 1. Create a training recipe YAML file with the necessary configuration options. For example, you can create a recipe to train a model using the `meta-llama/Llama-3.1-8B-Instruct `_ model and the `HuggingFaceH4/ultrachat_200k `_ dataset: .. code-block:: yaml model: name_or_path: meta-llama/Llama-3.1-8B-Instruct data: sources: - HuggingFaceH4/ultrachat_200k checkpoint: - type: huggingface save_end_of_training: true output_dir: ./fine-tuned-model 2. Optionally create a custom trainer by subclassing the ``Trainer`` or ``SFTTrainer`` classes and implementing the necessary modifications. For example, you could create a new trainer from ``SFTTrainer`` that uses a different loss function: .. code-block:: python from arctic_training import SFTTrainer class CustomTrainer(SFTTrainer): name = "my_custom_trainer" def loss(self, batch): # Custom loss function implementation return loss This new trainer will be automatically registered with ArcticTraining when the script containing the declaration of ``CustomTrainer`` is imported. By default, ArcticTraining looks for a ``train.py`` in the directory where the YAML training recipe is located to find custom trainers. You can also specify a custom path to the trainers with the ``code`` field in your training recipe: .. code-block:: yaml type: my_custom_trainer code: path/to/custom_trainers.py model: name_or_path: meta-llama/Llama-3.1-8B-Instruct data: sources: - HuggingFaceH4/ultrachat_200k You may also wish to create a new model factory, data factory, etc. to accompany your new trainer. This can also be done in the same python script and these classes will automatically be registered as well: .. code-block:: python from arctic_training import HFModelFactory, SFTTrainer class CustomModelFactory(HFModelFactory): name = "my_custom_model_factory" def create_model(self, config): # Custom model implementation return model class CustomTrainer(SFTTrainer): name = "my_custom_trainer" model_factory: CustomModelFactory def loss(self, batch): # Custom loss function implementation return loss 3. Run the training recipe with the ArcticTraining CLI: .. code-block:: bash arctic_training path/to/recipe.yaml Under the hood our CLI will load the recipe, instantiate the trainer, model, etc. and start training. Our CLI launcher uses the DeepSpeed launcher to create a distributed training environment. You can pass any DeepSpeed arguments after the training recipe path. For example, to train on 4 GPUs, you can run: .. code-block:: bash arctic_training path/to/recipe.yaml --num_gpus 4 Python API ---------- ArcticTraining also provides a Python API that can be used to setup trainer and train your model. Here we show the same example as above but using the Python API: .. code-block:: python from arctic_training import HFModelFactory, SFTTrainer, get_config class CustomModelFactory(HFModelFactory): name = "my_custom_model_factory" def create_model(self, config): # Custom model implementation return model class CustomTrainer(SFTTrainer): name = "my_custom_trainer" model_factory: CustomModelFactory def loss(self, batch): # Custom loss function implementation return loss if __name__ == "__main__": config_dict = { "type": "my_custom_trainer", "model": { "name_or_path": "meta-llama/Llama-3.1-8B-Instruct" }, "data": { "sources": ["HuggingFaceH4/ultrachat_200k"] } "checkpoint": [ { "type": "huggingface", "save_end_of_training": True, "output_dir": "./fine-tuned-model" } ] } config = get_config(config_dict) trainer = CustomTrainer(config) trainer.train() Datasets ---------- How to use a dataset of your choice. Since there is no standard to how each dataset is defined it's not always easy to write a generic API that will work with any dataset. SFT Datasets ============ While one could write a class for any new dataset, we have designed a flexible dataset type that should allow to remap many existing Instruct/SFT datasets to this dataset type: The ``role_mapping`` dict indicates how to locate the role and content within the dataset structure. We accept two types of inputs: 1. ``{role_name} : {column_name}`` 2. ``{role_name} : {column_name.filter_field.filter_value}`` Additionally ``content_key`` can be used when a deep structure with complex columns is used and the value name needs remapping, see example 5 below for such a use-case. Examples: 1. Dataset structure: .. code:: python {"user": "What is the capital of France?", "assistant": "The capital of France is Paris."} Config: .. code:: yaml data: sources: - type: huggingface_instruct name_or_path: Josephgflowers/Finance-Instruct-500k split: train See https://huggingface.co/datasets/Josephgflowers/Finance-Instruct-500k 2. Dataset structure: .. code:: python {"instruction": "What is the capital of France?", "demonstration": "The capital of France is Paris."} Config: .. code:: yaml data: sources: - type: huggingface_instruct name_or_path: HuggingFaceH4/helpful-instructions split: train sample_count: 1000 role_mapping: user: instruction assistant: demonstration See https://huggingface.co/datasets/HuggingFaceH4/helpful-instructions 3. Dataset structure: .. code:: python {"messages": [{"role": "user", "content": "Hello world"}, {"role": "assistant", "content": "Hi there"}]} Config: .. code:: yaml data: sources: - type: huggingface_instruct name_or_path: HuggingFaceH4/ultrachat_200k split: train_sft role_mapping: user: messages.role.user assistant: messages.role.assistant See https://huggingface.co/datasets/HuggingFaceH4/ultrachat_200k 4. Dataset structure: .. code:: python {"conversations": [{"role": "human", "content": "Hello world"}, {"role": "agent", "content": "Hi there"}]} Config: .. code:: yaml data: sources: - type: huggingface_instruct name_or_path: /path/to/data role_mapping: user: conversations.role.human assistant: conversations.role.agent 5. Dataset structure: .. code:: python {"conversations": [{"sender": "system", "message": "Hello world"}, {"sender": "user, "message": "Hi there"}]} Config: .. code:: yaml data: sources: - type: huggingface_instruct name_or_path: recursal/Europarl-Translation-Instruct split: full sample_count: 1000 role_mapping: user: conversations.sender.system assistant: conversations.sender.user content_key: message https://huggingface.co/datasets/recursal/Europarl-Translation-Instruct additionally has the user/assistant roles reversed, so we can easily fix it in our remapping.