ArcticSynth

ArcticSynth is a Python client for batch data synthesis. It supports several inference services behind a common interface and manages the full life cycle of a batch task: adding requests, saving, uploading, submitting, retrieving, and downloading. For (Azure) OpenAI, it provides both synchronous and asynchronous workflows. You can submit a task and monitor its execution yourself, or run it and wait for the results as if you’re using an online inference service, but 50% cheaper and much faster for large workloads. The sections below walk through each step.

Installation

ArcticSynth is offered as part of ArcticTraining. Installing ArcticTraining will also install the ArcticSynth CLI. You will only need ArcticSynth CLI if you’re using (Azure) OpenAI services.

cd ArcticTraining

# If using OpenAI/Azure OpenAI
pip install -e .

# If using Snowflake Cortex
pip install -e '.[cortex]'

# If using vLLM
pip install -e '.[vllm]'

Usage

Supported Services

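ArcticSynth currently provides the following classes, one per supported service:

  • OpenAISynth: OpenAI platform

  • AzureOpenAISynth: Azure OpenAI

  • CortexSynth: Snowflake Cortex

  • VllmSynth: local vLLM instance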

Initialization

To initialize the ArcticSynth client, you need to provide your API key and other necessary configurations. For example, with Azure OpenAI:

from arctic_training.synth import AzureOpenAISynth
import os

AZURE_OPENAI_API_KEY = os.getenv("AZURE_OPENAI_API_KEY")

client = AzureOpenAISynth(
    api_key=AZURE_OPENAI_API_KEY,
    api_version="2024-07-01-preview",
    azure_endpoint="https://<your-endpoint-url>",
)

Optionally, you can set batch_size to change how many requests each batch contains. The default for AzureOpenAISynth is 100,000, which is the maximum allowed by Azure OpenAI. If your requests are large, for example because they contain encoded images, set batch_size to a smaller number so that each batch file does not become too large. ArcticSynth automatically partitions the requests into batch files and merges the downloaded result files.
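To make the effect of batch_size concrete, here is a minimal sketch of the partitioning idea in plain Python. It does not use ArcticSynth and is not its implementation; the partition helper and the sample data are hypothetical and only illustrate how a list of requests splits into chunks of at most batch_size items and how results can be concatenated back in order.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def partition(requests: Sequence[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive chunks holding at most batch_size requests each."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(requests), batch_size):
        yield list(requests[start:start + batch_size])


# 250 placeholder requests split with batch_size=100.
chunks = list(partition(list(range(250)), batch_size=100))
chunk_sizes = [len(chunk) for chunk in chunks]

# Merging is the inverse step: concatenate the per-chunk results in order.
merged = [item for chunk in chunks for item in chunk]
```

With a smaller batch_size, the same requests are spread over more, smaller files, which is the trade-off described above for large payloads.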

Note

Configurations vary across services and API providers. Check the class definitions in this document, or refer to the Snowflake Cortex example and the vLLM example.

Adding Tasks

You can add chat tasks to a batch using the add_chat_to_batch_task method. The parameters are consistent with the original OpenAI API.

client.add_chat_to_batch_task(
    task_name="test_task",
    model="<model-name>",
    messages=[
        {"role": "user", "content": "Hello world!"},
    ],
)
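For orientation, the OpenAI Batch API expects its input file as JSONL, one request per line, with the fields custom_id, method, url, and body. The sketch below builds such a line for the request above. It assumes, without confirmation from this document, that a saved batch task uses this shape; the make_batch_line helper and the custom_id value are hypothetical, and the endpoint path may differ between providers.

```python
import json
from typing import Dict, List


def make_batch_line(
    custom_id: str,
    model: str,
    messages: List[Dict[str, str]],
    url: str = "/v1/chat/completions",
) -> str:
    """Serialize one chat request as a single JSONL line."""
    request = {
        "custom_id": custom_id,  # used to match results back to requests
        "method": "POST",
        "url": url,
        "body": {"model": model, "messages": messages},
    }
    return json.dumps(request, ensure_ascii=False)


line = make_batch_line(
    custom_id="test_task-0",  # hypothetical naming scheme
    model="<model-name>",
    messages=[{"role": "user", "content": "Hello world!"}],
)
decoded = json.loads(line)
```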

When initializing the client, you can also set work_dir to a different path. The default is ./batch_work_dir.

Synchronous: Executing Tasks

For all services, we provide a synchronous method to execute the task. For (Azure) OpenAI, this method will submit the task to the service and wait for the result. For other services, it will run the task locally and return the result.

execution_results = client.execute_batch_task("test_task")

You can parse the results with the extract_messages_from_responses static method:

parsed_results = client.extract_messages_from_responses(execution_results)
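The exact return type of extract_messages_from_responses is not specified here. As a rough illustration of this kind of extraction, the sketch below pulls the assistant message out of records shaped like OpenAI Batch API output lines (custom_id, response.body.choices, error). The extract_contents helper and the sample records are hypothetical and not part of ArcticSynth.

```python
import json
from typing import Dict, Optional


def extract_contents(jsonl_text: str) -> Dict[str, Optional[str]]:
    """Map each custom_id to its first assistant message, or None on failure."""
    contents: Dict[str, Optional[str]] = {}
    for raw in jsonl_text.splitlines():
        if not raw.strip():
            continue  # skip blank lines
        record = json.loads(raw)
        body = (record.get("response") or {}).get("body") or {}
        choices = body.get("choices") or []
        contents[record["custom_id"]] = (
            choices[0]["message"]["content"] if choices else None
        )
    return contents


# Made-up sample output: one successful request and one failed request.
sample_output = "\n".join([
    json.dumps({
        "id": "batch_req_1",
        "custom_id": "test_task-0",
        "response": {
            "status_code": 200,
            "body": {"choices": [
                {"message": {"role": "assistant", "content": "Hi there!"}}
            ]},
        },
        "error": None,
    }),
    json.dumps({
        "id": "batch_req_2",
        "custom_id": "test_task-1",
        "response": None,
        "error": {"code": "example_error", "message": "made-up failure"},
    }),
])
contents = extract_contents(sample_output)
```

Keying the results by custom_id matters because batch output is not guaranteed to come back in the order the requests were submitted.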

Asynchronous: Saving, Uploading, and Submitting Tasks

Note

This section is only applicable to (Azure) OpenAI.

As an alternative to the synchronous execute_batch_task method, you can submit the task to (Azure) OpenAI and download the results later. To do so, after adding tasks, save, upload, and submit them:

client.save_batch_task("test_task")
client.upload_batch_task("test_task")
client.submit_batch_task("test_task")

Asynchronous: Retrieving and Downloading Tasks

Note

This section is only applicable to (Azure) OpenAI.

You can retrieve the status of your batch tasks and download the results when they are ready.

client.retrieve_batch_task("test_task")
client.download_batch_task("test_task")
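If you want to wait for completion yourself between retrieving and downloading, a polling loop is one option. The sketch below is a generic pattern, not ArcticSynth code: wait_for_batch and get_status are hypothetical names, get_status stands in for whatever you use to read the batch status, and the status strings follow the OpenAI Batch API. How retrieve_batch_task reports status is not described here, so adapting it is left as an assumption.

```python
import time
from typing import Callable

# Batch API states after which the status no longer changes.
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(
    get_status: Callable[[], str],
    polling_interval: float = 30.0,
    max_polls: int = 1000,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status until a terminal status is seen, then return it."""
    for _ in range(max_polls):
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(polling_interval)
    raise TimeoutError(f"batch not finished after {max_polls} polls")


# Demo with a scripted status sequence and a no-op sleep.
scripted = iter(["validating", "in_progress", "finalizing", "completed"])
waits = []
final_status = wait_for_batch(lambda: next(scripted), sleep=waits.append)
```

Only download results when the final status is completed; the other terminal states mean there may be partial results or none.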

Command Line Interface (CLI)

Note

This section is only applicable to (Azure) OpenAI.

ArcticSynth also provides a command line interface for managing batch tasks.

Usage

arctic_synth -t <task_name> [options]

Options

  • -c, --credential: Credential file path (default: ~/.arctic_synth/credentials/default.yaml). This file is saved automatically when you use ArcticSynth to add requests to a task, so you normally don’t need to set it.

  • -w, --work_dir: Work directory (default: ./batch_work_dir)

  • -t, --task: Task name (required for most operations)

  • -u, --upload: Upload task to Azure OpenAI

  • -s, --submit: Submit task to Azure OpenAI

  • -r, --retrieve: Retrieve task status from Azure OpenAI.

  • -d, --download: Download task from Azure OpenAI. The downloaded files are placed in downloads, and a merged JSONL file, results.jsonl, is written to your task directory.

  • --clean_files_older_than_n_days: Clean files older than n days in the (Azure) OpenAI file storage.

Warning

--clean_files_older_than_n_days will clean all files (not just yours) older than n days in the (Azure) OpenAI file storage. Use with extra caution.
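To see why this warning matters, consider a selection rule based only on file age. The sketch below is a hypothetical illustration, not the CLI's implementation: files_older_than is a made-up helper, and the records mimic the id and created_at (Unix seconds) fields of OpenAI file objects. Nothing in the rule checks who uploaded a file.

```python
import time
from typing import Dict, List, Optional, Union

SECONDS_PER_DAY = 86400

FileRecord = Dict[str, Union[str, int]]


def files_older_than(
    files: List[FileRecord], n_days: int, now: Optional[float] = None
) -> List[str]:
    """Return ids of files created more than n_days before now."""
    now = time.time() if now is None else now
    cutoff = now - n_days * SECONDS_PER_DAY
    return [str(f["id"]) for f in files if int(f["created_at"]) < cutoff]


# Made-up listing: one file is 10 days old, the other is 1 day old.
NOW = 1_700_000_000
listing = [
    {"id": "file-a", "created_at": NOW - 10 * SECONDS_PER_DAY},
    {"id": "file-b", "created_at": NOW - 1 * SECONDS_PER_DAY},
]
to_delete = files_older_than(listing, n_days=7, now=NOW)
```

In a shared resource, file-a would be selected even if a colleague uploaded it, so it is worth listing the candidates before deleting anything.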

Synthesizer Classes

class arctic_training.synth.OpenAISynth(work_dir='./batch_work_dir', credential_path=None, save_to_credential_file='~/.arctic_synth/credentials/default.yaml', batch_size=50000, polling_interval=30, file_expiry_seconds=2592000, *args, **kwargs)[source]

Bases: OpenAI, OpenAIBatchProcessor

OpenAI Synthesizer. This class is a wrapper around the OpenAI Python SDK. It manages batch processing by maintaining a work directory to store batch tasks and results. This class works with the OpenAI platform (https://platform.openai.com/). Please refer to OpenAIBatchProcessor for methods.

Parameters:
  • work_dir (str)

  • credential_path (str | None)

  • save_to_credential_file (str)

  • batch_size (int)

  • polling_interval (int)

  • file_expiry_seconds (int)

add_chat_to_batch_task(task_name, **kwargs)

Add a chat completion request to the batch task.

cancel_batch_task(task_name)

Cancel batch task from (Azure) OpenAI Batch API.

download_batch_task(task_name)

Download batch task results from (Azure) OpenAI Files API.

execute_batch_task(task_name, print_status=False)

A synchronous method to execute the batch task. This method will block until the task is completed. The batch task is executed by sequentially saving, uploading, submitting, polling, and downloading.

static extract_messages_from_responses(responses)

Extract the response from the response object.

retrieve_batch_task(task_name=None, batch_ids=None, print_status=True)

Retrieve batch task from (Azure) OpenAI Batch API.

retrieve_uploaded_files(task_name=None, file_ids=None, print_status=True)

Retrieve uploaded files from (Azure) OpenAI Files API.

save_batch_task(task_name)

Save batch task to the work directory.

submit_batch_task(task_name, file_ids=None)

Submit batch task to (Azure) OpenAI Batch API.

upload_batch_task(task_name)

Upload batch task to (Azure) OpenAI Files API.

class arctic_training.synth.AzureOpenAISynth(work_dir='./batch_work_dir', credential_path=None, save_to_credential_file='~/.arctic_synth/credentials/default.yaml', batch_size=100000, polling_interval=30, file_expiry_seconds=2592000, *args, **kwargs)[source]

Bases: AzureOpenAI, OpenAIBatchProcessor

Azure OpenAI Synthesizer. This class is a wrapper around the OpenAI Python SDK. It manages batch processing by maintaining a work directory to store batch tasks and results. This class works with Azure OpenAI (https://oai.azure.com/). Please refer to OpenAIBatchProcessor for methods.

Parameters:
  • work_dir (str)

  • credential_path (str | None)

  • save_to_credential_file (str)

  • batch_size (int)

  • polling_interval (int)

  • file_expiry_seconds (int)

add_chat_to_batch_task(task_name, **kwargs)

Add a chat completion request to the batch task.

cancel_batch_task(task_name)

Cancel batch task from (Azure) OpenAI Batch API.

download_batch_task(task_name)

Download batch task results from (Azure) OpenAI Files API.

execute_batch_task(task_name, print_status=False)

A synchronous method to execute the batch task. This method will block until the task is completed. The batch task is executed by sequentially saving, uploading, submitting, polling, and downloading.

static extract_messages_from_responses(responses)

Extract the response from the response object.

retrieve_batch_task(task_name=None, batch_ids=None, print_status=True)

Retrieve batch task from (Azure) OpenAI Batch API.

retrieve_uploaded_files(task_name=None, file_ids=None, print_status=True)

Retrieve uploaded files from (Azure) OpenAI Files API.

save_batch_task(task_name)

Save batch task to the work directory.

submit_batch_task(task_name, file_ids=None)

Submit batch task to (Azure) OpenAI Batch API.

upload_batch_task(task_name)

Upload batch task to (Azure) OpenAI Files API.

class arctic_training.synth.CortexSynth(connection_params, work_dir=None)[source]

Bases: InMemoryBatchProcessor

Cortex Synthesizer. This class calls Snowflake Cortex complete service.

add_chat_to_batch_task(task_name, model, messages, options={'temperature': 1, 'top_p': 1})[source]

Add a chat completion request to the batch task.

execute_batch_task(task_name)[source]

A synchronous method to execute the batch task. This method will block until the task is completed.

static extract_messages_from_responses(responses)[source]

Extract the response from the response object.

class arctic_training.synth.VllmSynth(model_params, sampling_params=None, work_dir=None)[source]

Bases: InMemoryBatchProcessor

vLLM Synthesizer. This class initializes a local vLLM instance for fast batch inference. Currently, multi-node inference is not supported.

add_chat_to_batch_task(task_name, messages)[source]

Add a chat completion request to the batch task.

execute_batch_task(task_name)[source]

A synchronous method to execute the batch task. This method will block until the task is completed.

static extract_messages_from_responses(responses)[source]

Extract the response from the response object.