Training
Now, we'll tackle one of the most exciting capabilities: creating our own specialized AI models. A general-purpose model is a jack-of-all-trades, but what if you need a master of one?
This is where the Training Pipeline comes in. It is a specialized "university" for language models.
The Problem: A Generalist in a Specialist's World
Imagine you have a powerful, general AI that knows about history, science, and art. Now, you want to use it to provide expert-level customer support for your new software product. If a user asks, "How do I configure the Z-widget in version 3.2?", the general AI has no idea. It wasn't trained on your product's documentation.
We need a way to take this smart, general model and send it to "school" to learn a new, specific subject. We need to turn our generalist into a specialist.
The Solution: An AI University
The Training Pipeline is the infrastructure for this AI university. It allows you to fine-tune a general-purpose model on your own private data, turning it into an expert on a particular topic.
You play the role of the university's dean. You provide:
- The Curriculum: A dataset of examples you want the model to learn from (e.g., your product's help documents or past support tickets).
- The Teaching Method: A set of training parameters that define how the model should learn.
The pipeline then manages the entire, resource-intensive training process, from preparing the data to running the "classes" on powerful GPUs, ultimately graduating a new, customized model.
How to Use the Training Pipeline
Using the training pipeline is a three-step process:
- Start a Training Node, the server that runs the university.
- Define your curriculum and teaching method (`TrainingParams`).
- Submit your training job and monitor its progress.
Step 1: Start the Training Node
First, the person providing the computing power needs to start a Training Node. This is done from the command line, just like our other service nodes.
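Your installation's actual launch command may differ; as a minimal sketch, the entry point in `alith/training/server.py` amounts to something like the following (the import path of the FastAPI `app` is an assumption):

```python
# A minimal sketch of the Training Node entry point; the app import path
# is an assumption (the real entry point is alith/training/server.py).
import uvicorn

from alith.training.service import app  # assumed location of the FastAPI app

if __name__ == "__main__":
    # Serve the training API on all interfaces, port 8080.
    uvicorn.run(app, host="0.0.0.0", port=8080)
```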
This starts a server listening on port `8080`, ready to accept new training jobs.
Step 2: Define the "Curriculum" and "Teaching Method"
Now, as a user, you need to define your training job. We do this by creating a set of `TrainingParams`. Let's say we want to teach a small model to be an expert on our product. We'll use a technique called LoRA, a very efficient way to fine-tune a model.
The parameters are defined using Pydantic models from `alith/training/types.py`. Let's create our parameters in Python.
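Here is a hedged example. `data_url` appears later in this chapter, but the other field names and the dataset URL are illustrative assumptions; check `alith/training/types.py` for the authoritative schema.

```python
# A hedged sketch: field names other than data_url are assumptions; see
# alith/training/types.py for the real Pydantic model.
from alith.training.types import TrainingParams

params = TrainingParams(
    model="Qwen2-0.5B",      # base model to fine-tune (assumed field name)
    training_type="lora",    # efficient LoRA fine-tuning (assumed field name)
    data_url="https://example.com/support-docs.jsonl",  # your curriculum (illustrative URL)
    epochs=3,                # passes over the training data (assumed field name)
)
```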
This configuration tells the pipeline: "Take the `Qwen2-0.5B` model and fine-tune it using the `lora` method for `3` epochs on the data found at my URL." There are many more parameters you can tweak, but these are the most important ones to start with.
Step 3: Submit and Monitor the Job
Training can take hours or even days, so we don't wait for it to finish; we submit it as a background job. We'll use the `requests` library to send our `TrainingParams` to the running Training Node.
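A minimal sketch, assuming the node from Step 1 is running locally and that the response returns the job ID under a `job_id` key:

```python
import requests

NODE_URL = "http://localhost:8080"  # the Training Node started in Step 1

# POST the parameters to the /v1/training endpoint. The node accepts the
# job and responds immediately instead of waiting for training to finish.
response = requests.post(f"{NODE_URL}/v1/training", json=params.model_dump())
response.raise_for_status()

job_id = response.json()["job_id"]  # assumed response key
print(f"Submitted training job: {job_id}")
```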
The node replies right away with a job ID. Now the training is running in the background! We can use our `job_id` to check on its progress at any time.
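A simple polling loop, assuming the status JSON carries the `percentage` and `loss` fields described in the next section:

```python
import time

# Poll the status endpoint until the job reports 100% completion.
while True:
    status = requests.get(f"{NODE_URL}/v1/training/{job_id}").json()
    print(f"progress: {status.get('percentage')}%, loss: {status.get('loss')}")
    if status.get("percentage", 0) >= 100:
        break
    time.sleep(30)  # training is slow; there is no need to poll aggressively
```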
You can call this status endpoint periodically until the `percentage` reaches 100. Once it finishes, a new model, fine-tuned on your data, will be saved in the node's output directory.
Under the Hood: The Life of a Training Job
When you submit a training job, a carefully orchestrated process begins.
- Job Submission: Your client sends an HTTP POST request to the `/v1/training` endpoint on the Training Node. The body of the request contains your `TrainingParams`.
- Validation & ID Generation: The server receives the request, quickly validates the parameters, and generates a unique `job_id` (e.g., `a1b2c3d4`).
- Immediate Response: The server immediately sends a `202 Accepted` response back to you, containing the `job_id`. This non-blocking design is crucial because the actual training will take a long time.
- Background Task: The server adds the real training work to a background task queue. This is where the heavy lifting happens.
- Data Preparation: The background trainer downloads your data from the `data_url`, decrypts it if necessary, and prepares it for the model.
- Model Training: The trainer loads the base model and starts the fine-tuning process using your specified parameters. It periodically writes its progress (percentage, loss, etc.) to a log file associated with the `job_id`.
- Status Check: When you make a GET request to `/v1/training/{job_id}`, the server simply reads the latest entry from that specific log file and returns it to you (a sketch of this lookup follows below).
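To make that last step concrete, here is a hedged sketch of the status lookup; storing progress as one JSON file per job is an assumption about the layout, not the project's documented format:

```python
import json
from pathlib import Path

def read_status(job_id: str, log_dir: str = "logs") -> dict:
    """Return the latest progress entry for a job (assumed one-file-per-job layout)."""
    log_file = Path(log_dir) / f"{job_id}.json"
    # Example shape: {"percentage": 42, "loss": 1.73}
    return json.loads(log_file.read_text())
```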
Diving into the Code
The logic is split across a few key files in `alith/training/`.
First, `alith/training/server.py` starts the web server. This is the entry point that listens for your requests.
Next, `alith/training/service.py` defines the API endpoints. The `training` function handles new job submissions. Notice `BackgroundTasks`: this is FastAPI's way of running long jobs without blocking the server.
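In shape, the endpoint looks roughly like this; a hedged sketch of the pattern, not the project's exact code:

```python
# A hedged sketch of the submission endpoint; details differ in
# alith/training/service.py, but the BackgroundTasks pattern is the point.
import uuid

from fastapi import BackgroundTasks, FastAPI

from alith.training.trainer import start_trainer
from alith.training.types import TrainingParams

app = FastAPI()

@app.post("/v1/training", status_code=202)
async def training(params: TrainingParams, background_tasks: BackgroundTasks):
    job_id = uuid.uuid4().hex[:8]  # unique job ID, e.g. "a1b2c3d4"
    # Queue the heavy lifting; the response goes out without waiting for it.
    background_tasks.add_task(start_trainer, job_id, params)
    return {"job_id": job_id}
```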
Finally, `alith/training/trainer.py` contains the `start_trainer` function. This is where the actual training happens. It's a wrapper that calls `run_exp` from `llama-factory`, a powerful open-source library specialized for fine-tuning models.
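In outline, the wrapper does something like the following; the `run_exp` import path and the exact hyperparameter mapping are assumptions based on llama-factory's public interface:

```python
# A hedged sketch of the delegation in alith/training/trainer.py.
from llamafactory.train.tuner import run_exp  # assumed import path

from alith.training.types import TrainingParams

def start_trainer(job_id: str, params: TrainingParams) -> None:
    # Translate our Pydantic params into the flat dict of hyperparameters
    # llama-factory expects, then hand off the actual fine-tuning loop.
    run_exp({
        "model_name_or_path": params.model,  # assumed field mapping
        "finetuning_type": "lora",
        "num_train_epochs": params.epochs,
        "output_dir": f"output/{job_id}",    # where the graduated model is saved
    })
```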
This shows a smart design: our project provides a user-friendly API and pipeline for managing jobs, while delegating the complex, low-level training algorithms to a specialized, best-in-class library.