Trainers
deep-ml provides three trainer implementations for different use cases.
FabricTrainer
Recommended for most use cases. Uses Lightning Fabric for seamless distributed training.
Features
Distributed training (DDP, FSDP, DeepSpeed)
Mixed precision training
Multi-GPU support
Automatic device placement
Simple API
Basic Usage
from deepml.fabric_trainer import FabricTrainer

trainer = FabricTrainer(
    task=task,
    optimizer=optimizer,
    criterion=criterion,
    accelerator='auto',   # 'cpu', 'cuda', 'mps', 'gpu', 'tpu'
    strategy='auto',      # 'dp', 'ddp', 'fsdp', 'deepspeed'
    devices='auto',       # Number of devices or 'auto'
    precision='32-true',  # '16-mixed', '32-true', 'bf16-mixed'
    num_nodes=1           # For multi-node training
)
Configuration Options
Accelerator Options:
'cpu': CPU training
'cuda'/'gpu': Single or multi-GPU
'mps': Apple Silicon GPU
'tpu': Google Cloud TPU
'auto': Automatic selection
Strategy Options:
'dp': DataParallel (single-node)
'ddp': DistributedDataParallel (recommended)
'fsdp': Fully Sharded Data Parallel
'deepspeed': Microsoft DeepSpeed
'auto': Automatic selection
Precision Options:
'32-true': Full precision (FP32)
'16-mixed': Mixed precision (FP16)
'bf16-mixed': Mixed precision (BF16)
'64-true': Double precision (FP64)
Training Method
trainer.fit(
    train_loader=train_loader,
    val_loader=val_loader,
    epochs=50,
    save_model_after_every_epoch=10,
    metrics={'accuracy': Accuracy()},
    gradient_accumulation_steps=4,
    gradient_clip_value=1.0,      # Clip by value
    gradient_clip_max_norm=None,  # Clip by norm
    resume_from_checkpoint='path/to/checkpoint.pt',
    load_optimizer_state=True,
    load_scheduler_state=True,
    logger=mlflow_logger,
    non_blocking=True,
    image_inverse_transform=denormalize,
    logger_img_size=224
)
Advanced: Multi-Node Training
# Node 0
fabric run --node-rank=0 --num-nodes=2 --main-address=192.168.1.1 train.py
# Node 1
fabric run --node-rank=1 --num-nodes=2 --main-address=192.168.1.1 train.py
AcceleratorTrainer
Uses HuggingFace Accelerate for distributed training with additional flexibility.
Features
Same distributed strategies as FabricTrainer
Compatible with Accelerate CLI
Fine-grained control over gradient synchronization
Easy integration with Transformers
Basic Usage
from deepml.accelerator_trainer import AcceleratorTrainer

trainer = AcceleratorTrainer(
    task=task,
    optimizer=optimizer,
    criterion=criterion,
    lr_scheduler=lr_scheduler,  # Note: instance, not factory
    lr_scheduler_step_policy='epoch',
    accelerator_config={
        'gradient_accumulation_steps': 4,
        'mixed_precision': 'fp16',
        'device_placement': True,
        'split_batches': False
    }
)
Accelerator Configuration
accelerator_config = {
    # Gradient accumulation
    'gradient_accumulation_steps': 4,
    # Mixed precision
    'mixed_precision': 'fp16',  # 'no', 'fp16', 'bf16'
    # Device settings
    'device_placement': True,
    'split_batches': False,
    # Logging
    'log_with': 'tensorboard',
    'project_dir': './logs',
    # Advanced
    'dispatch_batches': None,
    'even_batches': True,
    'step_scheduler_with_optimizer': True
}
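The mixed_precision values above use different spellings from the precision strings listed for FabricTrainer. When moving a configuration from one trainer to the other, a small lookup can help. The sketch below is only an illustration: to_accelerate_precision and FABRIC_TO_ACCELERATE are hypothetical names, not part of deep-ml, Lightning Fabric, or Accelerate, and the mapping assumes that 'no' (mixed precision disabled) is the closest counterpart to '32-true'.

```python
# Hypothetical helper (not a deep-ml API): translate a Fabric-style precision
# string into the value used for 'mixed_precision' in accelerator_config.
FABRIC_TO_ACCELERATE = {
    "32-true": "no",       # full FP32, i.e. mixed precision disabled
    "16-mixed": "fp16",    # FP16 mixed precision
    "bf16-mixed": "bf16",  # BF16 mixed precision
}


def to_accelerate_precision(fabric_precision: str) -> str:
    """Return the accelerator_config value for a Fabric precision string."""
    try:
        return FABRIC_TO_ACCELERATE[fabric_precision]
    except KeyError:
        # '64-true' lands here: the options documented above list only
        # 'no', 'fp16' and 'bf16', so there is nothing to map it to.
        raise ValueError(
            f"No mixed_precision counterpart for {fabric_precision!r}"
        ) from None


print(to_accelerate_precision("bf16-mixed"))
```

Raising for unmapped values, rather than silently falling back to 'no', keeps a double-precision configuration from being downgraded unnoticed.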
Using Accelerate CLI
Create accelerate_config.yaml:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
mixed_precision: fp16
num_processes: 4
gpu_ids: all
Run training:
accelerate launch --config_file accelerate_config.yaml train.py
Learner (Deprecated)
Warning
This trainer is deprecated. Use FabricTrainer or AcceleratorTrainer instead.
Classic PyTorch trainer with manual device management.
from deepml.trainer import Learner

learner = Learner(
    task=task,
    optimizer=optimizer,
    criterion=criterion,
    lr_scheduler=lr_scheduler,
    use_amp=True  # Automatic Mixed Precision
)
learner.fit(
    train_loader=train_loader,
    val_loader=val_loader,
    epochs=50
)
Choosing a Trainer
Use FabricTrainer when:
You want the easiest distributed training setup
You’re starting a new project
You need multi-node training
You want Lightning ecosystem integration
Use AcceleratorTrainer when:
You’re using HuggingFace models or the wider HuggingFace ecosystem
You need fine-grained control over distributed training
You prefer the Accelerate CLI workflow
You’re migrating from existing Accelerate code
Don’t use Learner:
It’s deprecated and will be removed in future versions
Use FabricTrainer or AcceleratorTrainer for new projects
Common Training Patterns
Gradient Accumulation
Simulate larger batch sizes:
trainer.fit(
    train_loader=train_loader,
    val_loader=val_loader,
    epochs=50,
    gradient_accumulation_steps=8  # Effective batch size = 8 * batch_size
)
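The arithmetic in the comment above covers a single process. As a rough sketch, assuming a data-parallel strategy such as 'ddp' where every process draws its own batch from the loader, the number of samples contributing to one optimizer step also scales with the device and node counts. effective_batch_size is a hypothetical helper written for this illustration, not a deep-ml function.

```python
def effective_batch_size(per_device_batch_size: int,
                         accumulation_steps: int,
                         devices: int = 1,
                         num_nodes: int = 1) -> int:
    """Samples contributing to one optimizer step.

    Assumes each process loads its own batch of `per_device_batch_size`
    (as with DDP). Under 'dp' the loader batch is split across devices
    instead, so `devices` should be left at 1 in that case.
    """
    values = (per_device_batch_size, accumulation_steps, devices, num_nodes)
    if any(v < 1 for v in values):
        raise ValueError("all arguments must be positive integers")
    return per_device_batch_size * accumulation_steps * devices * num_nodes


# Single process: batch_size=32 with 8 accumulation steps
print(effective_batch_size(32, 8))                         # 256
# Same settings on 2 nodes with 4 devices each
print(effective_batch_size(32, 8, devices=4, num_nodes=2))  # 2048
```

Because the effective batch grows with the number of processes, the learning rate chosen for a single-GPU run may need retuning when the same script is launched on more devices.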
Gradient Clipping
Prevent exploding gradients:
# Clip by value
trainer.fit(
    ...,
    gradient_clip_value=1.0
)
# Clip by norm (recommended)
trainer.fit(
    ...,
    gradient_clip_max_norm=1.0
)
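The difference between the two modes is easiest to see on a small vector. The sketch below is plain Python on a flat list of numbers and is not how deep-ml or PyTorch implement clipping internally; clip_by_value and clip_by_norm are hypothetical names used only for this illustration. Clipping by value clamps each component independently, which can change the gradient's direction, while clipping by norm rescales the whole vector and keeps its direction.

```python
import math


def clip_by_value(grads, clip_value):
    """Clamp every component to the range [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, g)) for g in grads]


def clip_by_norm(grads, max_norm):
    """Rescale the vector so its L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)  # already within the limit, nothing to do
    scale = max_norm / total_norm
    return [g * scale for g in grads]


grads = [3.0, -4.0]  # L2 norm is 5.0

# Components are clamped one by one: the 3:4 ratio becomes 1:1
print(clip_by_value(grads, 1.0))

# The vector is shrunk to norm 1.0 (roughly [0.6, -0.8]): the ratio is kept
print(clip_by_norm(grads, 1.0))
```

Preserving the direction of the update is the usual argument for preferring norm-based clipping.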
Learning Rate Scheduling
from torch.optim.lr_scheduler import CosineAnnealingLR

# For FabricTrainer: use factory function
lr_scheduler_fn = lambda opt: CosineAnnealingLR(opt, T_max=50)
trainer = FabricTrainer(
    ...,
    lr_scheduler_fn=lr_scheduler_fn,
    lr_scheduler_step_policy='epoch'  # or 'step'
)

# For AcceleratorTrainer: use instance
lr_scheduler = CosineAnnealingLR(optimizer, T_max=50)
trainer = AcceleratorTrainer(
    ...,
    lr_scheduler=lr_scheduler,
    lr_scheduler_step_policy='epoch'
)
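To see how T_max interacts with the step policy, it can help to evaluate the cosine schedule by hand. The sketch below uses the closed-form expression from the CosineAnnealingLR documentation in plain Python, without torch. cosine_annealing_lr is a hypothetical helper, and the reading of the 'step' policy (the scheduler advancing once per optimizer step rather than once per epoch) is an assumption about deep-ml, not something stated above.

```python
import math


def cosine_annealing_lr(base_lr: float, t: int, t_max: int,
                        eta_min: float = 0.0) -> float:
    """Closed-form cosine annealing: learning rate after t scheduler steps."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t / t_max))


base_lr = 0.1

# Policy 'epoch' with T_max=50: t counts epochs, so the rate reaches
# eta_min at the end of a 50-epoch run.
for epoch in (0, 25, 50):
    print(epoch, cosine_annealing_lr(base_lr, epoch, t_max=50))

# Assumed policy 'step': t counts optimizer steps, so T_max would need to be
# epochs * steps_per_epoch to span the same run (200 steps/epoch is made up).
steps_per_epoch = 200
t_max_steps = 50 * steps_per_epoch
print(cosine_annealing_lr(base_lr, t_max_steps // 2, t_max=t_max_steps))
```

If T_max is left at 50 while the scheduler advances every step, the schedule would bottom out after 50 steps instead of 50 epochs, so the two settings are best chosen together.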
Resume Training
trainer.fit(
    ...,
    resume_from_checkpoint='./checkpoints/best_val_model.pt',
    load_optimizer_state=True,
    load_scheduler_state=True
)
Checkpoint Management
trainer.fit(
    ...,
    save_model_after_every_epoch=10  # Save every 10 epochs
)
# Checkpoints saved:
# - best_val_model.pt (best validation loss)
# - epoch_10_model.pt, epoch_20_model.pt, ...
# - latest_model.pt (most recent)
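For planning disk usage, the periodic files implied by the naming pattern above can be listed ahead of time. This is a sketch under stated assumptions: periodic_checkpoint_names is a hypothetical helper, epochs are assumed to be numbered from 1, and the final epoch is assumed to be saved when it is a multiple of the interval. None of this is confirmed by the deep-ml source.

```python
def periodic_checkpoint_names(epochs: int, every: int) -> list:
    """File names expected from save_model_after_every_epoch=`every`.

    Follows the 'epoch_<n>_model.pt' pattern from the comment above and
    assumes 1-based epoch numbers (an assumption, not verified).
    """
    if epochs < 1 or every < 1:
        raise ValueError("epochs and every must be positive integers")
    return [f"epoch_{n}_model.pt" for n in range(every, epochs + 1, every)]


names = periodic_checkpoint_names(epochs=50, every=10)
print(names)
# Two more files, best_val_model.pt and latest_model.pt, are overwritten
# in place according to the list above.
print(len(names) + 2)
```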