What Is PEFT?
PEFT (Parameter-Efficient Fine-Tuning) is a family of techniques that adapt large language models to specific tasks by training only a small fraction of the total parameters. Instead of updating billions of weights during fine-tuning, PEFT methods train millions or even thousands of parameters while keeping the original model frozen.
This dramatically reduces GPU memory requirements, training time, and storage costs while achieving comparable quality to full fine-tuning.
Why PEFT Matters
The Full Fine-Tuning Problem
Fine-tuning a 7B parameter model with full weight updates requires:
- 28GB just for model weights (FP32)
- 28GB for gradients (FP32) plus 56GB for Adam's two optimizer-state buffers
- Total: 112GB+ VRAM before activation memory, for a single training run
For 70B models, you need a cluster of 8+ A100 80GB GPUs. This is expensive and impractical for most organizations.
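The memory figures above follow from simple arithmetic. A back-of-the-envelope sketch, assuming FP32 weights, FP32 gradients, and Adam's two FP32 moment buffers (activation memory ignored):

```python
# Back-of-the-envelope VRAM for full fine-tuning with Adam (activations ignored).
def full_finetune_vram_gb(n_params: float, bytes_per_param: int = 4) -> float:
    weights = n_params * bytes_per_param   # FP32 model weights
    grads = n_params * bytes_per_param     # FP32 gradients
    adam = 2 * n_params * bytes_per_param  # Adam's two moment buffers
    return (weights + grads + adam) / 1e9  # decimal GB

print(full_finetune_vram_gb(7e9))   # 112.0 GB for a 7B model
print(full_finetune_vram_gb(70e9))  # 1120.0 GB for a 70B model
```

Mixed-precision setups shift these numbers around, but the total stays in the same order of magnitude, which is why full fine-tuning is out of reach for single-GPU setups.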
The PEFT Solution
With LoRA (rank 16), the same 7B model requires:
- 14GB for model weights (FP16)
- Tens of MB for the trainable LoRA parameters (exact size depends on rank and target modules)
- Total: ~16GB VRAM
With QLoRA, it drops to ~6GB, fitting on consumer GPUs.
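The adapter footprint can be estimated from the model's shape. A sketch for a Llama-7B-like architecture (assumed: 32 layers, hidden size 4096, adapters on the four attention projections only; the exact size varies with rank and which modules are targeted):

```python
# LoRA adapter parameter count for an assumed Llama-7B-like shape.
def lora_adapter_params(n_layers=32, hidden=4096, rank=16, n_proj=4):
    per_matrix = 2 * hidden * rank  # A (rank x hidden) + B (hidden x rank)
    return n_layers * n_proj * per_matrix

p = lora_adapter_params()
print(p)                  # 16777216 trainable params (~0.24% of 7B)
print(p * 2 / 1e6, "MB")  # ~33.6 MB stored in FP16
```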
PEFT Techniques
LoRA (Low-Rank Adaptation)
How it works: Freezes the original weights and injects trainable low-rank matrices into transformer layers. The weight update is decomposed as W' = W + BA, where B (d×r) and A (r×d) are small matrices whose rank r is far smaller than the model dimension d.
Key parameters:
- Rank (r): Controls adapter capacity (8-64 typical)
- Alpha: Scaling factor (usually 2x rank)
- Target modules: Which layers get adapters
When to use: Most production fine-tuning scenarios. The default recommendation for 90%+ of use cases.
Used in SnapML: Yes, as the primary fine-tuning method in Auto LLM.
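The decomposition can be sketched in a few lines of NumPy. Shapes and the zero initialization of B follow the LoRA paper; all other values here are toy:

```python
import numpy as np

# Minimal LoRA forward pass (sketch): y = x @ (W + (alpha/r) * B @ A).T
rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 64, 64, 8, 16

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, r))                   # trainable, zero init

x = rng.standard_normal((1, d_in))
y = x @ (W + (alpha / r) * B @ A).T

# Because B starts at zero, the adapted layer initially matches the frozen
# model exactly; training moves it away from that starting point.
assert np.allclose(y, x @ W.T)
```

Only A and B receive gradients, so the optimizer state scales with the adapter, not with W.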
QLoRA (Quantized LoRA)
How it works: Combines LoRA with 4-bit model quantization. The base model is stored in 4-bit NormalFloat (NF4) format while LoRA adapters train in higher precision.
Advantages over LoRA:
- 4x less GPU memory for the base model
- Double quantization for additional savings
- Paged optimizers for memory spike handling
When to use: When GPU memory is limited. Training 7B models on 16GB GPUs or 70B models on single A100s.
Used in SnapML: Yes, selectable as a training option or auto-selected by Auto LLM when GPU memory is constrained.
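The base-model storage saving is straightforward arithmetic. Illustrative only: real NF4 storage also carries per-block quantization constants (and double-quantized scales), which add a small overhead ignored here:

```python
# Base-model storage for 7B parameters at different precisions (decimal GB).
n_params = 7e9
fp16_gb = n_params * 2 / 1e9   # 2 bytes per parameter
nf4_gb = n_params * 0.5 / 1e9  # 4 bits per parameter
print(fp16_gb, nf4_gb)         # 14.0 3.5
```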
Prefix Tuning
How it works: Prepends trainable vectors (prefixes) to the key and value representations in each transformer layer. The model learns task-specific context through these prefixes.
Advantages:
- Very few trainable parameters
- Task switching by swapping prefixes
Limitations:
- Lower quality than LoRA on most tasks
- Less mature tooling support
When to use: Multi-task scenarios where you need to switch between many tasks efficiently.
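The mechanism can be sketched for a single attention layer: trainable prefix keys and values are concatenated in front of the token-derived keys and values, while queries still come only from the tokens. A toy NumPy sketch (all shapes and values illustrative):

```python
import numpy as np

# Prefix tuning sketch for one attention layer.
rng = np.random.default_rng(0)
d, seq_len, k_pref = 16, 6, 4

P_k = rng.standard_normal((k_pref, d)) * 0.02  # trainable prefix keys
P_v = rng.standard_normal((k_pref, d)) * 0.02  # trainable prefix values

Q = rng.standard_normal((seq_len, d))  # from frozen projections
K = np.concatenate([P_k, rng.standard_normal((seq_len, d))], axis=0)
V = np.concatenate([P_v, rng.standard_normal((seq_len, d))], axis=0)

scores = Q @ K.T / np.sqrt(d)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
out = weights @ V
print(out.shape)  # (6, 16): output length unchanged, context enriched
```

Swapping tasks means swapping (P_k, P_v), which is why task switching is cheap.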
Prompt Tuning
How it works: Trains a small set of continuous "soft prompt" vectors that are prepended to the input embedding. Simpler than prefix tuning but less expressive.
When to use: Simple task adaptation where you need minimal overhead.
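In contrast to prefix tuning, the trainable vectors live only at the input layer. A toy sketch (shapes illustrative):

```python
import numpy as np

# Prompt tuning sketch: k trainable "soft prompt" vectors are prepended to
# the token embeddings; only `soft_prompt` receives gradients.
rng = np.random.default_rng(0)
d_model, k, seq_len = 32, 8, 10

soft_prompt = rng.standard_normal((k, d_model)) * 0.02  # trainable
token_embeds = rng.standard_normal((seq_len, d_model))  # frozen embedding table

model_input = np.concatenate([soft_prompt, token_embeds], axis=0)
print(model_input.shape)  # (18, 32): k + seq_len positions
```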
Adapter Layers
How it works: Inserts small feedforward networks (adapters) inside each transformer layer, typically after the attention and feedforward sublayers. Each adapter has a down-projection, a nonlinearity, and an up-projection, wrapped in a residual connection.
When to use: Legacy approach, largely superseded by LoRA for LLM fine-tuning.
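The bottleneck structure is easy to sketch. A toy NumPy version (the zero-initialized up-projection, making the adapter start as an identity function, follows common practice; everything else is illustrative):

```python
import numpy as np

# Bottleneck adapter sketch: down-project, nonlinearity, up-project,
# with a residual connection around the adapter.
rng = np.random.default_rng(0)
d_model, bottleneck = 64, 8

W_down = rng.standard_normal((d_model, bottleneck)) * 0.01
W_up = np.zeros((bottleneck, d_model))  # zero init: adapter starts as identity

def adapter(h):
    return h + np.maximum(h @ W_down, 0.0) @ W_up  # ReLU bottleneck + residual

h = rng.standard_normal((1, d_model))
assert np.allclose(adapter(h), h)  # identity at initialization
```

Unlike LoRA, the adapter sits in the forward path as an extra sequential computation, which is the source of the inference latency noted in the table below.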
Comparison Table
| Method | Trainable Params | Memory Savings | Quality | Inference Overhead |
|--------|------------------|----------------|---------|-------------------|
| Full Fine-Tuning | 100% | None | Baseline | None |
| LoRA | 0.1-1% | 60-80% | 95-100% of full | None (can merge) |
| QLoRA | 0.1-1% | 85-95% | 93-99% of full | Quantization loss |
| Prefix Tuning | 0.01-0.1% | 70-85% | 85-95% of full | Minor |
| Prompt Tuning | <0.01% | 75-90% | 80-90% of full | Minor |
| Adapters | 1-5% | 40-60% | 90-97% of full | ~5% latency |
LoRA Best Practices
Based on our experience at DeepQuantica fine-tuning hundreds of models:
Rank Selection
- r=8: Simple style adaptation, formatting changes
- r=16: General-purpose fine-tuning (our default)
- r=32: Complex domain knowledge injection
- r=64: Maximum capacity for difficult tasks
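Rank scales adapter size linearly, and even the largest setting stays well under 1% of the model. For an assumed Llama-7B-like shape (32 layers, hidden 4096, adapters on the four attention projections):

```python
# Trainable-parameter count vs. LoRA rank (assumed Llama-7B-like shape).
for r in (8, 16, 32, 64):
    params = 32 * 4 * 2 * 4096 * r  # layers x projections x (A + B)
    print(f"r={r}: {params / 1e6:.1f}M params ({params / 7e9:.3%} of 7B)")
```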
Target Modules
Always include all attention projections (q_proj, k_proj, v_proj, o_proj). For higher quality, also include MLP layers (gate_proj, up_proj, down_proj). SnapML's Auto LLM targets all modules by default.
Learning Rate
LoRA benefits from higher learning rates than full fine-tuning:
- Full fine-tuning: 5e-6 to 1e-5
- LoRA: 1e-4 to 3e-4
Merging
At inference time, LoRA weights can be merged into the base model with zero overhead. SnapML handles this automatically during deployment.
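The merge is a single matrix addition, which is why the overhead disappears. A NumPy sketch with toy shapes and random adapter values:

```python
import numpy as np

# Merging sketch: fold (alpha/r) * B @ A into W so inference uses one matmul.
rng = np.random.default_rng(0)
d, r, alpha = 64, 8, 16

W = rng.standard_normal((d, d))
A = rng.standard_normal((r, d)) * 0.01
B = rng.standard_normal((d, r)) * 0.01

W_merged = W + (alpha / r) * B @ A  # one-time merge at deploy

x = rng.standard_normal((5, d))
adapter_out = x @ W.T + (alpha / r) * (x @ A.T) @ B.T  # adapter path at runtime
merged_out = x @ W_merged.T                            # merged path
assert np.allclose(adapter_out, merged_out)
```

The merged matrix has the same shape as W, so deployment artifacts are identical to the base model's; the trade-off is that a merged model can no longer hot-swap adapters.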
PEFT in SnapML
SnapML leverages PEFT through its Auto LLM feature:
1. Automatic method selection: LoRA by default, QLoRA when memory-constrained
2. Optimal configuration: Auto-tuned rank, alpha, and target modules
3. Multi-adapter support: Multiple LoRA adapters on a single base model
4. Merge on deploy: Automatic weight merging for zero-overhead inference
5. Adapter management: Version, compare, and switch between fine-tuned adapters
Conclusion
PEFT techniques, especially LoRA and QLoRA, have made LLM fine-tuning practical for organizations of all sizes. They deliver 95%+ of full fine-tuning quality at a fraction of the cost and compute. SnapML by DeepQuantica builds on these techniques in its Auto LLM feature, making parameter-efficient fine-tuning accessible without requiring deep knowledge of the underlying methods.