The Fine-Tuning Method Decision
When you decide to fine-tune an LLM, the next question is: LoRA or full fine-tuning? This decision affects your GPU costs, training time, model quality, and deployment complexity.
At DeepQuantica, we have fine-tuned hundreds of models using both approaches. This guide shares what we have learned.
How Full Fine-Tuning Works
Full fine-tuning updates every parameter in the model during training. For a 7B model, that means updating all 7 billion parameters on every training step.
Requirements:
- Multiple GPUs with high VRAM (A100 80GB or better)
- Gradient memory for all parameters
- Optimizer states for all parameters (Adam keeps two extra states per parameter, adding roughly 2-3x the model size)
- Total: 80-100GB+ VRAM for a 7B model
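The memory requirements above can be sketched with a back-of-the-envelope calculation. This assumes the common mixed-precision setup (fp16 weights and gradients, fp32 Adam states); the byte counts per parameter are standard rules of thumb, not measurements:

```python
# Rough VRAM estimate for full fine-tuning with Adam in mixed precision.
# Byte counts per parameter are standard assumptions, not measurements,
# and activations/overhead are excluded.
def full_finetune_vram_gb(n_params: float) -> float:
    weights = n_params * 2    # fp16 weights: 2 bytes/param
    gradients = n_params * 2  # fp16 gradients: 2 bytes/param
    optimizer = n_params * 8  # Adam fp32 momentum + variance: 8 bytes/param
    return (weights + gradients + optimizer) / 1e9

print(full_finetune_vram_gb(7e9))  # 84.0 GB before activations
```

Activations and framework overhead push the real figure higher, which is why the 80-100GB+ range quoted above requires multiple high-VRAM GPUs.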
What you get:
- Maximum model capacity for new knowledge
- Potentially highest quality on complex tasks
- A single, self-contained set of updated weights, deployed like the base model
How LoRA Works
LoRA (Low-Rank Adaptation) freezes the original model weights and trains small adapter matrices injected into transformer layers. Instead of updating 7 billion parameters, you train a few million.
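A minimal sketch of the mechanism, using numpy rather than any specific library: the frozen weight W stays untouched, and a low-rank product B·A (scaled by alpha/r) is added to the layer's output. The dimensions below are illustrative:

```python
import numpy as np

# LoRA-adapted linear layer: y = x W^T + (alpha/r) * x A^T B^T.
# W is frozen; only A and B are trained. B starts at zero so the
# adapter initially contributes nothing.
rng = np.random.default_rng(0)
d, r, alpha = 4096, 16, 32               # hidden size, LoRA rank, alpha

W = rng.standard_normal((d, d))          # frozen pretrained weight
A = rng.standard_normal((r, d)) * 0.01   # trainable down-projection
B = np.zeros((d, r))                     # trainable up-projection

def lora_forward(x):
    return x @ W.T + (alpha / r) * (x @ A.T @ B.T)

trainable, frozen = A.size + B.size, W.size
print(f"trainable fraction: {trainable / frozen:.4%}")  # 0.7812%
```

For this layer, the adapter adds only 2·r·d trainable parameters against d² frozen ones, which is where the "few million instead of 7 billion" figure comes from.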
Requirements:
- Single GPU (A100 40GB for LoRA, or 16GB with QLoRA)
- Adapter parameters only (0.1-1% of total)
- Optimizer states for adapters only
- Total: 16-40GB VRAM for a 7B model
What you get:
- 95-99% of full fine-tuning quality on most tasks
- 10-100x less training cost
- Modular adapters that can be swapped
- Faster training iterations
Performance Comparison
Based on our production fine-tuning experience at DeepQuantica:
Task-Specific Quality
For domain-specific tasks (customer support, legal, medical, code):
- LoRA achieves 95-99% of full fine-tuning quality
- The gap narrows with higher LoRA rank (r=32-64)
- Quality is indistinguishable for most business applications
Knowledge Injection
For injecting substantial new knowledge:
- Full fine-tuning has a slight edge for very specialized domains
- LoRA with high rank (r=64) approaches full fine-tuning
- For most use cases, the difference is not meaningful
Style and Format Adaptation
For output formatting, writing style, and tone:
- LoRA and full fine-tuning perform equally well
- Even low-rank LoRA (r=8) captures style effectively
- This is LoRA's strongest use case
Cost Comparison
| Factor | Full Fine-Tuning (7B) | LoRA (7B) | QLoRA (7B) |
|--------|----------------------|-----------|------------|
| GPU Required | 2x A100 80GB | 1x A100 40GB | 1x T4 16GB |
| Training Time (5K examples) | 4-8 hours | 1-3 hours | 2-4 hours |
| GPU Cost per Run | $50-150 | $10-30 | $5-15 |
| Storage per Model | 14GB (full) | 50-200MB (adapter) | 50-200MB (adapter) |
| Experiments per Dollar | 1-2 | 5-10 | 10-20 |
LoRA enables 5-20x more experiments for the same budget. This means more iterations, better final models, and faster time to production.
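As a quick sanity check on the experiment-budget claim, here is the arithmetic using midpoint per-run costs from the table above (the costs are the table's estimates, not measurements):

```python
# Experiments a fixed budget buys, using midpoint per-run costs
# from the cost-comparison table (assumed figures).
cost_per_run = {"full": 100, "lora": 20, "qlora": 10}  # USD midpoints
budget = 300

runs = {method: budget // cost for method, cost in cost_per_run.items()}
print(runs)  # {'full': 3, 'lora': 15, 'qlora': 30}
```

The same $300 buys 3 full fine-tuning runs versus 15-30 LoRA/QLoRA runs, matching the 5-20x multiplier above.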
When to Choose Full Fine-Tuning
Full fine-tuning is justified when:
1. Pre-training continuation: When the base model has zero knowledge of your domain and needs significant knowledge injection (not just style adaptation)
2. Maximum absolute performance: In rare cases where 0.1-0.5% accuracy matters and budget is not constrained
3. Small models: For models under 1B parameters, full fine-tuning is affordable and can outperform LoRA
4. Unlimited budget: When GPU cost is genuinely not a concern
These scenarios represent less than 5% of production fine-tuning projects.
When to Choose LoRA
LoRA is the right choice when:
1. Most production fine-tuning: The default recommendation for 90%+ of use cases
2. Rapid iteration: Need to try multiple configurations quickly
3. Multi-task deployment: Multiple adapters on a single base model
4. Cost efficiency: Standard budget constraints apply
5. GPU constraints: Limited access to high-end GPUs
LoRA Best Practices (From Our Experience)
Rank Selection
- r=8: Formatting and style changes
- r=16: General-purpose fine-tuning (our default in SnapML Auto LLM)
- r=32: Domain knowledge injection
- r=64: Complex tasks requiring maximum LoRA capacity
Alpha Value
Set alpha = 2x rank as the starting point. SnapML Auto LLM determines optimal alpha automatically.
Target Modules
Always target all attention layers (q_proj, k_proj, v_proj, o_proj). For higher quality, also target MLP layers (gate_proj, up_proj, down_proj). SnapML targets all layers by default.
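The rank, alpha, and target-module guidance above can be folded into a small helper. This is a hypothetical sketch (the function and task labels are ours, not SnapML's API); the module names follow the common Llama-style naming used in the text:

```python
# Hypothetical helper encoding the best practices above:
# rank chosen by task type, alpha = 2 * rank, attention modules
# always targeted, MLP modules optionally added for higher quality.
ATTENTION_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]
MLP_MODULES = ["gate_proj", "up_proj", "down_proj"]

def lora_config(task: str, include_mlp: bool = True) -> dict:
    rank = {"style": 8, "general": 16, "knowledge": 32, "complex": 64}[task]
    return {
        "r": rank,
        "lora_alpha": 2 * rank,  # alpha = 2x rank as the starting point
        "target_modules": ATTENTION_MODULES
                          + (MLP_MODULES if include_mlp else []),
    }

print(lora_config("general"))
```

The returned dict mirrors the fields most LoRA toolkits expect, so it can be adapted to whichever training framework you use.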
Learning Rate
LoRA benefits from higher learning rates than full fine-tuning:
- Full fine-tuning: 5e-6 to 1e-5
- LoRA: 1e-4 to 3e-4
Merging for Production
At deployment time, merge LoRA weights into the base model for zero latency overhead. SnapML handles this automatically during deployment.
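The merge itself is a single weight update per adapted layer: W' = W + (alpha/r)·B·A. A numpy sketch (illustrative shapes) shows why the merged model is mathematically identical to base-plus-adapter but needs only one matmul at inference:

```python
import numpy as np

# Merging a LoRA adapter into the base weight: W' = W + (alpha/r) * B @ A.
# After merging, the forward pass is a single matmul, so serving has
# zero adapter overhead. Shapes are illustrative.
rng = np.random.default_rng(1)
d, r, alpha = 512, 8, 16
W = rng.standard_normal((d, d))
A = rng.standard_normal((r, d)) * 0.01
B = rng.standard_normal((d, r)) * 0.01

W_merged = W + (alpha / r) * B @ A

x = rng.standard_normal((3, d))
adapter_out = x @ W.T + (alpha / r) * (x @ A.T @ B.T)  # two-path forward
merged_out = x @ W_merged.T                            # single matmul
print(np.allclose(adapter_out, merged_out))  # True
```

Keeping the unmerged adapter file around (50-200MB, per the table above) still lets you swap or retrain it later without touching the base weights.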
LoRA and Full Fine-Tuning in SnapML
SnapML's Auto LLM uses LoRA by default for all fine-tuning:
- Automatic rank selection based on dataset size and task
- QLoRA for memory-constrained configurations
- Multiple adapter management and comparison
- One-click merge and deploy
For the rare cases requiring full fine-tuning, SnapML supports it on multi-GPU configurations through our engineering services.
Conclusion
For 95% of production LLM fine-tuning projects, LoRA is the right choice. It delivers comparable quality at a fraction of the cost, enables rapid iteration, and simplifies deployment with modular adapters. Full fine-tuning is reserved for edge cases where maximum absolute performance justifies the significantly higher compute cost. SnapML by DeepQuantica makes LoRA fine-tuning accessible through Auto LLM, handling configuration automatically so you can focus on your data and use case.