LLM Fine-Tuning Playbook: When, Why & How
A comprehensive guide to adapting large language models for your specific domain, style, and performance requirements. From dataset preparation through production deployment, everything you need to make fine-tuning work at enterprise scale.
What Is LLM Fine-Tuning?
Fine-tuning is the process of taking a pre-trained large language model and further training it on a curated dataset to specialise its behaviour for a specific task, domain, or style. Unlike pre-training — which teaches a model language from scratch using trillions of tokens — fine-tuning adjusts existing knowledge using hundreds to thousands of carefully crafted examples. The result is a model that retains broad language understanding while excelling at your particular use case.
It is important to distinguish fine-tuning from two related but different techniques. Prompt engineering adjusts model behaviour at inference time by crafting better instructions — no model weights change. RAG (retrieval-augmented generation) supplies external knowledge at query time to ground responses in specific documents. Fine-tuning changes the model itself, permanently altering how it reasons, writes, and responds. Each technique has its place, and production systems often combine all three.
When Generic Models Fall Short
General-purpose LLMs are trained on broad internet data. They write competent prose, answer general knowledge questions, and follow instructions reasonably well. But they struggle with tasks that require consistent adherence to a specific writing style, deep understanding of domain-specific terminology, structured output formats that must be followed precisely, or reasoning patterns unique to your industry. A legal AI that drafts contracts in your firm's house style, a medical AI that uses ICD-10 codes correctly, or a financial AI that follows your institution's compliance language — these require fine-tuning. Learn more in our LLM fine-tuning glossary entry.
When to Fine-Tune (Decision Framework)
Fine-tuning is powerful but not always the right choice. It requires dataset curation, GPU compute, evaluation infrastructure, and ongoing maintenance. Before committing, run through this decision framework to confirm that fine-tuning is the right investment for your use case.
Fine-Tune When...
Consistent Style or Tone
Your application requires outputs that match a specific brand voice, writing style, or formatting convention that prompting alone cannot reliably achieve. Fine-tuning bakes the style into the model weights, eliminating prompt-level workarounds.
Domain-Specific Reasoning
The model needs to understand specialised terminology, apply domain-specific logic, or follow workflows unique to your industry (legal analysis, medical coding, financial modelling). Generic models lack these capabilities even with detailed prompts.
Latency Constraints
Fine-tuning a smaller model (7B-13B parameters) to match a larger model's quality on your specific task lets you deploy faster inference with lower cost. A fine-tuned 7B model often outperforms a prompted 70B model on narrow tasks.
Data Privacy Requirements
Fine-tuning an open-source model and deploying it on-premises keeps all data within your infrastructure. No prompts or completions are sent to third-party APIs, satisfying strict data residency and sovereignty requirements.
Don't Fine-Tune When...
Knowledge Changes Frequently
If the information the model needs changes weekly or daily, fine-tuning cannot keep up. Use RAG instead — it retrieves current documents at query time without retraining. Fine-tuned knowledge is frozen at training time.
Small or Low-Quality Dataset
Fine-tuning with fewer than 200 examples or with noisy, inconsistent data is likely to degrade model performance rather than improve it. Invest in dataset quality first. If you cannot curate 500+ high-quality examples, prompt engineering is safer.
General Tasks Work Fine
If the base model with good prompting already meets your quality bar, fine-tuning adds cost and complexity without meaningful improvement. Always establish a prompted baseline before deciding to fine-tune.
No Evaluation Framework
Without clear metrics and test sets, you cannot tell if fine-tuning improved things. Build your evaluation pipeline before your training pipeline. For a detailed comparison, read our RAG vs fine-tuning guide.
Dataset Preparation
Dataset quality is the single biggest determinant of fine-tuning success. A small, clean, well-structured dataset consistently outperforms a large, noisy one. Plan to spend 60-70% of your fine-tuning project timeline on data preparation — this is where the real work happens.
Data Collection Strategies
Start by mining your existing workflows. Customer support logs, internal documentation, expert-written reports, and quality-reviewed outputs from your current AI tools are all rich sources. Structure each example as an instruction-response pair: the instruction describes what the model should do, and the response is the ideal output. For classification tasks, include the input text and the correct label. For generation tasks, provide both the prompt context and the gold-standard output.
Quality Over Quantity
Research consistently shows that 1,000 high-quality examples outperform 10,000 mediocre ones. Each example should be reviewed by a domain expert and meet these criteria: the instruction is unambiguous, the response is correct and complete, the formatting matches your target output, and there are no factual errors or inconsistencies. Deduplicate your dataset to avoid the model memorising repeated patterns at the expense of generalisation.
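As a concrete illustration, the deduplication step might look like the following minimal sketch. The `instruction`/`response` field names are assumptions for illustration, not a prescribed schema — this catches exact and near-exact repeats after light normalisation, while fuzzy near-duplicates need embedding- or MinHash-based approaches.

```python
import hashlib

def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def deduplicate(examples: list[dict]) -> list[dict]:
    """Drop examples whose (instruction, response) pair repeats after normalisation."""
    seen, unique = set(), []
    for ex in examples:
        key = hashlib.sha256(
            (normalise(ex["instruction"]) + "\x1f" + normalise(ex["response"])).encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique
```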
Data Formatting
Most fine-tuning frameworks expect data in a standard chat format with system, user, and assistant messages. For instruction tuning, use the Alpaca format (instruction, input, output) or the ShareGPT format (multi-turn conversations). Consistency is critical — every example should follow the exact same template. Validate your dataset programmatically before training to catch formatting errors that would silently degrade model quality.
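A programmatic validator for chat-format JSONL might look like this sketch. It assumes an OpenAI-style `messages` list with `role`/`content` keys — adapt the checks to whichever template your framework expects:

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_line(line: str) -> list[str]:
    """Return a list of problems for one JSONL record (empty list = valid)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    errors = []
    for i, msg in enumerate(messages):
        if msg.get("role") not in ALLOWED_ROLES:
            errors.append(f"message {i}: unknown role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            errors.append(f"message {i}: empty content")
    if not errors and messages[-1].get("role") != "assistant":
        errors.append("last message must be the assistant response")
    return errors
```

Run it over every line before training; a single malformed record can silently shift what the model learns without raising an error in the training loop.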
Synthetic Data Generation
When real examples are scarce, use a stronger model (like GPT-4 or Claude) to generate synthetic training data. Provide the model with a few real examples and ask it to create variations. This technique — sometimes called model distillation — can bootstrap a dataset from 50 real examples to 2,000 synthetic ones. Always validate synthetic data with human review; models occasionally introduce subtle errors that propagate through training.
Training Methods
The training method you choose affects GPU requirements, training time, and the quality ceiling of your fine-tuned model. The trend in the industry is toward parameter-efficient methods (LoRA, QLoRA) that deliver near-full-fine-tuning quality at a fraction of the compute cost. Understanding the trade-offs helps you make the right call for your constraints.
| Method | Parameters Trained | GPU Memory | Quality Delta | Best For |
|---|---|---|---|---|
| Full Fine-Tuning | All | High (8x A100 for 70B) | Highest | Maximum performance when budget and infrastructure allow |
| LoRA | 1-10% | Moderate (1-2x A100 for 70B) | Near-full | Production fine-tuning with limited GPU budget |
| QLoRA | 1-10% (quantised base) | Low (1x A100 for 70B) | Good | Fine-tuning large models on consumer or single-GPU hardware |
| Instruction Tuning | Varies | Varies | Task-specific | Teaching models to follow instructions in a specific format |
| RLHF / DPO | All or LoRA | High | Alignment-focused | Aligning model outputs with human preferences and safety |
Full Fine-Tuning
Full fine-tuning updates every parameter in the model. This gives the optimiser maximum flexibility to adapt the model but requires significant GPU memory — at minimum about 4 bytes per parameter just for half-precision weights and gradients, before optimiser state is counted. For a 70B parameter model, that means at least 280 GB of GPU memory, requiring multiple A100 80GB GPUs. Use full fine-tuning only when you need maximum quality and have the infrastructure to support it.
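The arithmetic behind that 280 GB figure is worth making explicit — 2 bytes per parameter for half-precision weights plus 2 bytes for half-precision gradients, with Adam-style optimiser state adding several more bytes per parameter on top. A back-of-envelope sketch:

```python
def training_memory_gb(params_billion: float, bytes_per_param: float = 4.0) -> float:
    """Rough memory floor for full fine-tuning: half-precision weights (2 bytes)
    plus half-precision gradients (2 bytes) per parameter. Optimiser state and
    activations add substantially more on top of this floor."""
    return params_billion * bytes_per_param

print(training_memory_gb(70))  # 280.0 GB floor for a 70B model
print(training_memory_gb(7))   # 28.0 GB floor for a 7B model
```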
LoRA (Low-Rank Adaptation)
LoRA freezes the original model weights and injects small trainable matrices (rank 8-64) into the attention layers. Only these matrices are updated during training, reducing trainable parameters by 90-99%. The frozen base model is loaded in full precision or half-precision, and only the adapter weights need gradient computation. This cuts memory requirements by 50-75% compared to full fine-tuning. LoRA adapters are typically 10-100 MB in size, making them easy to version, share, and swap.
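The parameter savings are easy to verify. For a single d_out × d_in weight matrix, LoRA trains only the two low-rank factors B (d_out × r) and A (r × d_in). A sketch — the 4096 dimension is illustrative, roughly the hidden size of a 7B-class model:

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA replaces the update to a d_out x d_in weight matrix with two
    low-rank factors, B (d_out x r) and A (r x d_in); only B and A train."""
    return rank * (d_in + d_out)

full = 4096 * 4096                               # one dense attention projection
lora = lora_trainable_params(4096, 4096, rank=16)
print(f"{lora / full:.2%} of the full matrix")   # ~0.78% at rank 16
```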
QLoRA (Quantised LoRA)
QLoRA takes LoRA further by quantising the frozen base model to 4-bit precision using the NormalFloat4 (NF4) data type. The LoRA adapters still train in 16-bit precision for stability, but the base model footprint shrinks by 4x. A 70B model that requires 140 GB in half-precision fits in roughly 35 GB with 4-bit quantisation, making it trainable on a single A100 80GB GPU. Quality is slightly below full LoRA but remains strong for most practical tasks.
RLHF and DPO
Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) align model outputs with human preferences. Instead of showing the model the "right answer," you present pairs of outputs and indicate which one humans preferred. RLHF trains a reward model and uses PPO to optimise the LLM against it. DPO simplifies this by directly optimising the language model on preference pairs without a separate reward model. Both are powerful for safety alignment, reducing harmful outputs, and steering the model toward preferred response patterns.
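The DPO objective for a single preference pair can be written out directly. This sketch assumes you already have summed log-probabilities of the chosen and rejected responses under both the policy being trained and the frozen reference model:

```python
import math

def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """DPO loss for one preference pair. Minimising it pushes the policy to
    prefer the chosen response more strongly than the reference model does;
    beta controls how far the policy may drift from the reference."""
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log(sigmoid(beta * margin))
```

In practice libraries such as Hugging Face TRL compute this over batches of tokenised pairs, but the per-pair loss is exactly this expression.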
Model Selection
Your choice of base model determines the performance ceiling, deployment options, and licensing constraints of your fine-tuned system. The open-source ecosystem has matured dramatically — models like Llama 3 and Mistral now approach proprietary model quality on many benchmarks, and their permissive licenses enable on-premises deployment.
| Model | License | Strengths |
|---|---|---|
| Llama 3 (8B / 70B) | Meta Community | Strong baseline, large community, flexible deployment |
| Mistral (7B / 8x7B) | Apache 2.0 | Excellent quality-to-size ratio, mixture-of-experts option |
| Phi-3 (3.8B / 14B) | MIT | Small footprint, strong reasoning for its size, edge deployable |
| GPT-4o / GPT-4o-mini | Proprietary API | Highest baseline quality, simple API-based fine-tuning |
| Claude (via partner) | Proprietary API | Strong instruction following, long context window |
Decision Criteria
Consider four factors when choosing a base model. First, licensing: if the model must run on your infrastructure without usage restrictions, choose an open-source model with an Apache 2.0 or MIT license. Second, size: larger models generally perform better but cost more to serve. For latency-sensitive applications, a fine-tuned 7B model often delivers better value than a prompted 70B model. Third, community support: models with active fine-tuning communities (Llama, Mistral) have more tutorials, adapters, and debugging resources. Fourth, baseline performance: always benchmark the base model on your task with prompting alone before fine-tuning — if it already scores 90% on your evaluation, the marginal gain from fine-tuning may not justify the investment.
Training Infrastructure
Fine-tuning requires GPU compute, and GPU cost and availability remain a primary constraint for most organisations. The good news is that parameter-efficient methods (LoRA, QLoRA) have dramatically lowered the hardware bar. Understanding your options helps you budget accurately and avoid over-provisioning.
GPU Requirements by Model Size
7B Models
Full fine-tuning: 1x A100 80GB (40GB is workable with an 8-bit optimiser or optimiser offloading). LoRA: 1x A100 40GB or RTX 4090 24GB. QLoRA: 1x RTX 4090 or even RTX 3090 24GB. Training time: 1-4 hours for 10K examples.
13B-30B Models
Full fine-tuning: 2-4x A100 80GB. LoRA: 1x A100 80GB. QLoRA: 1x A100 40GB or 2x RTX 4090. Training time: 4-12 hours for 10K examples.
70B+ Models
Full fine-tuning: 8x A100 80GB. LoRA: 2x A100 80GB. QLoRA: 1x A100 80GB. Training time: 8-24 hours for 10K examples. Use DeepSpeed or FSDP for distributed training.
Cloud Options
All major cloud providers offer GPU instances suitable for fine-tuning. AWS provides p4d (A100) and p5 (H100) instances. GCP offers A2 (A100) and A3 (H100) machine types. Azure has NC-series (A100) and ND-series (H100) VMs. For cost-sensitive experiments, spot/preemptible instances reduce costs by 60-70% with the risk of interruption — use checkpointing to resume training after preemption. Specialised platforms like Lambda Labs, RunPod, and Together AI often offer lower per-GPU-hour pricing than hyperscalers.
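A minimal checkpoint-and-resume pattern for preemptible instances might look like the following pure-Python sketch. A real training run would checkpoint model and optimiser state (e.g. via your training framework's save utilities) to durable storage rather than JSON on local disk:

```python
import json, os

CKPT = "checkpoint.json"  # in practice, write to durable storage (e.g. S3), not local disk

def save_checkpoint(step: int, state: dict) -> None:
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, CKPT)  # atomic rename: a preemption never leaves a partial file

def load_checkpoint() -> tuple[int, dict]:
    if not os.path.exists(CKPT):
        return 0, {}
    with open(CKPT) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]

# Resume from wherever the last spot interruption left off.
start_step, state = load_checkpoint()
for step in range(start_step, 100):
    state["loss"] = 1.0 / (step + 1)   # stand-in for a real training step
    if step % 10 == 0:
        save_checkpoint(step + 1, state)
```

The atomic-rename detail matters: a preemption mid-write must never corrupt the only checkpoint you have.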
MLOps Tooling
Track experiments with Weights & Biases or MLflow to log hyperparameters, training curves, and evaluation metrics across runs. Use Hugging Face's Transformers and TRL libraries as your training framework — they support LoRA, QLoRA, DPO, and RLHF out of the box. Version your datasets, model checkpoints, and adapter weights in a model registry (Hugging Face Hub, MLflow, or a custom S3 bucket with versioning). Reproducibility is critical for debugging and for satisfying audit requirements.
Evaluation & Testing
A fine-tuned model is only as good as your ability to measure its improvement over the base model. Build your evaluation framework before you start training — it defines the success criteria that determine whether a training run is worth deploying. Without rigorous evaluation, you are flying blind.
Task-Specific Metrics
Choose metrics that directly measure task performance. For text generation: BLEU, ROUGE, and BERTScore measure output similarity to reference texts. For classification: accuracy, precision, recall, and F1 score on a held-out test set. For code generation: pass@k (percentage of generated solutions that pass unit tests). For structured output: exact-match accuracy and schema validation rates. Always compare the fine-tuned model against the prompted base model on the same test set.
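Of these, pass@k is the least obvious to compute correctly: sampling k of n generations and checking naively produces a biased estimate. The standard unbiased estimator, given n generations of which c pass, is:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations passes, given c of the n pass.
    Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so some sample must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=4, k=5))
```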
Human Evaluation
Automated metrics capture surface-level quality but miss nuance that matters to end users. Run blind A/B evaluations where domain experts compare outputs from the base model and the fine-tuned model without knowing which is which. Score on a rubric covering correctness, style adherence, completeness, and safety. A sample of 100-200 evaluation pairs provides statistically meaningful results. Track win rates and use them to decide whether to ship the fine-tuned model.
Regression Testing
Fine-tuning can improve target-task performance while degrading general capabilities — a phenomenon known as catastrophic forgetting. Maintain a regression test suite that covers general language tasks (summarisation, Q&A, instruction following) alongside your domain-specific tests. If the fine-tuned model drops more than 5% on general benchmarks, reduce the learning rate, limit training epochs, or switch to LoRA to preserve more of the original model's capabilities.
A/B Testing in Production
Lab evaluations do not always predict production performance. Deploy the fine-tuned model alongside the base model and route a percentage of traffic to each. Measure user satisfaction (thumbs up/down, task completion rates), response latency, and cost per query. Gradually increase traffic to the fine-tuned model as confidence builds. Maintain the ability to instantly roll back to the base model if issues emerge.
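Sticky, deterministic traffic splitting is the key mechanical detail: hashing a stable user ID keeps each user on one variant across requests, which keeps A/B metrics clean and makes ramp-ups and rollbacks predictable. A sketch (the variant names are illustrative):

```python
import hashlib

def route(user_id: str, finetuned_fraction: float) -> str:
    """Deterministically assign a user to the base or fine-tuned model.
    The same user always lands in the same bucket, and raising
    finetuned_fraction only ever moves users in one direction."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return "finetuned" if bucket < finetuned_fraction * 10_000 else "base"
```

Rolling back is then a config change — set the fraction to 0.0 and every request returns to the base model on its next call.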
Production Deployment
Deploying a fine-tuned model to production is more than copying weights to a server. You need an inference stack that delivers low latency at acceptable cost, monitoring to detect quality drift, versioning to manage model updates, and rollback capability for when things go wrong.
Model Serving Frameworks
vLLM is the current standard for high-throughput LLM inference, using PagedAttention to maximise GPU utilisation. Text Generation Inference (TGI) by Hugging Face offers a production-ready HTTP API with built-in batching and streaming. NVIDIA Triton Inference Server provides the most flexibility for multi-model deployments. For LoRA-based fine-tunes, vLLM supports serving multiple LoRA adapters from a single base model, switching adapters per-request with minimal overhead.
Quantisation for Inference
Quantisation reduces model precision from 16-bit to 8-bit or 4-bit, cutting GPU memory requirements and improving throughput. GPTQ and AWQ are the most popular post-training quantisation methods, with AWQ generally preserving slightly more quality. GGUF format (used by llama.cpp) enables CPU inference for smaller models. A quantised 7B model runs on a single GPU with 16 GB of VRAM, making deployment accessible on consumer hardware or cost-effective cloud instances.
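The storage arithmetic can be illustrated with a toy symmetric absmax scheme. Production methods like GPTQ and AWQ are calibration-aware and typically per-channel or per-group, so treat this strictly as a sketch of the core idea — floats mapped to small integers plus a scale:

```python
def quantise_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric absmax quantisation: map floats to int8 range [-127, 127]
    with a single per-tensor scale, cutting storage from 4 (or 2) bytes
    per value to 1 byte plus one shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale

def dequantise(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]
```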
Monitoring and Observability
Monitor three dimensions in production: system metrics (GPU utilisation, latency p50/p95/p99, throughput, error rates), quality metrics (user feedback scores, automated eval scores on a rolling sample), and cost metrics (tokens processed, cost per query, monthly spend). Set up alerts for latency spikes, quality drops below threshold, and budget overruns. Log all inputs and outputs (with appropriate data handling) for debugging and future training data collection.
Model Versioning and Rollback
Treat model deployments like software releases. Tag each model version with a semantic version number, the training dataset hash, and evaluation scores. Store model artefacts in a versioned registry. Implement blue-green or canary deployment strategies so you can instantly roll back to the previous version if the new model underperforms. Keep at least the two most recent production versions warm for rapid switching.
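A minimal release record tying a version to its dataset hash and evaluation score might look like this sketch — the field names are illustrative, not a standard registry schema:

```python
import hashlib, json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ModelRelease:
    version: str          # semantic version, e.g. "1.4.0"
    base_model: str
    dataset_sha256: str   # hash of the exact training file, for reproducibility
    eval_score: float     # headline metric from the evaluation pipeline

def dataset_hash(path: str) -> str:
    """Hash the training file so a release is tied to the exact data it saw."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Placeholder hash shown for illustration; use dataset_hash(...) on the real file.
release = ModelRelease("1.4.0", "llama-3-8b", "0" * 64, 0.87)
print(json.dumps(asdict(release)))  # store alongside the artefact in the registry
```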
About the Authors
This fine-tuning guide is authored by ML engineers who have trained and deployed custom models for enterprises across finance, healthcare, and technology.
AINinza AI Team
AI Solutions Architects
Our multidisciplinary team of AI engineers and solution architects share practical insights from enterprise AI deployments across industries.
Neha Sharma
Technical Writer
Technical writer at AINinza covering AI trends, implementation guides, and best practices for enterprise AI adoption.
Related Guides
Explore complementary resources to build a complete enterprise AI strategy.
End-to-end fine-tuning delivery from dataset curation to production deployment. Read Guide
When RAG is the better choice and how to combine it with fine-tuning. Read Guide
Decision framework for choosing between RAG, fine-tuning, or both. Read Guide
Ready to Fine-Tune Your LLM?
Whether you need to adapt an open-source model for your domain or optimise a proprietary model's performance, our team brings the dataset expertise, training infrastructure, and evaluation rigour you need. Let's scope your fine-tuning project together.
Talk with AINinza