LLM Fine-Tuning Playbook: When, Why & How
A comprehensive guide to adapting large language models for your specific domain, style, and performance requirements. From dataset preparation through production deployment, everything you need to make fine-tuning work at enterprise scale.
What Is LLM Fine-Tuning?
Fine-tuning is the process of taking a pre-trained large language model and further training it on a curated dataset to specialise its behaviour for a specific task, domain, or style. Unlike pre-training — which teaches a model language from scratch using trillions of tokens — fine-tuning adjusts existing knowledge using hundreds to thousands of carefully crafted examples. The result is a model that retains broad language understanding while excelling at your particular use case.
It is important to distinguish fine-tuning from two related but different techniques. Prompt engineering adjusts model behaviour at inference time by crafting better instructions — no model weights change. RAG (retrieval-augmented generation) supplies external knowledge at query time to ground responses in specific documents. Fine-tuning changes the model itself, permanently altering how it reasons, writes, and responds. Each technique has its place, and production systems often combine all three.
When Generic Models Fall Short
General-purpose LLMs are trained on broad internet data. They write competent prose, answer general knowledge questions, and follow instructions reasonably well. But they struggle with tasks that require consistent adherence to a specific writing style, deep understanding of domain-specific terminology, structured output formats that must be followed precisely, or reasoning patterns unique to your industry. A legal AI that drafts contracts in your firm's house style, a medical AI that uses ICD-10 codes correctly, or a financial AI that follows your institution's compliance language — these require fine-tuning. Learn more in our LLM fine-tuning glossary entry.
When to Fine-Tune (Decision Framework)
Fine-tuning is powerful but not always the right choice. It requires dataset curation, GPU compute, evaluation infrastructure, and ongoing maintenance. Before committing, run through this decision framework to confirm that fine-tuning is the right investment for your use case.
Fine-Tune When...
Consistent Style or Tone
Your application requires outputs that match a specific brand voice, writing style, or formatting convention that prompting alone cannot reliably achieve. Fine-tuning bakes the style into the model weights, eliminating prompt-level workarounds.
Domain-Specific Reasoning
The model needs to understand specialised terminology, apply domain-specific logic, or follow workflows unique to your industry (legal analysis, medical coding, financial modelling). Generic models lack these capabilities even with detailed prompts.
Latency Constraints
Fine-tuning a smaller model (7B-13B parameters) to match a larger model's quality on your specific task lets you deploy faster inference with lower cost. A fine-tuned 7B model often outperforms a prompted 70B model on narrow tasks.
Data Privacy Requirements
Fine-tuning an open-source model and deploying it on-premises keeps all data within your infrastructure. No prompts or completions are sent to third-party APIs, satisfying strict data residency and sovereignty requirements.
Don't Fine-Tune When...
Knowledge Changes Frequently
If the information the model needs changes weekly or daily, fine-tuning cannot keep up. Use RAG instead — it retrieves current documents at query time without retraining. Fine-tuned knowledge is frozen at training time.
Small or Low-Quality Dataset
Fine-tuning with fewer than 200 examples or with noisy, inconsistent data is likely to degrade model performance rather than improve it. Invest in dataset quality first. If you cannot curate 500+ high-quality examples, prompt engineering is safer.
General Tasks Work Fine
If the base model with good prompting already meets your quality bar, fine-tuning adds cost and complexity without meaningful improvement. Always establish a prompted baseline before deciding to fine-tune.
No Evaluation Framework
Without clear metrics and test sets, you cannot tell if fine-tuning improved things. Build your evaluation pipeline before your training pipeline. For a detailed comparison, read our RAG vs fine-tuning guide.
Dataset Preparation
Dataset quality is the single biggest determinant of fine-tuning success. A small, clean, well-structured dataset consistently outperforms a large, noisy one. Plan to spend 60-70% of your fine-tuning project timeline on data preparation — this is where the real work happens.
Data Collection Strategies
Start by mining your existing workflows. Customer support logs, internal documentation, expert-written reports, and quality-reviewed outputs from your current AI tools are all rich sources. Structure each example as an instruction-response pair: the instruction describes what the model should do, and the response is the ideal output. For classification tasks, include the input text and the correct label. For generation tasks, provide both the prompt context and the gold-standard output.
Quality Over Quantity
Research consistently shows that 1,000 high-quality examples outperform 10,000 mediocre ones. Each example should be reviewed by a domain expert and meet these criteria: the instruction is unambiguous, the response is correct and complete, the formatting matches your target output, and there are no factual errors or inconsistencies. Deduplicate your dataset to avoid the model memorising repeated patterns at the expense of generalisation.
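As a concrete illustration, the deduplication step might look like the following minimal sketch. The `instruction`/`response` field names are assumptions for illustration, not a prescribed schema — this catches exact and near-exact repeats after light normalisation, while fuzzy near-duplicates need embedding- or MinHash-based approaches.

```python
import hashlib

def normalise(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def deduplicate(examples: list[dict]) -> list[dict]:
    """Drop examples whose (instruction, response) pair repeats after normalisation."""
    seen, unique = set(), []
    for ex in examples:
        key = hashlib.sha256(
            (normalise(ex["instruction"]) + "\x1f" + normalise(ex["response"])).encode()
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique
```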
Data Formatting
Most fine-tuning frameworks expect data in a standard chat format with system, user, and assistant messages. For instruction tuning, use the Alpaca format (instruction, input, output) or the ShareGPT format (multi-turn conversations). Consistency is critical — every example should follow the exact same template. Validate your dataset programmatically before training to catch formatting errors that would silently degrade model quality.
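A programmatic validator for chat-format JSONL might look like this sketch. It assumes an OpenAI-style `messages` list with `role`/`content` keys — adapt the checks to whichever template your framework expects:

```python
import json

ALLOWED_ROLES = {"system", "user", "assistant"}

def validate_line(line: str) -> list[str]:
    """Return a list of problems for one JSONL record (empty list = valid)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    messages = record.get("messages")
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    errors = []
    for i, msg in enumerate(messages):
        if msg.get("role") not in ALLOWED_ROLES:
            errors.append(f"message {i}: unknown role {msg.get('role')!r}")
        if not isinstance(msg.get("content"), str) or not msg["content"].strip():
            errors.append(f"message {i}: empty content")
    if not errors and messages[-1].get("role") != "assistant":
        errors.append("last message must be the assistant response")
    return errors
```

Run it over every line before training; a single malformed record can silently shift what the model learns without raising an error in the training loop.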
Synthetic Data Generation
When real examples are scarce, use a stronger model (like GPT-4 or Claude) to generate synthetic training data. Provide the model with a few real examples and ask it to create variations. This technique — sometimes called model distillation — can bootstrap a dataset from 50 real examples to 2,000 synthetic ones. Always validate synthetic data with human review; models occasionally introduce subtle errors that propagate through training.
Training Methods
The training method you choose affects GPU requirements, training time, and the quality ceiling of your fine-tuned model. The trend in the industry is toward parameter-efficient methods (LoRA, QLoRA) that deliver near-full-fine-tuning quality at a fraction of the compute cost. Understanding the trade-offs helps you make the right call for your constraints.
| Method | Parameters Trained | GPU Memory | Quality Delta | Best For |
|---|---|---|---|---|
| Full Fine-Tuning | All | High (8x A100 for 70B) | Highest | Maximum performance when budget and infrastructure allow |
| LoRA | 1-10% | Moderate (1-2x A100 for 70B) | Near-full | Production fine-tuning with limited GPU budget |
| QLoRA | 1-10% (quantised base) | Low (1x A100 for 70B) | Good | Fine-tuning large models on consumer or single-GPU hardware |
| Instruction Tuning | Varies | Varies | Task-specific | Teaching models to follow instructions in a specific format |
| RLHF / DPO | All or LoRA | High | Alignment-focused | Aligning model outputs with human preferences and safety |
Full Fine-Tuning
Full fine-tuning updates every parameter in the model. This gives the optimiser maximum flexibility to adapt the model but requires significant GPU memory — at minimum about 4 bytes per parameter just for half-precision weights and gradients, before optimiser state is counted. For a 70B parameter model, that means at least 280 GB of GPU memory, requiring multiple A100 80GB GPUs. Use full fine-tuning only when you need maximum quality and have the infrastructure to support it.
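The arithmetic behind that 280 GB figure is worth making explicit — 2 bytes per parameter for half-precision weights plus 2 bytes for half-precision gradients, with Adam-style optimiser state adding several more bytes per parameter on top. A back-of-envelope sketch:

```python
def training_memory_gb(params_billion: float, bytes_per_param: float = 4.0) -> float:
    """Rough memory floor for full fine-tuning: half-precision weights (2 bytes)
    plus half-precision gradients (2 bytes) per parameter. Optimiser state and
    activations add substantially more on top of this floor."""
    return params_billion * bytes_per_param

print(training_memory_gb(70))  # 280.0 GB floor for a 70B model
print(training_memory_gb(7))   # 28.0 GB floor for a 7B model
```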
LoRA (Low-Rank Adaptation)
LoRA freezes the original model weights and injects small trainable matrices (rank 8-64) into the attention layers. Only these matrices are updated during training, reducing trainable parameters by 90-99%. The frozen base model is loaded in full precision or half-precision, and only the adapter weights need gradient computation. This cuts memory requirements by 50-75% compared to full fine-tuning. LoRA adapters are typically 10-100 MB in size, making them easy to version, share, and swap.
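The parameter savings are easy to verify. For a single d_out × d_in weight matrix, LoRA trains only the two low-rank factors B (d_out × r) and A (r × d_in). A sketch — the 4096 dimension is illustrative, roughly the hidden size of a 7B-class model:

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA replaces the update to a d_out x d_in weight matrix with two
    low-rank factors, B (d_out x r) and A (r x d_in); only B and A train."""
    return rank * (d_in + d_out)

full = 4096 * 4096                               # one dense attention projection
lora = lora_trainable_params(4096, 4096, rank=16)
print(f"{lora / full:.2%} of the full matrix")   # ~0.78% at rank 16
```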
QLoRA (Quantised LoRA)
QLoRA takes LoRA further by quantising the frozen base model to 4-bit precision using the NormalFloat4 (NF4) data type. The LoRA adapters still train in 16-bit precision for stability, but the base model footprint shrinks by 4x. A 70B model that requires 140 GB in half-precision fits in roughly 35 GB with 4-bit quantisation, making it trainable on a single A100 80GB GPU. Quality is slightly below full LoRA but remains strong for most practical tasks.
RLHF and DPO
Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) align model outputs with human preferences. Instead of showing the model the "right answer," you present pairs of outputs and indicate which one humans preferred. RLHF trains a reward model and uses PPO to optimise the LLM against it. DPO simplifies this by directly optimising the language model on preference pairs without a separate reward model. Both are powerful for safety alignment, reducing harmful outputs, and steering the model toward preferred response patterns.
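The DPO objective for a single preference pair can be written out directly. This sketch assumes you already have summed log-probabilities of the chosen and rejected responses under both the policy being trained and the frozen reference model:

```python
import math

def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float, beta: float = 0.1) -> float:
    """DPO loss for one preference pair. Minimising it pushes the policy to
    prefer the chosen response more strongly than the reference model does;
    beta controls how far the policy may drift from the reference."""
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log(sigmoid(beta * margin))
```

In practice libraries such as Hugging Face TRL compute this over batches of tokenised pairs, but the per-pair loss is exactly this expression.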
Model Selection
Your choice of base model determines the performance ceiling, deployment options, and licensing constraints of your fine-tuned system. The open-source ecosystem has matured dramatically — models like Llama 3 and Mistral now approach proprietary model quality on many benchmarks, and their permissive licenses enable on-premises deployment.
| Model | License | Strengths |
|---|---|---|
| Llama 3 (8B / 70B) | Meta Community | Strong baseline, large community, flexible deployment |
| Mistral (7B / 8x7B) | Apache 2.0 | Excellent quality-to-size ratio, mixture-of-experts option |
| Phi-3 (3.8B / 14B) | MIT | Small footprint, strong reasoning for its size, edge deployable |
| GPT-4o / GPT-4o-mini | Proprietary API | Highest baseline quality, simple API-based fine-tuning |
| Claude (via partner) | Proprietary API | Strong instruction following, long context window |
Decision Criteria
Consider four factors when choosing a base model. First, licensing: if the model must run on your infrastructure without usage restrictions, choose an open-source model with an Apache 2.0 or MIT license. Second, size: larger models generally perform better but cost more to serve. For latency-sensitive applications, a fine-tuned 7B model often delivers better value than a prompted 70B model. Third, community support: models with active fine-tuning communities (Llama, Mistral) have more tutorials, adapters, and debugging resources. Fourth, baseline performance: always benchmark the base model on your task with prompting alone before fine-tuning — if it already scores 90% on your evaluation, the marginal gain from fine-tuning may not justify the investment.
Training Infrastructure
Fine-tuning requires GPU compute, and GPU cost and availability remain a primary constraint for most organisations. The good news is that parameter-efficient methods (LoRA, QLoRA) have dramatically lowered the hardware bar. Understanding your options helps you budget accurately and avoid over-provisioning.
GPU Requirements by Model Size
7B Models
Full fine-tuning: 1x A100 80GB (40GB is workable with an 8-bit optimiser or optimiser offloading). LoRA: 1x A100 40GB or RTX 4090 24GB. QLoRA: 1x RTX 4090 or even RTX 3090 24GB. Training time: 1-4 hours for 10K examples.
13B-30B Models
Full fine-tuning: 2-4x A100 80GB. LoRA: 1x A100 80GB. QLoRA: 1x A100 40GB or 2x RTX 4090. Training time: 4-12 hours for 10K examples.
70B+ Models
Full fine-tuning: 8x A100 80GB. LoRA: 2x A100 80GB. QLoRA: 1x A100 80GB. Training time: 8-24 hours for 10K examples. Use DeepSpeed or FSDP for distributed training.
Cloud Options
All major cloud providers offer GPU instances suitable for fine-tuning. AWS provides p4d (A100) and p5 (H100) instances. GCP offers A2 (A100) and A3 (H100) machine types. Azure has NC-series (A100) and ND-series (H100) VMs. For cost-sensitive experiments, spot/preemptible instances reduce costs by 60-70% with the risk of interruption — use checkpointing to resume training after preemption. Specialised platforms like Lambda Labs, RunPod, and Together AI often offer lower per-GPU-hour pricing than hyperscalers.
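A minimal checkpoint-and-resume pattern for preemptible instances might look like the following pure-Python sketch. A real training run would checkpoint model and optimiser state (e.g. via your training framework's save utilities) to durable storage rather than JSON on local disk:

```python
import json, os

CKPT = "checkpoint.json"  # in practice, write to durable storage (e.g. S3), not local disk

def save_checkpoint(step: int, state: dict) -> None:
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp, CKPT)  # atomic rename: a preemption never leaves a partial file

def load_checkpoint() -> tuple[int, dict]:
    if not os.path.exists(CKPT):
        return 0, {}
    with open(CKPT) as f:
        ckpt = json.load(f)
    return ckpt["step"], ckpt["state"]

# Resume from wherever the last spot interruption left off.
start_step, state = load_checkpoint()
for step in range(start_step, 100):
    state["loss"] = 1.0 / (step + 1)   # stand-in for a real training step
    if step % 10 == 0:
        save_checkpoint(step + 1, state)
```

The atomic-rename detail matters: a preemption mid-write must never corrupt the only checkpoint you have.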
MLOps Tooling
Track experiments with Weights & Biases or MLflow to log hyperparameters, training curves, and evaluation metrics across runs. Use Hugging Face's Transformers and TRL libraries as your training framework — they support LoRA, QLoRA, DPO, and RLHF out of the box. Version your datasets, model checkpoints, and adapter weights in a model registry (Hugging Face Hub, MLflow, or a custom S3 bucket with versioning). Reproducibility is critical for debugging and for satisfying audit requirements.
Evaluation & Testing
A fine-tuned model is only as good as your ability to measure its improvement over the base model. Build your evaluation framework before you start training — it defines the success criteria that determine whether a training run is worth deploying. Without rigorous evaluation, you are flying blind.
Task-Specific Metrics
Choose metrics that directly measure task performance. For text generation: BLEU, ROUGE, and BERTScore measure output similarity to reference texts. For classification: accuracy, precision, recall, and F1 score on a held-out test set. For code generation: pass@k (percentage of generated solutions that pass unit tests). For structured output: exact-match accuracy and schema validation rates. Always compare the fine-tuned model against the prompted base model on the same test set.
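Of these, pass@k is the least obvious to compute correctly: sampling k of n generations and checking naively produces a biased estimate. The standard unbiased estimator, given n generations of which c pass, is:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations passes, given c of the n pass.
    Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # fewer than k failures exist, so some sample must pass
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=20, c=4, k=5))
```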
Human Evaluation
Automated metrics capture surface-level quality but miss nuance that matters to end users. Run blind A/B evaluations where domain experts compare outputs from the base model and the fine-tuned model without knowing which is which. Score on a rubric covering correctness, style adherence, completeness, and safety. A sample of 100-200 evaluation pairs provides statistically meaningful results. Track win rates and use them to decide whether to ship the fine-tuned model.
Regression Testing
Fine-tuning can improve target-task performance while degrading general capabilities — a phenomenon known as catastrophic forgetting. Maintain a regression test suite that covers general language tasks (summarisation, Q&A, instruction following) alongside your domain-specific tests. If the fine-tuned model drops more than 5% on general benchmarks, reduce the learning rate, limit training epochs, or switch to LoRA to preserve more of the original model's capabilities.
A/B Testing in Production
Lab evaluations do not always predict production performance. Deploy the fine-tuned model alongside the base model and route a percentage of traffic to each. Measure user satisfaction (thumbs up/down, task completion rates), response latency, and cost per query. Gradually increase traffic to the fine-tuned model as confidence builds. Maintain the ability to instantly roll back to the base model if issues emerge.
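Sticky, deterministic traffic splitting is the key mechanical detail: hashing a stable user ID keeps each user on one variant across requests, which keeps A/B metrics clean and makes ramp-ups and rollbacks predictable. A sketch (the variant names are illustrative):

```python
import hashlib

def route(user_id: str, finetuned_fraction: float) -> str:
    """Deterministically assign a user to the base or fine-tuned model.
    The same user always lands in the same bucket, and raising
    finetuned_fraction only ever moves users in one direction."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return "finetuned" if bucket < finetuned_fraction * 10_000 else "base"
```

Rolling back is then a config change — set the fraction to 0.0 and every request returns to the base model on its next call.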
Production Deployment
Deploying a fine-tuned model to production is more than copying weights to a server. You need an inference stack that delivers low latency at acceptable cost, monitoring to detect quality drift, versioning to manage model updates, and rollback capability for when things go wrong.
Model Serving Frameworks
vLLM is the current standard for high-throughput LLM inference, using PagedAttention to maximise GPU utilisation. Text Generation Inference (TGI) by Hugging Face offers a production-ready HTTP API with built-in batching and streaming. NVIDIA Triton Inference Server provides the most flexibility for multi-model deployments. For LoRA-based fine-tunes, vLLM supports serving multiple LoRA adapters from a single base model, switching adapters per-request with minimal overhead.
Quantisation for Inference
Quantisation reduces model precision from 16-bit to 8-bit or 4-bit, cutting GPU memory requirements and improving throughput. GPTQ and AWQ are the most popular post-training quantisation methods, with AWQ generally preserving slightly more quality. GGUF format (used by llama.cpp) enables CPU inference for smaller models. A quantised 7B model runs on a single GPU with 16 GB of VRAM, making deployment accessible on consumer hardware or cost-effective cloud instances.
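The storage arithmetic can be illustrated with a toy symmetric absmax scheme. Production methods like GPTQ and AWQ are calibration-aware and typically per-channel or per-group, so treat this strictly as a sketch of the core idea — floats mapped to small integers plus a scale:

```python
def quantise_int8(weights: list[float]) -> tuple[list[int], float]:
    """Symmetric absmax quantisation: map floats to int8 range [-127, 127]
    with a single per-tensor scale, cutting storage from 4 (or 2) bytes
    per value to 1 byte plus one shared scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale

def dequantise(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]
```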
Monitoring and Observability
Monitor three dimensions in production: system metrics (GPU utilisation, latency p50/p95/p99, throughput, error rates), quality metrics (user feedback scores, automated eval scores on a rolling sample), and cost metrics (tokens processed, cost per query, monthly spend). Set up alerts for latency spikes, quality drops below threshold, and budget overruns. Log all inputs and outputs (with appropriate data handling) for debugging and future training data collection.
Model Versioning and Rollback
Treat model deployments like software releases. Tag each model version with a semantic version number, the training dataset hash, and evaluation scores. Store model artefacts in a versioned registry. Implement blue-green or canary deployment strategies so you can instantly roll back to the previous version if the new model underperforms. Keep at least the two most recent production versions warm for rapid switching.
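A minimal release record tying a version to its dataset hash and evaluation score might look like this sketch — the field names are illustrative, not a standard registry schema:

```python
import hashlib, json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ModelRelease:
    version: str          # semantic version, e.g. "1.4.0"
    base_model: str
    dataset_sha256: str   # hash of the exact training file, for reproducibility
    eval_score: float     # headline metric from the evaluation pipeline

def dataset_hash(path: str) -> str:
    """Hash the training file so a release is tied to the exact data it saw."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# Placeholder hash shown for illustration; use dataset_hash(...) on the real file.
release = ModelRelease("1.4.0", "llama-3-8b", "0" * 64, 0.87)
print(json.dumps(asdict(release)))  # store alongside the artefact in the registry
```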
About the Authors
This fine-tuning guide is authored by ML engineers who have trained and deployed custom models for enterprises across finance, healthcare, and technology.
AINinza AI Team
AI Solutions Architects
Our multidisciplinary team of AI engineers and solution architects share practical insights from enterprise AI deployments across industries.
Neha Sharma
Technical Writer
Technical writer at AINinza covering AI trends, implementation guides, and best practices for enterprise AI adoption.
Related Guides
Explore complementary resources to build a complete enterprise AI strategy.
End-to-end fine-tuning delivery from dataset curation to production deployment. Read Guide
When RAG is the better choice and how to combine it with fine-tuning. Read Guide
Decision framework for choosing between RAG, fine-tuning, or both. Read Guide
Ready to Fine-Tune Your LLM?
Whether you need to adapt an open-source model for your domain or optimise a proprietary model's performance, our team brings the dataset expertise, training infrastructure, and evaluation rigour you need. Let's scope your fine-tuning project together.
Talk with AINinza