AI / FINE-TUNING

Fine-tuning LLMs: LoRA, QLoRA, DPO, GRPO & Obliterated Models

A practitioner's guide to adapting large language models for production: when fine-tuning beats prompting and RAG, parameter-efficient techniques (LoRA, QLoRA, DoRA, GaLore, NEFTune, ReFT), preference alignment (DPO, GRPO, KTO, ORPO), the open-source framework landscape (Unsloth, Axolotl, Llama-Factory, TRL), managed providers, GPU sizing for H100 / RTX 5090 / RTX 4090, and the world of liberated and abliterated models.

By Jose Nobile | Updated 2026-04-27 | 32 min read

1. Fine-tune vs Prompt vs RAG

Fine-tuning is the most expensive, most powerful, and most often misused tool in the LLM toolbox. Before reaching for it, exhaust the cheaper alternatives: better prompting, few-shot examples, structured output, and Retrieval-Augmented Generation (RAG). The rule of thumb: prompt engineering changes what the model does this turn, RAG changes what the model knows, and fine-tuning changes how the model behaves. If your problem is "the model does not know X," that is a RAG problem. If it is "the model cannot follow instructions Y consistently," that is a fine-tuning problem.

Fine-tuning shines when you need: (1) consistent style, format, or persona that prompt-engineering cannot reliably enforce; (2) tool-call accuracy on a fixed schema; (3) compressed prompts to reduce per-call cost; (4) latency reduction by replacing a 70B teacher with an 8B student; (5) classification or extraction with very high precision; or (6) domain-specific reasoning patterns (medical, legal, financial). Fine-tuning is rarely the right answer for "give the model new factual knowledge" — that is what RAG does, with citations.

A 2026-era stack typically combines all three: prompt engineering at the API layer, RAG for fresh facts and citations, and a small fine-tuned model for the specialized behavior. The decision tree below helps you pick the cheapest tool first.

PROMPT

Try Prompting First

Cheapest, fastest iteration. Use system prompts, few-shot examples, structured outputs (JSON schema), and function calling. Modern models (Claude 4.7, GPT-5, Gemini 2.5) follow nuanced instructions reliably. If 5-10 well-crafted examples solve it, do not fine-tune.

RAG

Use RAG for Knowledge

If the gap is "the model does not know our docs/products/policies," build a RAG pipeline. RAG provides verifiable citations, updates instantly when docs change, and avoids hallucinated facts. See our RAG guide.

FINE-TUNE

Fine-tune for Behavior

Fine-tune when you need consistent format, persona, tone, JSON schema adherence, tool-call patterns, or domain-specific reasoning. Also useful for distilling a large teacher into a small fast student to cut latency and cost 5-20x.

DECISION MATRIX

Quick Decision Matrix

Need fresh facts: RAG. Need consistent style/format: fine-tune. Need new skill: fine-tune + RAG. Need lower cost: distill via fine-tune. Need novel behavior on private data: fine-tune. Quick prototype: prompt only.

Goal                       Tool
-------------------------  -----------
Up-to-date facts           RAG
Citations / sourcing       RAG
Consistent JSON output     Fine-tune
Brand voice / persona      Fine-tune
Tool-call reliability      Fine-tune
Domain reasoning           FT + RAG
Cost reduction             FT (distill)
One-off task               Prompt
Behavior on private data   Fine-tune
COST

Cost Comparison

Prompt: zero training cost, higher per-call. RAG: vector DB + retrieval cost, moderate per-call. Fine-tune: $50-5000 training, then 30-70% lower per-call inference (smaller model, shorter prompts). Break-even is typically 1-10M production tokens.

HYBRID

The Production Stack

Production AI systems combine all three: a fine-tuned base model handles format/tone, RAG injects fresh facts with citations, and prompt engineering tunes per-request behavior. None of these are mutually exclusive.

2. PEFT Techniques: LoRA, QLoRA, DoRA, GaLore, NEFTune, ReFT

Full fine-tuning of a 70B model updates all 70 billion weights and requires roughly 1.4TB of GPU memory for optimizer states alone — a non-starter outside of well-funded labs. Parameter-Efficient Fine-Tuning (PEFT) sidesteps this by training a tiny number of new parameters (typically 0.1-1% of the model) while keeping the base model frozen. The result: comparable quality to full fine-tuning at 1/100th the GPU cost.

LoRA (Low-Rank Adaptation) is the workhorse of modern fine-tuning. It freezes the base model and injects pairs of low-rank matrices (rank 8-128) into the attention and MLP layers. QLoRA extends this by quantizing the frozen base to 4-bit NF4, allowing a 70B model to be fine-tuned on a single 48GB GPU. DoRA decomposes weight updates into magnitude and direction for sharper adaptation. GaLore projects gradients into a low-rank subspace for full-parameter training at PEFT-like memory cost. NEFTune adds calibrated noise to embeddings for better generalization. ReFT (Representation Fine-Tuning) intervenes on hidden states rather than weights for an even more parameter-efficient alternative.

LORA

LoRA

The baseline PEFT method. Freezes base weights and trains low-rank adapters (W = W0 + BA where rank(BA) << rank(W0)). Typical rank: 8-64. Trains 0.1-1% of parameters. Adapters are 50-200MB — portable, swappable, mergeable.

# LoRA hyperparameters that matter
r = 16              # rank: 8-64 typical
lora_alpha = 32     # scaling, usually 2*r
lora_dropout = 0.05
target_modules = [
  "q_proj","k_proj","v_proj","o_proj",
  "gate_proj","up_proj","down_proj"
]  # all linear layers (best quality)
QLORA

QLoRA

LoRA + 4-bit NF4 quantization of the frozen base. Cuts VRAM by ~4x, enabling 70B fine-tuning on a single 48GB GPU and 7B-13B on consumer 24GB cards. Tiny quality hit vs FP16 LoRA. Default choice for hobbyist and small-team training.

# QLoRA config
bnb_4bit = BitsAndBytesConfig(
  load_in_4bit=True,
  bnb_4bit_quant_type="nf4",
  bnb_4bit_compute_dtype=torch.bfloat16,
  bnb_4bit_use_double_quant=True
)
DORA

DoRA (Weight-Decomposed LoRA)

Decomposes weight updates into magnitude and direction, then applies LoRA only to direction. Closes the gap with full fine-tuning at low ranks. ~10% slower than LoRA but consistently better. Set use_dora=True in PEFT.

GALORE

GaLore

Gradient Low-Rank Projection. Trains all parameters but projects gradients into a low-rank subspace, cutting optimizer memory ~65%. Enables full-parameter fine-tuning of 7B on a 24GB card. Slightly better quality ceiling than LoRA for deep adaptation.

NEFTUNE

NEFTune

Adds calibrated uniform noise to embeddings during training. A one-line change that consistently improves instruction-tuned model quality on AlpacaEval and MT-Bench. Pair with LoRA/QLoRA for free quality gains. neftune_noise_alpha=5 is a good default.

REFT

ReFT (Representation Fine-Tuning)

Stanford's 2024 method. Instead of updating weights, learns interventions on hidden representations at specific layers. Uses 15-65x fewer parameters than LoRA at comparable quality. Great for very small adapter sizes and research experimentation.

3. Preference Alignment: DPO, GRPO, KTO, ORPO

Supervised fine-tuning (SFT) teaches a model the right output for a given input. Preference alignment teaches it which of two outputs is better. This is what turns a base model into a usable assistant: it is the reason ChatGPT feels different from raw GPT-3, why Claude refuses harmful requests, and why DeepSeek R1 reasons step-by-step. The classic technique was RLHF (PPO over a reward model), but a wave of simpler offline alternatives has largely replaced it for practitioners.

DPO (Direct Preference Optimization) reformulates preference learning as a simple classification loss between chosen and rejected responses, eliminating the reward model and PPO. GRPO (Group Relative Policy Optimization), introduced by DeepSeek for R1, is the SOTA RL method as of 2026 — it samples groups of completions and uses their relative quality as the advantage signal, removing the need for a value model. KTO (Kahneman-Tversky Optimization) lets you train with binary "good/bad" labels per response instead of paired preferences. ORPO (Odds Ratio Preference Optimization) merges SFT and preference learning into a single loss, allowing direct training from a base model in one stage.

DPO

DPO (Direct Preference Optimization)

Replaces RLHF's reward model + PPO with a simple binary cross-entropy loss over chosen/rejected pairs. 10x simpler to implement, more stable, comparable or better quality. Default choice for offline alignment in 2026.

# DPO data format
{"prompt": "...", "chosen": "...", "rejected": "..."}

# beta controls KL constraint
# 0.1 = light alignment, 0.5 = heavy
GRPO

GRPO (DeepSeek R1)

The technique behind DeepSeek R1's reasoning breakthrough. Samples G completions per prompt, computes group-relative advantages, removes the value model entirely. Requires a verifiable reward (math, code, format) but produces emergent reasoning chains. Available in TRL, Unsloth, Axolotl.

KTO

KTO (Kahneman-Tversky)

Trains from unpaired binary feedback (thumbs-up / thumbs-down) instead of paired preferences. Massively easier to collect data — production telemetry directly becomes training data. Often outperforms DPO at scale.

ORPO

ORPO (Single-Stage)

Combines SFT and preference learning in one loss. No separate SFT-then-DPO pipeline — train directly from a base model to an aligned model. Cuts training time roughly in half and avoids the SFT-DPO drift problem.

SimPO / IPO

SimPO & IPO Variants

SimPO removes the reference model from DPO for further memory savings. IPO addresses DPO's overfitting on confident pairs with a regularizer. Both available in TRL. Pick DPO/SimPO for general use, IPO when your preference data is noisy.

RLHF / PPO

Classic RLHF (Legacy)

Reward model + PPO. Still used at frontier labs (OpenAI, Anthropic) for highest-quality alignment, but rarely justified for practitioners. DPO/GRPO get 90% of the benefit at 10% of the engineering cost. Keep RLHF on your radar but reach for DPO first.

4. Open-Source Frameworks

Four frameworks dominate self-hosted fine-tuning in 2026: Unsloth for fastest single-GPU training (2x speed, 70% less VRAM), Axolotl for declarative YAML-driven configs and multi-GPU production runs, Llama-Factory for the broadest model and method coverage with a Web UI, and TRL as the canonical Hugging Face library that everything else builds on. Pick Unsloth for solo experimentation on a single 4090/5090, Axolotl for team workflows with version-controlled configs, Llama-Factory for non-engineers who want a UI, and TRL when you need to write custom training loops.

UNSLOTH

Unsloth

2x faster, 70% less VRAM than vanilla HF. Hand-written Triton kernels for attention and rotary embeddings. Single-GPU only (free tier). Supports LoRA, QLoRA, DPO, GRPO, KTO, ORPO. Best for hobbyists and individuals on RTX 4090/5090.

pip install unsloth
# Loads Llama 3.1 70B in 4-bit on 48GB
from unsloth import FastLanguageModel
model, tok = FastLanguageModel.from_pretrained(
  "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
  max_seq_length=4096, load_in_4bit=True)
AXOLOTL

Axolotl

Declarative YAML configs, FSDP/DeepSpeed multi-GPU, every PEFT method, every alignment method. Production-grade with logging, eval, and HF Hub integration. The default for serious team workflows.

pip install axolotl
accelerate launch -m axolotl.cli.train \
  configs/llama3-lora.yml
LLAMA-FACTORY

Llama-Factory

200+ model architectures, all PEFT methods, all alignment algorithms, plus a Gradio Web UI for non-coders. Strong Chinese-community model coverage (Qwen, GLM, Baichuan). Excellent for exploration and teaching.

TRL

TRL (Hugging Face)

The reference library: SFTTrainer, DPOTrainer, GRPOTrainer, KTOTrainer, ORPOTrainer, RewardTrainer, PPOTrainer. Use directly when writing custom loops or building new methods. All other frameworks wrap TRL internally.

DEEPSPEED / FSDP

DeepSpeed & FSDP

Multi-GPU and multi-node training. DeepSpeed ZeRO-3 shards optimizer, gradients, and parameters; FSDP (PyTorch native) is the modern default. Both integrate cleanly with Axolotl and Llama-Factory. Required for full fine-tuning at 7B+.

TORCHTUNE

torchtune (PyTorch)

PyTorch's official native fine-tuning library. Lean, well-tested recipes for SFT, LoRA, QLoRA, DPO. Lower abstraction than Axolotl but easier to read and modify. Worth considering for teams that prefer pure PyTorch.

5. Managed Fine-tuning Providers

Managed fine-tuning trades flexibility for convenience: you upload a JSONL file, click train, and get back a deployed model with a stable API endpoint. Use managed when you do not want to operate GPUs, when your data fits the provider's format, and when the model selection covers your needs. Self-host when you need exotic methods (GRPO, ReFT), full-parameter training, custom architectures, or strict on-prem data residency.

Pricing as of 2026: OpenAI charges roughly $25/1M training tokens for GPT-4.1 mini and $3/1M for GPT-4o-nano. Together AI charges $1-3/1M training tokens for Llama 3.x. Fireworks offers serverless LoRA hosting (you train, they serve). OpenPipe and Predibase specialize in production fine-tuning with built-in distillation and observability.

OPENAI

OpenAI Fine-tuning

Fine-tune GPT-4.1, GPT-4.1 mini/nano, GPT-4o, GPT-4o-mini and o-series via API or dashboard. Supports SFT, DPO, and reinforcement fine-tuning (RFT) with grader functions for o-series. JSONL chat format. Inference at fine-tuned tier pricing.

ANTHROPIC

Anthropic (Claude)

Claude fine-tuning is available on AWS Bedrock for Claude Haiku models — Anthropic does not offer first-party fine-tuning on api.anthropic.com. Most Claude customization happens via long system prompts, prompt caching, and Tool Use rather than weight fine-tuning.

TOGETHER AI

Together AI

Fine-tune any open model on the Together catalog: Llama 3.x, Qwen 2.5/3, Mistral, DeepSeek. SFT and DPO. Cheap (~$1-3/1M tokens), fast turnaround, OpenAI-compatible inference endpoints. Best price/feature ratio for open-weight fine-tuning.

FIREWORKS

Fireworks AI

Serverless LoRA: deploy adapters with no per-GPU base cost — you pay only for inference. Multi-LoRA serving lets you host 100+ adapters on a single base model. Excellent for SaaS apps with per-customer fine-tuning.

OPENPIPE

OpenPipe

Specializes in distilling GPT-4 / Claude calls into small fine-tuned models. Logs production traffic, builds training set, fine-tunes, deploys. Drop-in OpenAI-compatible. Reduces inference cost 5-30x for high-volume specialized tasks.

PREDIBASE

Predibase

Enterprise platform built around LoRAX (multi-LoRA inference) and Ludwig. SOC 2, VPC deploy, SQL-driven training data, automatic adapter routing. Strong choice for regulated industries.

6. GPU Requirements & Sizing

GPU sizing for fine-tuning depends on three variables: model size, training method (full vs LoRA vs QLoRA), and sequence length. Full fine-tuning needs roughly 16-20 bytes per parameter for weights + optimizer (Adam) + gradients in mixed precision. LoRA needs ~2 bytes/param for the frozen base plus a few hundred MB for adapters. QLoRA needs ~0.5 bytes/param for the 4-bit base.

As of 2026 the practical landscape: RTX 4090 (24GB) handles QLoRA on 7-13B models comfortably. RTX 5090 (32GB) handles QLoRA on 30B and full fine-tunes 7B with 8K context. H100 / H200 (80/141GB) handle QLoRA on 70-100B models and full fine-tunes 13-30B with FSDP. B200 (192GB) and clusters of H100s are needed for 70B+ full fine-tuning. AMD MI300X (192GB) is increasingly viable on ROCm 6.

RTX 4090

RTX 4090 (24GB)

QLoRA on 7B-13B at 4-8K context. LoRA on 7B at FP16. Full FT on 1-3B models. Cheapest serious fine-tuning rig — under $2,000 used. Pair with Unsloth for best performance.

# Comfort zones on 24GB
# QLoRA Llama 3.1 8B  : 8K ctx, batch 4
# QLoRA Llama 3.1 13B : 4K ctx, batch 2
# QLoRA Qwen 3 32B    : 2K ctx, batch 1 (tight)
# Full FT Llama 3.2 1B: 4K ctx, batch 8
RTX 5090

RTX 5090 (32GB)

Blackwell architecture, 32GB GDDR7, FP4/FP8 native support. ~30-40% faster than 4090 on training workloads. Fits QLoRA 30B models comfortably and full FT 7B with 8K context. The new sweet spot for solo practitioners in 2026.

H100 / H200

H100 / H200 (80GB / 141GB)

Datacenter standard. Hopper FP8 cuts memory and speeds training ~2x vs FP16. H100 SXM 80GB handles QLoRA 70B and full FT 13B. H200 with 141GB HBM3e handles QLoRA 100B+ and full FT 30B comfortably. Rent from RunPod, Lambda, Vast at $2-4/hr.

B200 / GB200

B200 / GB200

Blackwell datacenter. 192GB HBM3e per GPU, FP4 native. A single B200 fits a 70B model in BF16 for full fine-tuning. GB200 NVL72 racks (72 B200s) are the platform for frontier-scale training. Available 2025-2026 on AWS, GCP, CoreWeave.

MI300X (AMD)

AMD MI300X (192GB)

192GB HBM3 per GPU on ROCm 6. Holds an entire 70B model in FP16 for single-GPU full fine-tuning. Axolotl and Unsloth both have ROCm support. Cheaper per-GB than H100 on TensorWave and Hot Aisle clouds.

VRAM CHEAT SHEET

Quick VRAM Estimator

Rules of thumb (8K context, batch 1):

QLoRA   : ~param_B * 0.7 GB
LoRA    : ~param_B * 2.5 GB
Full FT : ~param_B * 16-20 GB

# Examples (QLoRA)
Llama 3.1 8B  -> ~6 GB (fits 4090)
Llama 3.1 70B -> ~50 GB (fits H100)
Llama 3.1 405B-> ~280 GB (4x H100)

7. Liberated & Abliterated Models

A subculture of open-weight fine-tuning focuses on removing refusal behaviors from instruction-tuned models — producing what the community calls "uncensored," "liberated," or "abliterated" variants. The legitimate use cases are real: red-teaming, safety research, fiction and creative writing, classification of harmful content (which requires the model to recognize it), and unrestricted research access. The ethical line is that you remain responsible for what you generate: jailbreaking does not absolve misuse.

Three names dominate the space. Pliny the Liberator (L1B3RT4S project) publishes jailbreak prompts and methodologies for nearly every frontier model within hours of release. Maxime Labonne (mlabonne) popularized "abliteration" — a weight-orthogonalization technique that surgically removes the refusal direction from a model's residual stream without retraining, producing a model that retains capability but no longer refuses. Eric Hartford (cognitivecomputations) publishes the Dolphin series: SFT-fine-tuned models on uncensored datasets, from Dolphin-Mistral to Dolphin 3.0 Llama.

ABLITERATION

Abliteration (mlabonne)

A 2024 technique by Maxime Labonne adapting research from Arditi et al. Computes the "refusal direction" by contrasting harmful vs harmless prompt activations, then orthogonalizes every weight matrix against that direction. No retraining, no dataset, ~30 minutes on a single GPU. Result: same capability, no refusals.

# Conceptual abliteration pipeline
# 1. Run harmful + harmless prompts
# 2. Take diff of mean residual streams
# 3. SVD -> refusal direction r
# 4. For each W: W = W - r r^T W
# 5. Save modified model
DOLPHIN

Dolphin (Eric Hartford)

SFT-fine-tuned models on a curated uncensored dataset. Dolphin-3.0 covers Llama 3.1, Qwen 2.5, and Mistral. Sister project: Samantha (companionship-focused). Hosted on Hugging Face under cognitivecomputations. Widely used for agent frameworks needing unrestricted tool use.

L1B3RT4S

Pliny the Liberator (L1B3RT4S)

A GitHub repository of jailbreak prompts and red-team techniques for Claude, GPT, Gemini, Grok, DeepSeek, and Llama. Useful for safety research and prompt-injection defense work. Updated within hours of major model launches.

USE CASES

Legitimate Use Cases

Red-teaming and safety research, fiction writing with mature themes, classifiers that must recognize harmful content, jurisdictionally appropriate content (legal cannabis, security research), and unrestricted academic research. Always log access and apply your own use-case-specific policy layer.

RESPONSIBILITY

Responsibility & Risk

Liberated models do not absolve users of legal or ethical responsibility for outputs. Most jurisdictions hold the operator accountable for the system, not the model. Treat liberated models as raw materials — wrap with policy filters, audit trails, and clear use-case constraints in production.

DETECTION

Reapplying Safety

For production use of an abliterated base, apply a separate safety layer: input/output classifiers (Llama Guard 3, ShieldGemma 2), policy prompts, and tool-call allow-lists. This gives you fine-grained control over what is allowed for your specific use case rather than the model's blanket refusals.

8. Code Examples: Unsloth + Axolotl

The two examples below cover the most common 2026 fine-tuning workflows: a single-GPU Unsloth QLoRA + DPO pipeline (runs on a 4090/5090) and an Axolotl YAML config for multi-GPU production training with FSDP. Both produce a portable LoRA adapter you can merge into a base model, deploy via vLLM or Ollama, or host as a multi-tenant adapter on Fireworks/Predibase.

UNSLOTH

Unsloth: QLoRA SFT on Llama 3.1 8B

End-to-end script that loads a 4-bit Llama 3.1 8B, attaches LoRA adapters, fine-tunes on a chat dataset, and saves a portable adapter. Runs on a single RTX 4090 in ~30 minutes for 1K samples.

from unsloth import FastLanguageModel, is_bfloat16_supported
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    max_seq_length=4096,
    load_in_4bit=True,
)

model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj","k_proj","v_proj","o_proj",
                    "gate_proj","up_proj","down_proj"],
    use_gradient_checkpointing="unsloth",
)

dataset = load_dataset("yahma/alpaca-cleaned", split="train")

trainer = SFTTrainer(
    model=model, tokenizer=tokenizer,
    train_dataset=dataset, dataset_text_field="text",
    max_seq_length=4096,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10, num_train_epochs=1,
        learning_rate=2e-4,
        bf16=is_bfloat16_supported(),
        logging_steps=10, optim="adamw_8bit",
        seed=42, output_dir="outputs",
    ),
)
trainer.train()
model.save_pretrained("llama3-alpaca-lora")
UNSLOTH DPO

Unsloth: DPO Alignment

Run DPO on top of an SFT checkpoint to align with preference pairs. Same VRAM footprint as SFT; just swap SFTTrainer for DPOTrainer.

from trl import DPOTrainer, DPOConfig
# dataset has columns: prompt, chosen, rejected
dpo_trainer = DPOTrainer(
    model=model, ref_model=None,
    tokenizer=tokenizer,
    train_dataset=pref_dataset,
    args=DPOConfig(
        beta=0.1, learning_rate=5e-6,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        bf16=True, output_dir="dpo-out",
    ),
)
dpo_trainer.train()
AXOLOTL YAML

Axolotl: Multi-GPU LoRA YAML

Declarative config for FSDP multi-GPU LoRA training. Run with accelerate launch -m axolotl.cli.train llama3-lora.yml.

base_model: meta-llama/Meta-Llama-3.1-8B-Instruct
model_type: LlamaForCausalLM
tokenizer_type: AutoTokenizer

load_in_4bit: true
adapter: qlora
lora_r: 32
lora_alpha: 64
lora_dropout: 0.05
lora_target_linear: true

datasets:
  - path: tatsu-lab/alpaca
    type: alpaca
sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

gradient_checkpointing: true
gradient_accumulation_steps: 4
micro_batch_size: 2
num_epochs: 3
optimizer: adamw_torch
learning_rate: 2e-4
lr_scheduler: cosine
warmup_ratio: 0.05
bf16: auto
flash_attention: true

fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_sync_module_states: true
  fsdp_offload_params: false
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer

output_dir: ./outputs/llama3-lora
hub_model_id: yourname/llama3-lora-alpaca
GRPO

GRPO: DeepSeek-Style Reasoning

Train reasoning behavior with verifiable rewards (e.g. math correctness). Available in TRL 0.12+ and Unsloth.

from trl import GRPOTrainer, GRPOConfig

def reward_fn(completions, **kw):
    # Verifiable reward: 1.0 if final answer matches
    return [1.0 if extract_answer(c)==truth else 0.0
            for c,truth in zip(completions, kw["truth"])]

trainer = GRPOTrainer(
    model=model, tokenizer=tokenizer,
    reward_funcs=[reward_fn],
    train_dataset=math_dataset,
    args=GRPOConfig(
        num_generations=8,           # G in GRPO
        max_completion_length=2048,
        learning_rate=1e-6,
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        output_dir="grpo-out",
    ),
)
trainer.train()
DEPLOY

Deploying the Adapter

After training, you can serve the LoRA adapter via vLLM (production), Ollama (local), or merge into the base for a single artifact. See our Ollama guide for local serving and the AI Inference guide for vLLM/TGI.

# Merge LoRA into base
from peft import PeftModel
merged = PeftModel.from_pretrained(base, "lora-out").merge_and_unload()
merged.save_pretrained("merged-model")

# Serve with vLLM
vllm serve merged-model --tensor-parallel-size 2

# Or convert to GGUF for Ollama
python llama.cpp/convert-hf-to-gguf.py merged-model
ollama create my-model -f Modelfile
EVAL

Evaluation Checklist

Always evaluate on held-out data with both task metrics (accuracy, BLEU, exec-pass) and capability regressions (MMLU, IFEval, HumanEval). lm-evaluation-harness is the standard. Watch for catastrophic forgetting: a fine-tune that wins on your task but tanks general capability is rarely worth it.

Related Technologies