How to Fine-Tune NVIDIA Nemotron: Complete Step by Step Guide
By Braincuber Team
Published on May 6, 2026
NVIDIA Nemotron-3 is NVIDIA's open model family built for reasoning, coding, chat, and agentic AI workflows. The Nano variant is designed for efficiency, making it ideal for hands-on experimentation on consumer GPUs like the RTX 3090. This complete beginner's guide covers fine-tuning Nemotron-3-Nano-4B on a psychology Q&A dataset using LoRA, TRL, and Hugging Face.
What You'll Learn:
- How to set up the environment for Nemotron-3-Nano fine-tuning
- Loading and processing datasets for TRL fine-tuning
- Configuring LoRA adapters for efficient training
- Training and saving LoRA adapters with Hugging Face
- Comparing model responses before and after fine-tuning
What is NVIDIA Nemotron-3?
NVIDIA Nemotron-3 is a family of open models that includes Nano, Super, and Ultra variants. The Nano version (4B parameters) is specifically designed for efficiency, allowing developers to fine-tune on accessible GPU setups without requiring massive compute resources.
The key update with Nemotron-3 is its hybrid architecture that combines Mamba-based components with transformer layers. This design delivers strong performance while keeping inference and fine-tuning practical for consumer hardware.
Prerequisites & Hardware Requirements
Before starting, ensure you have the following:
| Requirement | Details |
|---|---|
| GPU | NVIDIA RTX 3090 (24GB VRAM) or equivalent. Reduce batch sizes for smaller GPUs. |
| CUDA Version | CUDA 12.8 with PyTorch 2.7.1 (required for Mamba compatibility) |
| Python | Python 3.12+ recommended |
| Hugging Face Token | Set HF_TOKEN environment variable for model access |
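Before downloading the model, make sure your token is wired in. A minimal sketch of one way to do this (it assumes HF_TOKEN is already exported in your shell or notebook environment):
import os
from huggingface_hub import login

# Read the access token from the environment (set HF_TOKEN beforehand) and log in
hf_token = os.getenv("HF_TOKEN")
login(token=hf_token)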
Step by Step Fine-Tuning Guide
1. Set Up the Environment: Install the correct PyTorch stack with CUDA 12.8 support. The Mamba-related packages (mamba_ssm, causal_conv1d) require specific versions that work with this PyTorch build.
2. Load and Process the Dataset: Load the psychology Q&A dataset from Hugging Face, create train/validation/test splits, and format it for TRL fine-tuning with system prompts and chat templates.
3. Load the Nemotron-3 Base Model: Download the NVIDIA-Nemotron-3-Nano-4B-BF16 model and tokenizer from Hugging Face. Configure padding and generation settings, and disable caching for training.
4. Configure LoRA and Training: Set up a LoRA configuration targeting all linear layers with rank=32 and alpha=64. Define the SFTConfig with batch sizes, learning rate, epochs, and evaluation strategy.
5. Train and Save the Adapter: Run SFTTrainer with the LoRA configuration, monitor training/validation loss, save the best adapter locally, and push it to the Hugging Face Hub for sharing.
6. Compare Model Responses: Generate sample responses from both the base and fine-tuned models, and compare outputs to verify that fine-tuning improved alignment with the target response style.
Environment Setup
First, install the correct PyTorch stack with CUDA 12.8 and the Mamba-related packages. This step is critical: Nemotron-3 Nano's hybrid architecture depends on the Mamba kernels (mamba_ssm, causal_conv1d), which must match this specific PyTorch/CUDA build.
%%capture
!pip install -U packaging ninja
# Replace the current PyTorch stack with the CUDA 12.8 build
!pip uninstall -y torch torchvision torchaudio triton
!pip install "torch==2.7.1" "torchvision==0.22.1" "torchaudio==2.7.1" --index-url https://download.pytorch.org/whl/cu128
!pip install -U "transformers==4.56.2" tokenizers "trl==0.22.2" accelerate datasets peft pandas tqdm huggingface_hub safetensors
!pip install -U --no-build-isolation "mamba_ssm==2.2.5" "causal_conv1d==1.5.2"
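As a quick sanity check (not part of the original notebook), you can confirm the Mamba kernels import cleanly before going further:
import mamba_ssm
import causal_conv1d

# Both packages expose a version string; an import error here usually means
# the wheels were built against a different PyTorch/CUDA combination
print("mamba_ssm:", mamba_ssm.__version__)
print("causal_conv1d:", causal_conv1d.__version__)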
After installing packages, verify that CUDA is available and check your GPU specifications:
import os
import platform
import torch

print(f"Python: {platform.python_version()}")
print(f"PyTorch: {torch.__version__}")
print(f"PyTorch CUDA build: {torch.version.cuda}")
print(f"CUDA available: {torch.cuda.is_available()}")

if not torch.cuda.is_available():
    raise RuntimeError("CUDA is not available. Select a RunPod PyTorch image with GPU support.")

for idx in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(idx)
    total_gb = props.total_memory / 1024**3
    print(f"GPU {idx}: {props.name} ({total_gb:.1f} GB VRAM, capability {props.major}.{props.minor})")

if torch.cuda.get_device_properties(0).total_memory < 24 * 1024**3:
    print("Warning: this 4B LoRA notebook is tuned for GPUs with at least 24GB VRAM.")

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
Loading the Dataset
Load the psychology Q&A dataset from Hugging Face and create train/validation/test splits. We use the response_j column as the target answer.
from datasets import DatasetDict, load_dataset

DATASET_ID = "jkhedri/psychology-dataset"
TRAIN_LIMIT = 8000
VALIDATION_LIMIT = 800
TEST_LIMIT = 300
SEED = 42

raw_dataset = load_dataset(DATASET_ID)
raw_train = raw_dataset["train"].shuffle(seed=SEED)

# Hold out 15% of the data, then split that into validation (~10%) and test (~5%)
split_1 = raw_train.train_test_split(test_size=0.15, seed=SEED)
split_2 = split_1["test"].train_test_split(test_size=0.33, seed=SEED)

def maybe_limit(split, limit):
    if limit is None:
        return split
    return split.select(range(min(limit, len(split))))

dataset = DatasetDict({
    "train": maybe_limit(split_1["train"], TRAIN_LIMIT),
    "validation": maybe_limit(split_2["train"], VALIDATION_LIMIT),
    "test": maybe_limit(split_2["test"], TEST_LIMIT),
})
print(dataset)
Formatting for TRL Fine-Tuning
Convert the dataset into prompt-completion format with system prompts. The system prompt defines the model's behavior: be supportive, avoid hidden reasoning, and provide practical suggestions.
SYSTEM_PROMPT = """/no_think
You are a supportive psychology question-answering assistant.
Do not include hidden reasoning, thinking traces, <think> tags, or </think> tags in the final answer.
Respond with empathy, practical coping suggestions, and clear next steps.
Give a complete answer in 2-4 short paragraphs or a brief paragraph plus 3-5 practical bullets.
Do not diagnose the user or claim to replace a licensed mental health professional.
If the user may be in immediate danger or crisis, encourage contacting local emergency services or a trusted crisis hotline.
Keep the answer safe, specific, and directly relevant to the user's question without being overly brief."""

USER_TEMPLATE = "Question:\n{question}"

def clean_text(value):
    # Collapse internal whitespace and strip leading/trailing spaces
    return " ".join(str(value).strip().split())

def to_prompt_completion(example):
    question = clean_text(example["question"])
    answer = clean_text(example["response_j"])
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_TEMPLATE.format(question=question)},
        ],
        "completion": [{"role": "assistant", "content": answer}],
        # Disable thinking mode to match the /no_think system prompt
        "chat_template_kwargs": {"enable_thinking": False},
    }

sft_dataset = dataset.map(to_prompt_completion, remove_columns=dataset["train"].column_names)
print(sft_dataset["train"][0])
Loading Nemotron-3 Model
Download the NVIDIA-Nemotron-3-Nano-4B-BF16 model and configure it for training. Set padding tokens, disable caching, and configure generation settings.
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16"
OUTPUT_DIR = "./nemotron-3-nano-4b-bf16-psychology-qa-lora"
MAX_SEQ_LENGTH = 1024

hf_token = os.getenv("HF_TOKEN")  # access token from the HF_TOKEN environment variable

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID, token=hf_token, trust_remote_code=True, use_fast=True
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, token=hf_token, trust_remote_code=True,
    torch_dtype=torch.bfloat16, device_map="auto", attn_implementation="eager"
)
base_model.config.use_cache = False
base_model.config.pad_token_id = tokenizer.pad_token_id
base_model.config.eos_token_id = tokenizer.eos_token_id

# Mirror the token settings in the generation config so inference behaves consistently
base_model.generation_config.pad_token_id = tokenizer.pad_token_id
base_model.generation_config.eos_token_id = tokenizer.eos_token_id
base_model.generation_config.use_cache = False
base_model.generation_config.do_sample = False
base_model.generation_config.top_p = None
base_model.generation_config.min_new_tokens = None
base_model.generation_config.repetition_penalty = 1.08
base_model.generation_config.no_repeat_ngram_size = 4
LoRA Configuration
Configure LoRA (Low-Rank Adaptation) to efficiently fine-tune the model. LoRA adds small trainable adapters instead of updating all model parameters, reducing memory requirements.
from peft import LoraConfig

# Gradient checkpointing trades extra compute for lower activation memory
base_model.gradient_checkpointing_enable()
base_model.config.use_cache = False

lora_config = LoraConfig(
    r=32,                         # adapter rank
    lora_alpha=64,                # scaling factor (alpha/r = 2)
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules="all-linear",  # attach adapters to every linear layer
)
Training Configuration
Define the SFTConfig with training parameters. Key settings include a per-device batch size of 8 with 8 gradient-accumulation steps (an effective batch size of 64), a learning rate of 5e-5, and 2 training epochs.
from trl import SFTConfig, SFTTrainer

training_args = SFTConfig(
    output_dir=OUTPUT_DIR,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=8,  # effective batch size: 8 x 8 = 64
    learning_rate=5e-5,
    weight_decay=0.01,
    lr_scheduler_type="linear",
    warmup_ratio=0.05,
    num_train_epochs=2,
    logging_steps=50,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    gradient_checkpointing=True,
    bf16=True,
    fp16=False,
    tf32=True,
    max_length=MAX_SEQ_LENGTH,
    packing=False,
    completion_only_loss=True,
    remove_unused_columns=False,
    dataloader_num_workers=4,
    optim="adamw_torch_fused",
    report_to="none",
    seed=SEED,
)
Training the Model
Create the SFTTrainer with LoRA configuration and start training. The trainer will monitor training/validation loss and save the best model.
trainer = SFTTrainer(
    model=base_model,
    args=training_args,
    train_dataset=sft_dataset["train"],
    eval_dataset=sft_dataset["validation"],
    peft_config=lora_config,
    processing_class=tokenizer,
)

trainable_params = sum(p.numel() for p in trainer.model.parameters() if p.requires_grad)
all_params = sum(p.numel() for p in trainer.model.parameters())
print(f"Trainable LoRA parameters: {trainable_params:,}")
print(f"Trainable percentage: {100 * trainable_params / all_params:.4f}%")

train_result = trainer.train()

# Switch to inference mode once training finishes
trainer.model.eval()
trainer.model.config.use_cache = False
trainer.model.generation_config.use_cache = False
Saving and Uploading
After training, save the LoRA adapter locally and upload it to Hugging Face Hub for sharing with the community.
# Save locally
trainer.model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
# Upload to Hugging Face
HUB_REPO_ID = "kingabzpro/nemotron-3-nano-4b-bf16-psychology-qa-lora"
trainer.model.push_to_hub(HUB_REPO_ID, private=False)
tokenizer.push_to_hub(HUB_REPO_ID, private=False)
print(f"Model uploaded to: https://huggingface.co/{HUB_REPO_ID}")
Key Considerations
Environment Setup
Use a clean environment to avoid Mamba package conflicts. The mamba_ssm dependency can break existing setups if not installed correctly.
Memory Requirements
4B models can run on 24GB GPUs with LoRA. For 12B+ models, memory becomes a constraint without quantization techniques.
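A rough back-of-envelope shows why a 4B model fits (illustrative numbers only; the ~2% trainable fraction is an assumption, and activation memory varies with batch size and sequence length):
# BF16 weights: 2 bytes per parameter
params = 4e9
weights_gb = params * 2 / 1024**3             # ~7.5 GB

# AdamW keeps two fp32 moment tensors, but only for the LoRA parameters
lora_fraction = 0.02                          # assumed trainable share
optimizer_gb = params * lora_fraction * 8 / 1024**3  # ~0.6 GB

print(f"weights ~= {weights_gb:.1f} GB, LoRA optimizer state ~= {optimizer_gb:.1f} GB")
# The remaining headroom goes to gradients, activations, and CUDA overhead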
Model Comparison
Fine-tuned models align more closely with the dataset's response style, but the base model may give more detailed answers in some cases.
Consumer GPU Access
Nemotron-3 Nano makes LLM fine-tuning accessible to developers with consumer GPUs like RTX 3090/4090.
Important Notes
Quantization (4-bit QLoRA) is not directly supported for Nemotron-3 Nano due to its hybrid architecture. Load the full BF16 model for LoRA fine-tuning. Also, always use a clean Python environment to avoid mamba_ssm conflicts.
Frequently Asked Questions
What GPU do I need to fine-tune Nemotron-3 Nano?
A 24GB GPU (RTX 3090/4090) is recommended. Reduce batch sizes if using GPUs with less VRAM. The notebook is tuned for 24GB but can work with less.
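For example, on a smaller card you might shrink the per-device batch and raise accumulation so the effective batch size stays at 64 (hypothetical values, set before constructing the trainer):
training_args.per_device_train_batch_size = 2
training_args.per_device_eval_batch_size = 2
training_args.gradient_accumulation_steps = 32  # 2 x 32 = 64, same effective batch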
Can I use QLoRA with 4-bit quantization?
Not directly. Nemotron-3 Nano's hybrid architecture requires loading the full BF16 model. For 4-bit training, consider other models like Qwen or Llama.
How many trainable parameters does LoRA add?
For each targeted linear layer of shape d_out x d_in, LoRA adds r x (d_in + d_out) parameters (an A matrix of r x d_in and a B matrix of d_out x r). With r=32 and target_modules="all-linear", this comes to a small percentage of the full 4B model, which is what keeps training efficient.
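A quick worked example (the layer shape is hypothetical; actual shapes depend on the model):
# LoRA parameter count for one linear layer
r = 32
d_in, d_out = 3072, 3072            # hypothetical hidden size
extra_params = r * (d_in + d_out)   # A is (r, d_in), B is (d_out, r)
print(f"{extra_params:,} adapter parameters for this layer")  # 196,608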
How do I use the fine-tuned adapter?
Load the base model, then apply the LoRA adapter using peft's PeftModel.from_pretrained(). The adapter is available on Hugging Face Hub.
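A minimal loading sketch, using the base model and adapter IDs from this guide:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16",
    torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
)
# Attach the published LoRA adapter on top of the frozen base weights
model = PeftModel.from_pretrained(base, "kingabzpro/nemotron-3-nano-4b-bf16-psychology-qa-lora")
tokenizer = AutoTokenizer.from_pretrained("kingabzpro/nemotron-3-nano-4b-bf16-psychology-qa-lora")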
Why use Nemotron-3 Nano over larger models?
Nano is efficient, runs on consumer GPUs, and delivers strong performance for its size. Ideal for experimentation and domain-specific fine-tuning.
Need Help with AI Model Fine-Tuning?
Our AI experts can help you fine-tune LLMs for your specific use case. Get started with a free consultation today.
