How to Fine-Tune NVIDIA Nemotron: Complete Step by Step Guide
By Braincuber Team
Published on May 6, 2026
NVIDIA Nemotron-3 is NVIDIA's open model family built for reasoning, coding, chat, and agentic AI workflows. The Nano variant is designed for efficiency, making it ideal for hands-on experimentation on consumer GPUs like the RTX 3090. This complete beginner's guide covers fine-tuning Nemotron-3-Nano-4B on a psychology Q&A dataset using LoRA, TRL, and Hugging Face.
What You'll Learn:
- How to set up the environment for Nemotron-3-Nano fine-tuning
- Loading and processing datasets for TRL fine-tuning
- Configuring LoRA adapters for efficient training
- Training and saving LoRA adapters with Hugging Face
- Comparing model responses before and after fine-tuning
What is NVIDIA Nemotron-3?
NVIDIA Nemotron-3 is a family of open models that includes Nano, Super, and Ultra variants. The Nano version (4B parameters) is specifically designed for efficiency, allowing developers to fine-tune on accessible GPU setups without requiring massive compute resources.
The key update with Nemotron-3 is its hybrid architecture that combines Mamba-based components with transformer layers. This design delivers strong performance while keeping inference and fine-tuning practical for consumer hardware.
Prerequisites & Hardware Requirements
Before starting, ensure you have the following:
| Requirement | Details |
|---|---|
| GPU | NVIDIA RTX 3090 (24GB VRAM) or equivalent. Reduce batch sizes for smaller GPUs. |
| CUDA Version | CUDA 12.8 with PyTorch 2.7.1 (required for Mamba compatibility) |
| Python | Python 3.12+ recommended |
| Hugging Face Token | Set HF_TOKEN environment variable for model access |
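Before downloading the model, make sure your token is wired in. A minimal sketch of one way to do this (it assumes HF_TOKEN is already exported in your shell or notebook environment):
import os
from huggingface_hub import login

# Read the access token from the environment (set HF_TOKEN beforehand) and log in
hf_token = os.getenv("HF_TOKEN")
login(token=hf_token)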
Step by Step Fine-Tuning Guide
1. Set Up the Environment: Install the correct PyTorch stack with CUDA 12.8 support. The Mamba-related packages (mamba_ssm, causal_conv1d) require specific versions that work with this PyTorch build.
2. Load and Process the Dataset: Load the psychology Q&A dataset from Hugging Face, create train/validation/test splits, and format it for TRL fine-tuning with system prompts and chat templates.
3. Load the Nemotron-3 Base Model: Download the NVIDIA-Nemotron-3-Nano-4B-BF16 model and tokenizer from Hugging Face. Configure padding and generation settings, and disable caching for training.
4. Configure LoRA and Training: Set up a LoRA configuration targeting all linear layers with rank=32 and alpha=64. Define the SFTConfig with batch sizes, learning rate, epochs, and evaluation strategy.
5. Train and Save the Adapter: Run SFTTrainer with the LoRA configuration, monitor training/validation loss, save the best adapter locally, and push it to the Hugging Face Hub for sharing.
6. Compare Model Responses: Generate sample responses from both the base and fine-tuned models, and compare outputs to verify that fine-tuning improved alignment with the target response style.
Environment Setup
First, install the correct PyTorch stack with CUDA 12.8 and the Mamba-related packages. This step is critical: Nemotron-3 Nano's hybrid architecture depends on the Mamba kernels (mamba_ssm, causal_conv1d), which must match this specific PyTorch/CUDA build.
%%capture
!pip install -U packaging ninja
# Replace the current PyTorch stack with the CUDA 12.8 build
!pip uninstall -y torch torchvision torchaudio triton
!pip install "torch==2.7.1" "torchvision==0.22.1" "torchaudio==2.7.1" --index-url https://download.pytorch.org/whl/cu128
!pip install -U "transformers==4.56.2" tokenizers "trl==0.22.2" accelerate datasets peft pandas tqdm huggingface_hub safetensors
!pip install -U --no-build-isolation "mamba_ssm==2.2.5" "causal_conv1d==1.5.2"
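As a quick sanity check (not part of the original notebook), you can confirm the Mamba kernels import cleanly before going further:
import mamba_ssm
import causal_conv1d

# Both packages expose a version string; an import error here usually means
# the wheels were built against a different PyTorch/CUDA combination
print("mamba_ssm:", mamba_ssm.__version__)
print("causal_conv1d:", causal_conv1d.__version__)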
After installing packages, verify that CUDA is available and check your GPU specifications:
import os
import platform
import torch

print(f"Python: {platform.python_version()}")
print(f"PyTorch: {torch.__version__}")
print(f"PyTorch CUDA build: {torch.version.cuda}")
print(f"CUDA available: {torch.cuda.is_available()}")

if not torch.cuda.is_available():
    raise RuntimeError("CUDA is not available. Select a RunPod PyTorch image with GPU support.")

for idx in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(idx)
    total_gb = props.total_memory / 1024**3
    print(f"GPU {idx}: {props.name} ({total_gb:.1f} GB VRAM, capability {props.major}.{props.minor})")

if torch.cuda.get_device_properties(0).total_memory < 24 * 1024**3:
    print("Warning: this 4B LoRA notebook is tuned for GPUs with at least 24GB VRAM.")

torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True
Loading the Dataset
Load the psychology Q&A dataset from Hugging Face and create train/validation/test splits. We use the response_j column as the target answer.
from datasets import DatasetDict, load_dataset

DATASET_ID = "jkhedri/psychology-dataset"
TRAIN_LIMIT = 8000
VALIDATION_LIMIT = 800
TEST_LIMIT = 300
SEED = 42

raw_dataset = load_dataset(DATASET_ID)
raw_train = raw_dataset["train"].shuffle(seed=SEED)

# Hold out 15% of the data, then split that into validation (~10%) and test (~5%)
split_1 = raw_train.train_test_split(test_size=0.15, seed=SEED)
split_2 = split_1["test"].train_test_split(test_size=0.33, seed=SEED)

def maybe_limit(split, limit):
    if limit is None:
        return split
    return split.select(range(min(limit, len(split))))

dataset = DatasetDict({
    "train": maybe_limit(split_1["train"], TRAIN_LIMIT),
    "validation": maybe_limit(split_2["train"], VALIDATION_LIMIT),
    "test": maybe_limit(split_2["test"], TEST_LIMIT),
})
print(dataset)
Formatting for TRL Fine-Tuning
Convert the dataset into prompt-completion format with system prompts. The system prompt defines the model's behavior: be supportive, avoid hidden reasoning, and provide practical suggestions.
SYSTEM_PROMPT = """/no_think
You are a supportive psychology question-answering assistant.
Do not include hidden reasoning, thinking traces, <think> tags, or </think> tags in the final answer.
Respond with empathy, practical coping suggestions, and clear next steps.
Give a complete answer in 2-4 short paragraphs or a brief paragraph plus 3-5 practical bullets.
Do not diagnose the user or claim to replace a licensed mental health professional.
If the user may be in immediate danger or crisis, encourage contacting local emergency services or a trusted crisis hotline.
Keep the answer safe, specific, and directly relevant to the user's question without being overly brief."""

USER_TEMPLATE = "Question:\n{question}"

def clean_text(value):
    # Collapse internal whitespace and strip leading/trailing spaces
    return " ".join(str(value).strip().split())

def to_prompt_completion(example):
    question = clean_text(example["question"])
    answer = clean_text(example["response_j"])
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_TEMPLATE.format(question=question)},
        ],
        "completion": [{"role": "assistant", "content": answer}],
        # Disable thinking mode to match the /no_think system prompt
        "chat_template_kwargs": {"enable_thinking": False},
    }

sft_dataset = dataset.map(to_prompt_completion, remove_columns=dataset["train"].column_names)
print(sft_dataset["train"][0])
Loading Nemotron-3 Model
Download the NVIDIA-Nemotron-3-Nano-4B-BF16 model and configure it for training. Set padding tokens, disable caching, and configure generation settings.
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16"
OUTPUT_DIR = "./nemotron-3-nano-4b-bf16-psychology-qa-lora"
MAX_SEQ_LENGTH = 1024

hf_token = os.getenv("HF_TOKEN")  # access token from the HF_TOKEN environment variable

tokenizer = AutoTokenizer.from_pretrained(
    MODEL_ID, token=hf_token, trust_remote_code=True, use_fast=True
)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"

base_model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, token=hf_token, trust_remote_code=True,
    torch_dtype=torch.bfloat16, device_map="auto", attn_implementation="eager"
)
base_model.config.use_cache = False
base_model.config.pad_token_id = tokenizer.pad_token_id
base_model.config.eos_token_id = tokenizer.eos_token_id

# Mirror the token settings in the generation config so inference behaves consistently
base_model.generation_config.pad_token_id = tokenizer.pad_token_id
base_model.generation_config.eos_token_id = tokenizer.eos_token_id
base_model.generation_config.use_cache = False
base_model.generation_config.do_sample = False
base_model.generation_config.top_p = None
base_model.generation_config.min_new_tokens = None
base_model.generation_config.repetition_penalty = 1.08
base_model.generation_config.no_repeat_ngram_size = 4
LoRA Configuration
Configure LoRA (Low-Rank Adaptation) to efficiently fine-tune the model. LoRA adds small trainable adapters instead of updating all model parameters, reducing memory requirements.
from peft import LoraConfig

# Gradient checkpointing trades extra compute for lower activation memory
base_model.gradient_checkpointing_enable()
base_model.config.use_cache = False

lora_config = LoraConfig(
    r=32,                         # adapter rank
    lora_alpha=64,                # scaling factor (alpha/r = 2)
    lora_dropout=0.1,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules="all-linear",  # attach adapters to every linear layer
)
Training Configuration
Define the SFTConfig with training parameters. Key settings include a per-device batch size of 8 with 8 gradient-accumulation steps (an effective batch size of 64), a learning rate of 5e-5, and 2 training epochs.
from trl import SFTConfig, SFTTrainer

training_args = SFTConfig(
    output_dir=OUTPUT_DIR,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=8,  # effective batch size: 8 x 8 = 64
    learning_rate=5e-5,
    weight_decay=0.01,
    lr_scheduler_type="linear",
    warmup_ratio=0.05,
    num_train_epochs=2,
    logging_steps=50,
    eval_strategy="steps",
    eval_steps=50,
    save_strategy="steps",
    save_steps=100,
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    gradient_checkpointing=True,
    bf16=True,
    fp16=False,
    tf32=True,
    max_length=MAX_SEQ_LENGTH,
    packing=False,
    completion_only_loss=True,
    remove_unused_columns=False,
    dataloader_num_workers=4,
    optim="adamw_torch_fused",
    report_to="none",
    seed=SEED,
)
Training the Model
Create the SFTTrainer with LoRA configuration and start training. The trainer will monitor training/validation loss and save the best model.
trainer = SFTTrainer(
    model=base_model,
    args=training_args,
    train_dataset=sft_dataset["train"],
    eval_dataset=sft_dataset["validation"],
    peft_config=lora_config,
    processing_class=tokenizer,
)

trainable_params = sum(p.numel() for p in trainer.model.parameters() if p.requires_grad)
all_params = sum(p.numel() for p in trainer.model.parameters())
print(f"Trainable LoRA parameters: {trainable_params:,}")
print(f"Trainable percentage: {100 * trainable_params / all_params:.4f}%")

train_result = trainer.train()

# Switch to inference mode once training finishes
trainer.model.eval()
trainer.model.config.use_cache = False
trainer.model.generation_config.use_cache = False
Saving and Uploading
After training, save the LoRA adapter locally and upload it to Hugging Face Hub for sharing with the community.
# Save locally
trainer.model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
# Upload to Hugging Face
HUB_REPO_ID = "kingabzpro/nemotron-3-nano-4b-bf16-psychology-qa-lora"
trainer.model.push_to_hub(HUB_REPO_ID, private=False)
tokenizer.push_to_hub(HUB_REPO_ID, private=False)
print(f"Model uploaded to: https://huggingface.co/{HUB_REPO_ID}")
Key Considerations
Environment Setup
Use a clean environment to avoid Mamba package conflicts. The mamba_ssm dependency can break existing setups if not installed correctly.
Memory Requirements
4B models can run on 24GB GPUs with LoRA. For 12B+ models, memory becomes a constraint without quantization techniques.
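A rough back-of-envelope shows why a 4B model fits (illustrative numbers only; the ~2% trainable fraction is an assumption, and activation memory varies with batch size and sequence length):
# BF16 weights: 2 bytes per parameter
params = 4e9
weights_gb = params * 2 / 1024**3             # ~7.5 GB

# AdamW keeps two fp32 moment tensors, but only for the LoRA parameters
lora_fraction = 0.02                          # assumed trainable share
optimizer_gb = params * lora_fraction * 8 / 1024**3  # ~0.6 GB

print(f"weights ~= {weights_gb:.1f} GB, LoRA optimizer state ~= {optimizer_gb:.1f} GB")
# The remaining headroom goes to gradients, activations, and CUDA overhead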
Model Comparison
Fine-tuned models align more closely with the dataset's response style, but the base model may give more detailed answers in some cases.
Consumer GPU Access
Nemotron-3 Nano makes LLM fine-tuning accessible to developers with consumer GPUs like RTX 3090/4090.
Important Notes
Quantization (4-bit QLoRA) is not directly supported for Nemotron-3 Nano due to its hybrid architecture. Load the full BF16 model for LoRA fine-tuning. Also, always use a clean Python environment to avoid mamba_ssm conflicts.
Frequently Asked Questions
What GPU do I need to fine-tune Nemotron-3 Nano?
A 24GB GPU (RTX 3090/4090) is recommended. Reduce batch sizes if using GPUs with less VRAM. The notebook is tuned for 24GB but can work with less.
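For example, on a smaller card you might shrink the per-device batch and raise accumulation so the effective batch size stays at 64 (hypothetical values, set before constructing the trainer):
training_args.per_device_train_batch_size = 2
training_args.per_device_eval_batch_size = 2
training_args.gradient_accumulation_steps = 32  # 2 x 32 = 64, same effective batch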
Can I use QLoRA with 4-bit quantization?
Not directly. Nemotron-3 Nano's hybrid architecture requires loading the full BF16 model. For 4-bit training, consider other models like Qwen or Llama.
How many trainable parameters does LoRA add?
For each targeted linear layer of shape d_out x d_in, LoRA adds r x (d_in + d_out) parameters (an A matrix of r x d_in and a B matrix of d_out x r). With r=32 and target_modules="all-linear", this comes to a small percentage of the full 4B model, which is what keeps training efficient.
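A quick worked example (the layer shape is hypothetical; actual shapes depend on the model):
# LoRA parameter count for one linear layer
r = 32
d_in, d_out = 3072, 3072            # hypothetical hidden size
extra_params = r * (d_in + d_out)   # A is (r, d_in), B is (d_out, r)
print(f"{extra_params:,} adapter parameters for this layer")  # 196,608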
How do I use the fine-tuned adapter?
Load the base model, then apply the LoRA adapter using peft's PeftModel.from_pretrained(). The adapter is available on Hugging Face Hub.
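A minimal loading sketch, using the base model and adapter IDs from this guide:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "nvidia/NVIDIA-Nemotron-3-Nano-4B-BF16",
    torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True,
)
# Attach the published LoRA adapter on top of the frozen base weights
model = PeftModel.from_pretrained(base, "kingabzpro/nemotron-3-nano-4b-bf16-psychology-qa-lora")
tokenizer = AutoTokenizer.from_pretrained("kingabzpro/nemotron-3-nano-4b-bf16-psychology-qa-lora")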
Why use Nemotron-3 Nano over larger models?
Nano is efficient, runs on consumer GPUs, and delivers strong performance for its size. Ideal for experimentation and domain-specific fine-tuning.
Need Help with AI Model Fine-Tuning?
Our AI experts can help you fine-tune LLMs for your specific use case. Get started with a free consultation today.
