How to Fine-Tune Gemma 4: Complete Step-by-Step Guide
By Braincuber Team
Published on April 20, 2026
Google has introduced Gemma 4, describing it as its most intelligent open model family so far, built for strong reasoning and agentic workflows. Gemma models are designed to be flexible across environments, with official support and tooling for local development, cloud deployment, and model customization, which makes them a strong choice for fine-tuning projects.
In this complete tutorial, we will fine-tune Gemma 4 E4B-it on a human emotion classification dataset from Hugging Face. We will set up a 3090 GPU environment, load and inspect the dataset, prepare and format the data for supervised fine-tuning, load the base model, run baseline evaluation before training, fine-tune the model, and then evaluate its performance again after training.
What You'll Learn:
- Setting up a RunPod environment with 3090 GPU
- Loading and preparing the emotion classification dataset
- Formatting data for Gemma 4 fine-tuning with chat templates
- Loading Gemma 4 with 4-bit quantization
- Evaluating baseline model performance
- Fine-tuning with LoRA adapters
- Evaluating and comparing post-fine-tuning results
1. Setting Up the Environment
Start by launching a new RunPod instance, and make sure your account has at least $5 in credit before you begin. For this tutorial, choose a 3090 GPU pod and select the latest PyTorch template.
Before deploying, open the template settings and make a few updates. Increase both the container disk and volume disk to 40 GB so you have enough space for the model, dataset, cached files, and training checkpoints.
You should also add your Hugging Face token as an environment variable. You can generate this token from Settings > Access Tokens in your Hugging Face account.
Deploy RunPod Instance
Choose a 3090 GPU pod with the latest PyTorch template and deploy.
Hugging Face Token Required
You need a Hugging Face token with access to the gated gemma-4-E4B-it model. Generate one from Settings > Access Tokens.
Once these settings are in place, go ahead and deploy the pod. It may take a minute or two for the instance to start. After it is ready, open the JupyterLab interface so you can begin working inside the environment.
The first thing to do in JupyterLab is create a new Python notebook and install all the required Python packages. Run the following command in a notebook cell:
%%capture
!pip install -U transformers accelerate datasets trl peft bitsandbytes scikit-learn huggingface_hub
These packages will cover the full workflow, including loading the dataset, preparing the model, fine-tuning, and evaluation.
The last step is to sign in to the Hugging Face Hub using your saved token. This gives you access to the gated model and also makes it easier to upload files, create repositories, and push your fine-tuned model later.
import os
from huggingface_hub import login
hf_token = os.environ.get("HF_TOKEN")
if not hf_token:
    raise ValueError("Set HF_TOKEN in the RunPod environment before running this notebook.")
login(token=hf_token)
print("Logged in to Hugging Face.")
2. Load and Prepare the Emotion Dataset
Now that the environment is ready, the next step is to load the emotion dataset from Hugging Face and prepare smaller splits for training and evaluation.
For this tutorial, we are not using the full dataset. Instead, we create limited train, validation, and test splits so the fine-tuning process stays faster and easier to run on a single GPU.
from datasets import load_dataset, DatasetDict
TRAIN_LIMIT = 4000
VALIDATION_LIMIT = 400
TEST_LIMIT = 400
EVAL_LIMIT = 400
raw_dataset = load_dataset("dair-ai/emotion")
def maybe_limit(split, limit):
    split = split.shuffle(seed=42)
    if limit is None:
        return split
    return split.select(range(min(limit, len(split))))

dataset = DatasetDict({
    "train": maybe_limit(raw_dataset["train"], TRAIN_LIMIT),
    "validation": maybe_limit(raw_dataset["validation"], VALIDATION_LIMIT),
    "test": maybe_limit(raw_dataset["test"], TEST_LIMIT),
})
dataset
The final dataset contains 4,000 training examples, 400 validation examples, and 400 test examples.
| Split | Examples |
|---|---|
| Train | 4,000 |
| Validation | 400 |
| Test | 400 |
Next, we look at the label names stored in the dataset. These are the emotion classes the model will learn to predict.
label_names = dataset["train"].features["label"].names
label_names
This shows that the task has six emotion categories: sadness, joy, love, anger, fear, and surprise.
| Label ID | Emotion |
|---|---|
| 0 | sadness |
| 1 | joy |
| 2 | love |
| 3 | anger |
| 4 | fear |
| 5 | surprise |
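Before formatting anything, it is worth checking how balanced these classes are, since dair-ai/emotion tends to be skewed toward joy and sadness. A minimal sketch with `collections.Counter` (the label IDs below are toy stand-ins; in the notebook you would pass `dataset["train"]["label"]`):

```python
from collections import Counter

label_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Toy stand-in for dataset["train"]["label"]; replace with the real column
example_label_ids = [0, 1, 1, 3, 1, 0, 4, 1, 2, 0]

counts = Counter(label_names[i] for i in example_label_ids)
print(counts.most_common())  # most frequent classes first
```

If the real split is heavily imbalanced, macro F1 (used later in evaluation) will be the more honest metric, since it weights every class equally.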
3. Formatting Data for Gemma 4 Fine-Tuning
Before we can fine-tune the model, we need to convert the dataset into the format Gemma 4 will use during training.
Instead of passing only raw text and labels, we structure each example as a short chat interaction with a system message, a user message, and the expected assistant response.
The system prompt tells the model exactly what task it should perform. In this case, we want the model to act as an emotion classification assistant and return only one of the six allowed labels.
SYSTEM_PROMPT = """You are an emotion classification assistant.
Read the user's text and answer with exactly one label.
Only choose from: sadness, joy, love, anger, fear, surprise.
Return only the label and nothing else."""
Now we create a function to format the data into the prompt-completion format required for supervised fine-tuning:
def to_prompt_completion(example):
    text = example["text"]
    label = label_names[example["label"]]
    return {
        "prompt": [
            {
                "role": "system",
                "content": SYSTEM_PROMPT,
            },
            {
                "role": "user",
                "content": f"Classify the emotion of this text:\n{text}",
            },
        ],
        "completion": [
            {
                "role": "assistant",
                "content": label,
            }
        ],
    }
sft_dataset = dataset.map(to_prompt_completion, remove_columns=dataset["train"].column_names)
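It is worth spot-checking one mapped example (for instance `sft_dataset["train"][0]`) against the prompt-completion schema TRL expects: a list of chat messages under `prompt` and the target assistant turn under `completion`. A standalone sketch of that expected shape, with an abbreviated system prompt:

```python
label_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]
example = {"text": "i feel so happy today", "label": 1}
system_prompt = "You are an emotion classification assistant."  # abbreviated

formatted = {
    "prompt": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Classify the emotion of this text:\n{example['text']}"},
    ],
    "completion": [
        {"role": "assistant", "content": label_names[example["label"]]},
    ],
}

# TRL's prompt-completion format: chat message lists under these two keys
assert set(formatted) == {"prompt", "completion"}
assert formatted["completion"][0]["content"] == "joy"
```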
4. Load Gemma 4 E4B-it With 4-Bit Quantization
Now we can load Gemma 4 E4B-it and prepare it for fine-tuning. Since this is a relatively large model, we load it with 4-bit quantization to reduce memory usage and make it easier to run on a 3090 GPU. We also use bfloat16 as the compute type, which helps keep the setup efficient.
4-Bit Quantization
Reduces model size by 4x, enabling large models to run on consumer GPUs with minimal accuracy loss.
bfloat16 Precision
Modern floating point format that balances precision and performance for deep learning training.
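A quick back-of-envelope calculation shows why quantization matters on a 24 GB card. The 4B parameter count below is illustrative, not the exact Gemma 4 E4B size:

```python
# Rough weight-memory estimates for a hypothetical 4B-parameter model
params = 4e9

bytes_bf16 = params * 2    # 16-bit weights: 2 bytes each
bytes_nf4 = params * 0.5   # 4-bit NF4 weights: ~0.5 bytes each,
                           # ignoring quantization constants and overhead

print(f"bf16: {bytes_bf16 / 1e9:.1f} GB")  # 8.0 GB
print(f"nf4:  {bytes_nf4 / 1e9:.1f} GB")   # 2.0 GB
```

The weights are only part of the picture (activations, optimizer state, and KV cache add more), but a 4x reduction in weight memory is what makes this model comfortable on a 3090.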
We start by importing the required libraries and defining the main model settings:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
MODEL_ID = "google/gemma-4-E4B-it"
MODEL_DTYPE = torch.bfloat16
USE_4BIT = True
Next, we enable CUDA optimizations and load the tokenizer:
if torch.cuda.is_available():
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

processor = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
if processor.pad_token is None:
    processor.pad_token = processor.eos_token
Now we prepare the quantization settings and model loading arguments:
bnb_config = None
model_kwargs = {
    "device_map": "auto",
}

if USE_4BIT:
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=MODEL_DTYPE,
    )
    model_kwargs["quantization_config"] = bnb_config
else:
    model_kwargs["torch_dtype"] = MODEL_DTYPE
Finally, we load the model and align its configuration with the tokenizer:
base_model = AutoModelForCausalLM.from_pretrained(MODEL_ID, **model_kwargs)
base_model.config.use_cache = False
base_model.config.pad_token_id = processor.pad_token_id
base_model.config.bos_token_id = processor.bos_token_id
base_model.config.eos_token_id = processor.eos_token_id
base_model.generation_config.pad_token_id = processor.pad_token_id
base_model.generation_config.bos_token_id = processor.bos_token_id
base_model.generation_config.eos_token_id = processor.eos_token_id
print(f"Base model loaded with 4-bit={USE_4BIT} and dtype={MODEL_DTYPE}.")
5. Evaluate the Base Model
Before fine-tuning, it is useful to evaluate the base model first so we have a clear baseline to compare against later.
In this section, we define a few helper functions that generate predictions, extract valid emotion labels, and run evaluation on the test split.
import re
LABEL_PATTERN = re.compile(r"(sadness|joy|love|anger|fear|surprise)", re.IGNORECASE)
def extract_label(raw_text: str) -> str:
    raw_text = raw_text.strip().lower()
    match = LABEL_PATTERN.search(raw_text)
    if match:
        return match.group(1)
    tokens = raw_text.split()
    return tokens[0].strip(".,!?:;'()[]{}\"") if tokens else ""

def generate_label(model, processor, user_text, system_prompt, max_new_tokens=4):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Classify the emotion of this text:\n{user_text}"},
    ]
    device = next(model.parameters()).device
    inputs = processor.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True,
        return_dict=True, return_tensors="pt",
    ).to(device)
    input_len = inputs["input_ids"].shape[-1]
    with torch.no_grad():
        outputs = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False,
            pad_token_id=processor.pad_token_id, eos_token_id=processor.eos_token_id,
        )
    raw_pred = processor.decode(outputs[0][input_len:], skip_special_tokens=True).strip()
    return extract_label(raw_pred)
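Before running the full loop, you can sanity-check the extraction logic on a few synthetic model outputs. The function is re-defined here so the snippet stands alone:

```python
import re

LABEL_PATTERN = re.compile(r"(sadness|joy|love|anger|fear|surprise)", re.IGNORECASE)

def extract_label(raw_text: str) -> str:
    # Prefer a regex hit anywhere in the output; fall back to the first token
    raw_text = raw_text.strip().lower()
    match = LABEL_PATTERN.search(raw_text)
    if match:
        return match.group(1)
    tokens = raw_text.split()
    return tokens[0].strip(".,!?:;'()[]{}\"") if tokens else ""

assert extract_label("Joy!") == "joy"
assert extract_label("The emotion is sadness.") == "sadness"
assert extract_label("") == ""
assert extract_label("neutral") == "neutral"  # not a valid label; caught later
```

The last case matters: `extract_label` can still return an off-list word, which is why the evaluation below maps anything outside the six labels to "INVALID" instead of trusting the raw string.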
Now we run the baseline evaluation on the test split:
from sklearn.metrics import accuracy_score, f1_score
import pandas as pd
from tqdm.auto import tqdm
VALID_LABELS = set(label_names)
def evaluate_model(model, processor, split="test", limit=EVAL_LIMIT):
    y_true, y_pred, rows = [], [], []
    raw_source = dataset[split]
    if limit is not None:
        raw_source = raw_source.select(range(min(limit, len(raw_source))))
    model.eval()
    for ex in tqdm(raw_source, desc=f"Evaluating {split}", leave=False):
        true_label = label_names[ex["label"]]
        raw_pred_label = generate_label(model, processor, ex["text"], SYSTEM_PROMPT)
        pred_label = raw_pred_label if raw_pred_label in VALID_LABELS else "INVALID"
        y_true.append(true_label)
        y_pred.append(pred_label)
        rows.append({"text": ex["text"], "true_label": true_label, "pred_label": pred_label, "correct": true_label == pred_label})
    metrics = {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, labels=label_names, average="macro", zero_division=0),
        "invalid_predictions": sum(1 for p in y_pred if p == "INVALID"),
        "evaluated_examples": len(y_true),
    }
    df = pd.DataFrame(rows)
    return metrics, df
pre_metrics, pre_preds = evaluate_model(base_model, processor, "test")
print(pre_metrics)
These baseline results show that the untuned model already performs reasonably well, but there is still room for improvement. The accuracy is around 58.25%, the macro F1 score is around 0.42, and the model produced 33 invalid predictions.
| Metric | Pre-Fine-Tuning |
|---|---|
| Accuracy | 58.25% |
| Macro F1 | 0.42 |
| Invalid Predictions | 33 |
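Beyond the headline numbers, it helps to see which classes the errors concentrate in; minority classes like love and surprise usually suffer most, which is why macro F1 lags well behind accuracy here. A minimal per-class tally (toy prediction pairs below; in the notebook you would iterate over the rows of pre_preds instead):

```python
from collections import defaultdict

# Toy (true_label, pred_label) pairs standing in for pre_preds rows
pairs = [
    ("joy", "joy"), ("joy", "love"), ("sadness", "sadness"),
    ("love", "joy"), ("fear", "fear"), ("anger", "INVALID"),
]

per_class = defaultdict(lambda: {"correct": 0, "total": 0})
for true, pred in pairs:
    per_class[true]["total"] += 1
    per_class[true]["correct"] += int(true == pred)

for label, c in sorted(per_class.items()):
    print(f"{label}: {c['correct']}/{c['total']}")
```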
6. Fine-Tune Gemma 4 With LoRA
Now that we have the baseline results, we can fine-tune Gemma 4 using LoRA. LoRA is a parameter-efficient fine-tuning method, which means we do not update the full model. Instead, we attach a small number of trainable adapter weights on top of the base model. This makes training much lighter and more practical on a single GPU.
Configure LoRA
Define LoRA settings including rank 16, alpha 32, and dropout 0.05.
Set Up Trainer
Configure training arguments including batch size, learning rate, and epochs.
Train and Save
Run training for one epoch and save the fine-tuned adapter.
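Conceptually, LoRA replaces each adapted weight matrix W with W + (alpha/r) · B·A, where only the low-rank factors A and B are trained. A tiny sketch of that update (illustrative shapes, not the real model dimensions):

```python
import torch

out_dim, in_dim = 8, 8
r, alpha = 2, 4                       # the real run uses r=16, alpha=32

W = torch.randn(out_dim, in_dim)      # frozen base weight
A = torch.randn(r, in_dim)            # trainable low-rank factor
B = torch.zeros(out_dim, r)           # trainable, zero-initialized

# B starts at zero, so the adapter initially adds nothing and the
# model begins training behaving exactly like the base model.
W_adapted = W + (alpha / r) * (B @ A)
print(torch.allclose(W_adapted, W))   # True before any training

# Trainable parameters: r*(in_dim + out_dim) vs in_dim*out_dim for full FT
print(A.numel() + B.numel(), W.numel())  # 32 64
```

At rank 16 on a multi-billion-parameter model, this ratio is what keeps the trainable parameter count in the tens of millions rather than billions.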
We start by defining the LoRA configuration:
from peft import LoraConfig
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules="all-linear",
)
Next, we define the training configuration and set up the trainer:
from trl import SFTConfig, SFTTrainer
training_args = SFTConfig(
    output_dir="./gemma4-emotion-lora",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=1e-4,
    weight_decay=0.01,
    lr_scheduler_type="linear",
    warmup_steps=50,
    num_train_epochs=1,
    logging_steps=50,
    eval_strategy="steps",
    gradient_checkpointing=True,
    bf16=True,
    fp16=False,
    tf32=True,
    max_length=256,
    packing=False,
    completion_only_loss=True,
    remove_unused_columns=False,
    dataloader_num_workers=2,
    optim="paged_adamw_8bit",
    report_to="none",
)
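With these numbers, the effective batch size is per_device_train_batch_size × gradient_accumulation_steps, which in turn determines how many optimizer steps one epoch takes on the 4,000-example train split:

```python
per_device_bs = 8
grad_accum = 2
train_examples = 4000

effective_bs = per_device_bs * grad_accum          # 16 examples per optimizer step
steps_per_epoch = train_examples // effective_bs   # 250 steps in one epoch

print(effective_bs, steps_per_epoch)  # 16 250
```

So warmup_steps=50 covers the first 20% of this one-epoch run; if you later train on the full dataset or for more epochs, scale it accordingly.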
Now we initialize the trainer with the LoRA configuration:
from peft import PeftModel
if isinstance(base_model, PeftModel):
    base_model = base_model.unload()

base_model.config.use_cache = False

trainer = SFTTrainer(
    model=base_model,
    train_dataset=sft_dataset["train"],
    eval_dataset=sft_dataset["validation"],
    peft_config=lora_config,
    args=training_args,
    processing_class=processor,
)
Now we can start training. The training typically takes about 9 minutes on a 3090 GPU:
trainable_params = 0
for param in trainer.model.parameters():
    if param.requires_grad:
        trainable_params += param.numel()
print(f"Trainable LoRA parameters: {trainable_params:,}")
train_result = trainer.train()
trainer.model.eval()
trainer.model.config.use_cache = True
Once training is complete, we can save the adapter and tokenizer locally:
trainer.model.save_pretrained("./gemma4-emotion-lora")
processor.save_pretrained("./gemma4-emotion-lora")
Finally, we can push the model to the Hugging Face Hub:
repo_id = "your-username/gemma4-emotion-lora"
trainer.model.push_to_hub(repo_id, private=False)
processor.push_to_hub(repo_id, private=False)
7. Evaluate the Fine-Tuned Model
Now that training is complete, the final step is to evaluate the fine-tuned model on the same test split and compare the results with the base model. This helps us see whether LoRA fine-tuning improved the model's ability to classify emotions more accurately.
ft_model = trainer.model
ft_model.eval()
ft_model.config.use_cache = True
post_metrics, post_preds = evaluate_model(ft_model, processor, "test")
print(post_metrics)
These results are clearly stronger than the baseline. After fine-tuning, the model reaches 77.25% accuracy and a macro F1 score of 0.698. The number of invalid predictions also drops from 33 to 20.
| Metric | Pre-Fine-Tuning | Post-Fine-Tuning | Improvement |
|---|---|---|---|
| Accuracy | 58.25% | 77.25% | +19 pts |
| Macro F1 | 0.42 | 0.70 | +0.28 |
| Invalid Predictions | 33 | 20 | -13 |
Final Thoughts
Fine-tuning Gemma 4 is very sensitive to setup, especially the prompt structure and training arguments. If the prompt format is wrong, or you do not use the proper template consistently, the model may go through training without actually learning the task well.
Another important setting is max_length. If you reduce it too much, especially below around 125 tokens, the model may not learn the pattern properly at all. Most issues come back to the same two areas: prompt formatting and training configuration.
To improve the results further, a good next step would be to fine-tune on the full dataset and train for at least 3 epochs instead of just one. That would give the model more examples to learn from and more time to adapt, which should lead to stronger accuracy and F1 scores.
Important Note
If you run into any issues while running the code, refer to the full Jupyter notebook on Hugging Face as a complete reference.
Frequently Asked Questions
What is Gemma 4 E4B-it?
Gemma 4 E4B-it is an instruction-tuned model from the Gemma 4 family, which Google describes as its most intelligent open model family so far, built for strong reasoning and agentic workflows. The family comes in various sizes and is designed for flexibility across environments.
Why use 4-bit quantization for fine-tuning?
4-bit quantization reduces model size by approximately 4x, enabling large models like Gemma 4 to run on consumer GPUs like the 3090 with limited VRAM while maintaining reasonable accuracy.
What is LoRA fine-tuning?
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that attaches small trainable adapter weights to the base model instead of updating all parameters, making training much lighter.
How long does fine-tuning take?
On a single 3090 GPU with 4-bit quantization, fine-tuning for one epoch takes approximately 9 minutes with the dataset configuration used in this tutorial.
Can I improve results further?
Yes, you can improve results by using the full dataset instead of limited splits, training for 2-3 epochs instead of one, and experimenting with LoRA rank and learning rate settings.
Need Help with AI Model Fine-Tuning?
Our experts can help you configure and fine-tune Gemma 4 and other large language models for your specific use cases.
