How to Fine-Tune Gemma 4: Complete Step-by-Step Guide
By Braincuber Team
Published on April 20, 2026
Google has introduced Gemma 4, describing it as its most intelligent open model family so far, built for strong reasoning and agentic workflows. Gemma models are designed to be flexible across environments, with official support and tooling for local development, cloud deployment, and model customization, which makes them a strong choice for fine-tuning projects.
In this complete tutorial, we will fine-tune Gemma 4 E4B-it on a human emotion classification dataset from Hugging Face. We will set up a 3090 GPU environment, load and inspect the dataset, prepare and format the data for supervised fine-tuning, load the base model, run baseline evaluation before training, fine-tune the model, and then evaluate its performance again after training.
What You'll Learn:
- Setting up a RunPod environment with 3090 GPU
- Loading and preparing the emotion classification dataset
- Formatting data for Gemma 4 fine-tuning with chat templates
- Loading Gemma 4 with 4-bit quantization
- Evaluating baseline model performance
- Fine-tuning with LoRA adapters
- Evaluating and comparing post-fine-tuning results
1. Setting Up the Environment
Start by launching a new RunPod instance, and make sure your account has at least $5 in credit before you begin. For this tutorial, choose a 3090 GPU pod and select the latest PyTorch template.
Before deploying, open the template settings and make a few updates. Increase both the container disk and volume disk to 40 GB so you have enough space for the model, dataset, cached files, and training checkpoints.
You should also add your Hugging Face token as an environment variable. You can generate this token from Settings > Access Tokens in your Hugging Face account.
Deploy RunPod Instance
Choose a 3090 GPU pod with the latest PyTorch template and deploy.
Hugging Face Token Required
You need a Hugging Face token with access to the gated gemma-4-E4B-it model. Generate one from Settings > Access Tokens.
Once these settings are in place, go ahead and deploy the pod. It may take a minute or two for the instance to start. After it is ready, open the JupyterLab interface so you can begin working inside the environment.
The first thing to do in JupyterLab is create a new Python notebook and install all the required Python packages. Run the following command in a notebook cell:
%%capture
!pip install -U transformers accelerate datasets trl peft bitsandbytes scikit-learn huggingface_hub
These packages will cover the full workflow, including loading the dataset, preparing the model, fine-tuning, and evaluation.
The last step is to sign in to the Hugging Face Hub using your saved token. This gives you access to the gated model and also makes it easier to upload files, create repositories, and push your fine-tuned model later.
import os
from huggingface_hub import login
hf_token = os.environ.get("HF_TOKEN")
if not hf_token:
    raise ValueError("Set HF_TOKEN in the RunPod environment before running this notebook.")
login(token=hf_token)
print("Logged in to Hugging Face.")
2. Load and Prepare the Emotion Dataset
Now that the environment is ready, the next step is to load the emotion dataset from Hugging Face and prepare smaller splits for training and evaluation.
For this tutorial, we are not using the full dataset. Instead, we create limited train, validation, and test splits so the fine-tuning process stays faster and easier to run on a single GPU.
from datasets import load_dataset, DatasetDict
TRAIN_LIMIT = 4000
VALIDATION_LIMIT = 400
TEST_LIMIT = 400
EVAL_LIMIT = 400
raw_dataset = load_dataset("dair-ai/emotion")
def maybe_limit(split, limit):
    split = split.shuffle(seed=42)
    if limit is None:
        return split
    return split.select(range(min(limit, len(split))))

dataset = DatasetDict({
    "train": maybe_limit(raw_dataset["train"], TRAIN_LIMIT),
    "validation": maybe_limit(raw_dataset["validation"], VALIDATION_LIMIT),
    "test": maybe_limit(raw_dataset["test"], TEST_LIMIT),
})
dataset
The final dataset contains 4,000 training examples, 400 validation examples, and 400 test examples.
| Split | Examples |
|---|---|
| Train | 4,000 |
| Validation | 400 |
| Test | 400 |
Next, we look at the label names stored in the dataset. These are the emotion classes the model will learn to predict.
label_names = dataset["train"].features["label"].names
label_names
This shows that the task has six emotion categories: sadness, joy, love, anger, fear, and surprise.
| Label ID | Emotion |
|---|---|
| 0 | sadness |
| 1 | joy |
| 2 | love |
| 3 | anger |
| 4 | fear |
| 5 | surprise |
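Before formatting anything, it is worth checking how balanced these classes are, since dair-ai/emotion tends to be skewed toward joy and sadness. A minimal sketch with `collections.Counter` (the label IDs below are toy stand-ins; in the notebook you would pass `dataset["train"]["label"]`):

```python
from collections import Counter

label_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]

# Toy stand-in for dataset["train"]["label"]; replace with the real column
example_label_ids = [0, 1, 1, 3, 1, 0, 4, 1, 2, 0]

counts = Counter(label_names[i] for i in example_label_ids)
print(counts.most_common())  # most frequent classes first
```

If the real split is heavily imbalanced, macro F1 (used later in evaluation) will be the more honest metric, since it weights every class equally.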
3. Formatting Data for Gemma 4 Fine-Tuning
Before we can fine-tune the model, we need to convert the dataset into the format Gemma 4 will use during training.
Instead of passing only raw text and labels, we structure each example as a short chat interaction with a system message, a user message, and the expected assistant response.
The system prompt tells the model exactly what task it should perform. In this case, we want the model to act as an emotion classification assistant and return only one of the six allowed labels.
SYSTEM_PROMPT = """You are an emotion classification assistant.
Read the user's text and answer with exactly one label.
Only choose from: sadness, joy, love, anger, fear, surprise.
Return only the label and nothing else."""
Now we create a function to format the data into the prompt-completion format required for supervised fine-tuning:
def to_prompt_completion(example):
    text = example["text"]
    label = label_names[example["label"]]
    return {
        "prompt": [
            {
                "role": "system",
                "content": SYSTEM_PROMPT,
            },
            {
                "role": "user",
                "content": f"Classify the emotion of this text:\n{text}",
            },
        ],
        "completion": [
            {
                "role": "assistant",
                "content": label,
            }
        ],
    }
sft_dataset = dataset.map(to_prompt_completion, remove_columns=dataset["train"].column_names)
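It is worth spot-checking one mapped example (for instance `sft_dataset["train"][0]`) against the prompt-completion schema TRL expects: a list of chat messages under `prompt` and the target assistant turn under `completion`. A standalone sketch of that expected shape, with an abbreviated system prompt:

```python
label_names = ["sadness", "joy", "love", "anger", "fear", "surprise"]
example = {"text": "i feel so happy today", "label": 1}
system_prompt = "You are an emotion classification assistant."  # abbreviated

formatted = {
    "prompt": [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Classify the emotion of this text:\n{example['text']}"},
    ],
    "completion": [
        {"role": "assistant", "content": label_names[example["label"]]},
    ],
}

# TRL's prompt-completion format: chat message lists under these two keys
assert set(formatted) == {"prompt", "completion"}
assert formatted["completion"][0]["content"] == "joy"
```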
4. Load Gemma 4 E4B-it With 4-Bit Quantization
Now we can load Gemma 4 E4B-it and prepare it for fine-tuning. Since this is a relatively large model, we load it with 4-bit quantization to reduce memory usage and make it easier to run on a 3090 GPU. We also use bfloat16 as the compute type, which helps keep the setup efficient.
4-Bit Quantization
Reduces model size by 4x, enabling large models to run on consumer GPUs with minimal accuracy loss.
bfloat16 Precision
Modern floating point format that balances precision and performance for deep learning training.
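A quick back-of-envelope calculation shows why quantization matters on a 24 GB card. The 4B parameter count below is illustrative, not the exact Gemma 4 E4B size:

```python
# Rough weight-memory estimates for a hypothetical 4B-parameter model
params = 4e9

bytes_bf16 = params * 2    # 16-bit weights: 2 bytes each
bytes_nf4 = params * 0.5   # 4-bit NF4 weights: ~0.5 bytes each,
                           # ignoring quantization constants and overhead

print(f"bf16: {bytes_bf16 / 1e9:.1f} GB")  # 8.0 GB
print(f"nf4:  {bytes_nf4 / 1e9:.1f} GB")   # 2.0 GB
```

The weights are only part of the picture (activations, optimizer state, and KV cache add more), but a 4x reduction in weight memory is what makes this model comfortable on a 3090.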
We start by importing the required libraries and defining the main model settings:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
MODEL_ID = "google/gemma-4-E4B-it"
MODEL_DTYPE = torch.bfloat16
USE_4BIT = True
Next, we enable CUDA optimizations and load the tokenizer:
if torch.cuda.is_available():
    torch.backends.cuda.matmul.allow_tf32 = True
    torch.backends.cudnn.allow_tf32 = True

processor = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
if processor.pad_token is None:
    processor.pad_token = processor.eos_token
Now we prepare the quantization settings and model loading arguments:
bnb_config = None
model_kwargs = {
    "device_map": "auto",
}

if USE_4BIT:
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=MODEL_DTYPE,
    )
    model_kwargs["quantization_config"] = bnb_config
else:
    model_kwargs["torch_dtype"] = MODEL_DTYPE
Finally, we load the model and align its configuration with the tokenizer:
base_model = AutoModelForCausalLM.from_pretrained(MODEL_ID, **model_kwargs)
base_model.config.use_cache = False
base_model.config.pad_token_id = processor.pad_token_id
base_model.config.bos_token_id = processor.bos_token_id
base_model.config.eos_token_id = processor.eos_token_id
base_model.generation_config.pad_token_id = processor.pad_token_id
base_model.generation_config.bos_token_id = processor.bos_token_id
base_model.generation_config.eos_token_id = processor.eos_token_id
print(f"Base model loaded with 4-bit={USE_4BIT} and dtype={MODEL_DTYPE}.")
5. Evaluate the Base Model
Before fine-tuning, it is useful to evaluate the base model first so we have a clear baseline to compare against later.
In this section, we define a few helper functions that generate predictions, extract valid emotion labels, and run evaluation on the test split.
import re
LABEL_PATTERN = re.compile(r"(sadness|joy|love|anger|fear|surprise)", re.IGNORECASE)
def extract_label(raw_text: str) -> str:
    raw_text = raw_text.strip().lower()
    match = LABEL_PATTERN.search(raw_text)
    if match:
        return match.group(1)
    tokens = raw_text.split()
    return tokens[0].strip(".,!?:;'()[]{}\"") if tokens else ""

def generate_label(model, processor, user_text, system_prompt, max_new_tokens=4):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Classify the emotion of this text:\n{user_text}"},
    ]
    device = next(model.parameters()).device
    inputs = processor.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True,
        return_dict=True, return_tensors="pt",
    ).to(device)
    input_len = inputs["input_ids"].shape[-1]
    with torch.no_grad():
        outputs = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False,
            pad_token_id=processor.pad_token_id, eos_token_id=processor.eos_token_id,
        )
    raw_pred = processor.decode(outputs[0][input_len:], skip_special_tokens=True).strip()
    return extract_label(raw_pred)
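Before running the full loop, you can sanity-check the extraction logic on a few synthetic model outputs. The function is re-defined here so the snippet stands alone:

```python
import re

LABEL_PATTERN = re.compile(r"(sadness|joy|love|anger|fear|surprise)", re.IGNORECASE)

def extract_label(raw_text: str) -> str:
    # Prefer a regex hit anywhere in the output; fall back to the first token
    raw_text = raw_text.strip().lower()
    match = LABEL_PATTERN.search(raw_text)
    if match:
        return match.group(1)
    tokens = raw_text.split()
    return tokens[0].strip(".,!?:;'()[]{}\"") if tokens else ""

assert extract_label("Joy!") == "joy"
assert extract_label("The emotion is sadness.") == "sadness"
assert extract_label("") == ""
assert extract_label("neutral") == "neutral"  # not a valid label; caught later
```

The last case matters: `extract_label` can still return an off-list word, which is why the evaluation below maps anything outside the six labels to "INVALID" instead of trusting the raw string.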
Now we run the baseline evaluation on the test split:
from sklearn.metrics import accuracy_score, f1_score
import pandas as pd
from tqdm.auto import tqdm
VALID_LABELS = set(label_names)
def evaluate_model(model, processor, split="test", limit=EVAL_LIMIT):
    y_true, y_pred, rows = [], [], []
    raw_source = dataset[split]
    if limit is not None:
        raw_source = raw_source.select(range(min(limit, len(raw_source))))
    model.eval()
    for ex in tqdm(raw_source, desc=f"Evaluating {split}", leave=False):
        true_label = label_names[ex["label"]]
        raw_pred_label = generate_label(model, processor, ex["text"], SYSTEM_PROMPT)
        pred_label = raw_pred_label if raw_pred_label in VALID_LABELS else "INVALID"
        y_true.append(true_label)
        y_pred.append(pred_label)
        rows.append({"text": ex["text"], "true_label": true_label, "pred_label": pred_label, "correct": true_label == pred_label})
    metrics = {
        "accuracy": accuracy_score(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, labels=label_names, average="macro", zero_division=0),
        "invalid_predictions": sum(1 for p in y_pred if p == "INVALID"),
        "evaluated_examples": len(y_true),
    }
    df = pd.DataFrame(rows)
    return metrics, df
pre_metrics, pre_preds = evaluate_model(base_model, processor, "test")
print(pre_metrics)
These baseline results show that the untuned model already performs reasonably well, but there is still room for improvement. The accuracy is around 58.25%, the macro F1 score is around 0.42, and the model produced 33 invalid predictions.
| Metric | Pre-Fine-Tuning |
|---|---|
| Accuracy | 58.25% |
| Macro F1 | 0.42 |
| Invalid Predictions | 33 |
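Beyond the headline numbers, it helps to see which classes the errors concentrate in; minority classes like love and surprise usually suffer most, which is why macro F1 lags well behind accuracy here. A minimal per-class tally (toy prediction pairs below; in the notebook you would iterate over the rows of pre_preds instead):

```python
from collections import defaultdict

# Toy (true_label, pred_label) pairs standing in for pre_preds rows
pairs = [
    ("joy", "joy"), ("joy", "love"), ("sadness", "sadness"),
    ("love", "joy"), ("fear", "fear"), ("anger", "INVALID"),
]

per_class = defaultdict(lambda: {"correct": 0, "total": 0})
for true, pred in pairs:
    per_class[true]["total"] += 1
    per_class[true]["correct"] += int(true == pred)

for label, c in sorted(per_class.items()):
    print(f"{label}: {c['correct']}/{c['total']}")
```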
6. Fine-Tune Gemma 4 With LoRA
Now that we have the baseline results, we can fine-tune Gemma 4 using LoRA. LoRA is a parameter-efficient fine-tuning method, which means we do not update the full model. Instead, we attach a small number of trainable adapter weights on top of the base model. This makes training much lighter and more practical on a single GPU.
Configure LoRA
Define LoRA settings including rank 16, alpha 32, and dropout 0.05.
Set Up Trainer
Configure training arguments including batch size, learning rate, and epochs.
Train and Save
Run training for one epoch and save the fine-tuned adapter.
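Conceptually, LoRA replaces each adapted weight matrix W with W + (alpha/r) · B·A, where only the low-rank factors A and B are trained. A tiny sketch of that update (illustrative shapes, not the real model dimensions):

```python
import torch

out_dim, in_dim = 8, 8
r, alpha = 2, 4                       # the real run uses r=16, alpha=32

W = torch.randn(out_dim, in_dim)      # frozen base weight
A = torch.randn(r, in_dim)            # trainable low-rank factor
B = torch.zeros(out_dim, r)           # trainable, zero-initialized

# B starts at zero, so the adapter initially adds nothing and the
# model begins training behaving exactly like the base model.
W_adapted = W + (alpha / r) * (B @ A)
print(torch.allclose(W_adapted, W))   # True before any training

# Trainable parameters: r*(in_dim + out_dim) vs in_dim*out_dim for full FT
print(A.numel() + B.numel(), W.numel())  # 32 64
```

At rank 16 on a multi-billion-parameter model, this ratio is what keeps the trainable parameter count in the tens of millions rather than billions.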
We start by defining the LoRA configuration:
from peft import LoraConfig
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules="all-linear",
)
Next, we define the training configuration and set up the trainer:
from trl import SFTConfig, SFTTrainer
training_args = SFTConfig(
    output_dir="./gemma4-emotion-lora",
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=2,
    learning_rate=1e-4,
    weight_decay=0.01,
    lr_scheduler_type="linear",
    warmup_steps=50,
    num_train_epochs=1,
    logging_steps=50,
    eval_strategy="steps",
    gradient_checkpointing=True,
    bf16=True,
    fp16=False,
    tf32=True,
    max_length=256,
    packing=False,
    completion_only_loss=True,
    remove_unused_columns=False,
    dataloader_num_workers=2,
    optim="paged_adamw_8bit",
    report_to="none",
)
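With these numbers, the effective batch size is per_device_train_batch_size × gradient_accumulation_steps, which in turn determines how many optimizer steps one epoch takes on the 4,000-example train split:

```python
per_device_bs = 8
grad_accum = 2
train_examples = 4000

effective_bs = per_device_bs * grad_accum          # 16 examples per optimizer step
steps_per_epoch = train_examples // effective_bs   # 250 steps in one epoch

print(effective_bs, steps_per_epoch)  # 16 250
```

So warmup_steps=50 covers the first 20% of this one-epoch run; if you later train on the full dataset or for more epochs, scale it accordingly.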
Now we initialize the trainer with the LoRA configuration:
from peft import PeftModel
if isinstance(base_model, PeftModel):
    base_model = base_model.unload()

base_model.config.use_cache = False

trainer = SFTTrainer(
    model=base_model,
    train_dataset=sft_dataset["train"],
    eval_dataset=sft_dataset["validation"],
    peft_config=lora_config,
    args=training_args,
    processing_class=processor,
)
Now we can start training. The training typically takes about 9 minutes on a 3090 GPU:
trainable_params = 0
for param in trainer.model.parameters():
    if param.requires_grad:
        trainable_params += param.numel()
print(f"Trainable LoRA parameters: {trainable_params:,}")
train_result = trainer.train()
trainer.model.eval()
trainer.model.config.use_cache = True
Once training is complete, we can save the adapter and tokenizer locally:
trainer.model.save_pretrained("./gemma4-emotion-lora")
processor.save_pretrained("./gemma4-emotion-lora")
Finally, we can push the model to the Hugging Face Hub:
repo_id = "your-username/gemma4-emotion-lora"
trainer.model.push_to_hub(repo_id, private=False)
processor.push_to_hub(repo_id, private=False)
7. Evaluate the Fine-Tuned Model
Now that training is complete, the final step is to evaluate the fine-tuned model on the same test split and compare the results with the base model. This helps us see whether LoRA fine-tuning improved the model's ability to classify emotions more accurately.
ft_model = trainer.model
ft_model.eval()
ft_model.config.use_cache = True
post_metrics, post_preds = evaluate_model(ft_model, processor, "test")
print(post_metrics)
These results are clearly stronger than the baseline. After fine-tuning, the model reaches 77.25% accuracy and a macro F1 score of 0.698. The number of invalid predictions also drops from 33 to 20.
| Metric | Pre-Fine-Tuning | Post-Fine-Tuning | Improvement |
|---|---|---|---|
| Accuracy | 58.25% | 77.25% | +19 pts |
| Macro F1 | 0.42 | 0.70 | +0.28 |
| Invalid Predictions | 33 | 20 | -13 |
Final Thoughts
Fine-tuning Gemma 4 is very sensitive to setup, especially the prompt structure and training arguments. If the prompt format is wrong, or you do not use the proper template consistently, the model may go through training without actually learning the task well.
Another important setting is max_length. If you reduce it too much, especially below around 125 tokens, the model may not learn the pattern properly at all. Most issues come back to the same two areas: prompt formatting and training configuration.
To improve the results further, a good next step would be to fine-tune on the full dataset and train for at least 3 epochs instead of just one. That would give the model more examples to learn from and more time to adapt, which should lead to stronger accuracy and F1 scores.
Important Note
If you run into any issues while running the code, refer to the full Jupyter notebook on Hugging Face as a complete reference.
Frequently Asked Questions
What is Gemma 4 E4B-it?
Gemma 4 E4B-it is an instruction-tuned model from the Gemma 4 family, which Google describes as its most intelligent open model family so far, built for strong reasoning and agentic workflows. The family comes in various sizes and is designed for flexibility across environments.
Why use 4-bit quantization for fine-tuning?
4-bit quantization reduces model size by approximately 4x, enabling large models like Gemma 4 to run on consumer GPUs like the 3090 with limited VRAM while maintaining reasonable accuracy.
What is LoRA fine-tuning?
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning method that attaches small trainable adapter weights to the base model instead of updating all parameters, making training much lighter.
How long does fine-tuning take?
On a single 3090 GPU with 4-bit quantization, fine-tuning for one epoch takes approximately 9 minutes with the dataset configuration used in this tutorial.
Can I improve results further?
Yes, you can improve results by using the full dataset instead of limited splits, training for 2-3 epochs instead of one, and experimenting with LoRA rank and learning rate settings.
Need Help with AI Model Fine-Tuning?
Our experts can help you configure and fine-tune Gemma 4 and other large language models for your specific use cases.
