How to Use Gradient Clipping to Prevent Exploding Gradients
Gradient clipping is the standard remedy for exploding gradients in deep neural networks. Whether you are training recurrent neural networks, transformers, or deep feedforward architectures, a one-line addition to your training loop can be the difference between a stable run and a loss that spikes to NaN. This guide explains how gradient clipping works, compares the clip-by-norm and clip-by-value strategies, and shows how to implement both in PyTorch and TensorFlow with practical code examples.
What You'll Learn:
- Why exploding gradients occur during backpropagation and which architectures are most vulnerable
- How gradient clipping acts as a safety mechanism between loss.backward() and optimizer.step()
- The difference between clip-by-norm (preserves direction) and clip-by-value (clamps elements)
- How to implement torch.nn.utils.clip_grad_norm_ in a PyTorch training loop
- How to use clipnorm and global_clipnorm in a TensorFlow/Keras optimizer
- How to choose the right clipping threshold and monitor gradient norms during training
- How gradient clipping differs from solutions for vanishing gradients and when to use each
What Are Exploding Gradients and Why Do They Happen?
During training, backpropagation computes the gradient of the loss with respect to every parameter in the network by applying the chain rule layer by layer. In most well-behaved networks this produces gradient values that are small and numerically stable — the optimizer steps a modest distance in parameter space and the loss decreases smoothly. Exploding gradients occur when this chain of multiplications produces values that grow without bound, resulting in parameter updates so large that the loss function spikes violently or produces NaN (not a number) and Inf (infinity) values.
The root cause is that the chain rule multiplies together Jacobian matrices across all layers. When those matrices have eigenvalues greater than one, successive multiplication causes exponential growth. Recurrent neural networks are the canonical example: backpropagation through time (BPTT) unrolls the network across every time step and multiplies the same weight matrix against itself at each step. Even a modest eigenvalue of 1.1 raised to the 100th power (for a 100-step sequence) produces a value of over 13,000. The gradient for the earliest time steps in the sequence therefore dwarfs those for recent steps, causing catastrophic parameter updates.
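The arithmetic behind that figure is easy to check. The sketch below uses a scalar factor as a stand-in for the dominant eigenvalue; the helper name `repeated_scaling` is hypothetical and exists only for this illustration.

```python
def repeated_scaling(factor, steps):
    """Multiply 1.0 by `factor` once per step, as the chain rule does per layer or time step."""
    value = 1.0
    for _ in range(steps):
        value *= factor
    return value


# A factor just below 1 vanishes, exactly 1 is stable, just above 1 explodes.
for factor in (0.9, 1.0, 1.1):
    result = repeated_scaling(factor, 100)
    print(f"factor={factor}: after 100 steps -> {result:.6g}")
```

The same loop with a factor of 0.9 shows the mirror-image problem: the product shrinks toward zero instead of growing.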
Transformer architectures are also susceptible, particularly during the early stages of training before learning rate warmup has finished and the optimization has settled. Very deep feedforward networks face the same risk when weight initialization is poorly chosen. The symptoms are consistent across all architectures: loss values that were decreasing cleanly suddenly spike to extremely large numbers or NaN, gradients reported in monitoring logs show magnitudes in the thousands or millions, and training either diverges permanently or produces a model with near-random behaviour.
Clip by Norm (Standard)
Computes the L2 norm of the entire gradient vector and scales every gradient down proportionally if that norm exceeds a threshold. This preserves the direction of the gradient update while shortening its magnitude: the optimizer still moves in the same direction, just not as far. It is the most widely used method. The Hugging Face Trainer clips by norm by default, and PyTorch Lightning and DeepSpeed use norm clipping when clipping is enabled.
Clip by Value
Clamps every individual gradient element to a fixed range such as [-1.0, 1.0] using a simple min/max operation. Fast to compute and straightforward to reason about. The significant drawback is that clipping individual components independently changes the direction of the gradient vector when some elements are clipped and others are not, potentially misaligning the optimizer's step with the true loss surface.
Training Stability
Gradient clipping acts as a safety mechanism inserted between the backpropagation step and the optimizer update step. It does not change the model architecture or the loss function — it simply bounds the maximum step size the optimizer is allowed to take in parameter space. This makes it compatible with every optimizer (Adam, SGD, AdamW, RMSprop) and every network architecture without any modification to the forward pass.
One-Line Implementation
In PyTorch, the entire clip-by-norm operation is a single function call — torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) — placed immediately after loss.backward() and before optimizer.step(). In TensorFlow/Keras, it is a single keyword argument on the optimizer constructor. Both frameworks handle the norm computation, threshold check, and in-place rescaling automatically.
How Gradient Clipping Works: A Five-Step Workflow
Understanding the internal mechanism of gradient clipping helps you reason about when it is active, what effect it has on convergence, and how to set the threshold appropriately. The process inserts a bounded transformation step between standard backpropagation and the optimizer parameter update.
The five steps below describe clip-by-norm, which is the standard approach. Clip-by-value follows the same workflow but replaces the norm-based rescaling with an element-wise clamp operation.
Understand Exploding Gradients and When Clipping Is Needed
Before adding gradient clipping to your training loop, confirm that exploding gradients are actually the problem. The clearest signal is a loss that was decreasing normally and suddenly spikes to a very large number or produces NaN/Inf. In PyTorch, torch.nn.utils.clip_grad_norm_ returns the total norm measured before clipping, so calling it with max_norm=float('inf') reports the norm without modifying any gradient. To see individual layer norms, iterate over model.named_parameters() and print p.grad.norm() for every parameter whose p.grad is not None. Gradient norms in the hundreds or thousands are a strong sign of exploding gradients. Gradient clipping is particularly important for RNNs, LSTMs, GRUs, and transformers but can be applied beneficially to any architecture as a safety net. You do not need to observe an explosion to add clipping; many practitioners add it proactively to all RNN and transformer training runs.
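As a rough illustration of that diagnostic, the sketch below builds a tiny hypothetical model and batch (neither comes from this article), runs one backward pass, and prints per-parameter and total gradient norms. It assumes PyTorch is installed.

```python
import math

import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical toy model and random batch, used only to produce some gradients.
model = nn.Sequential(nn.Linear(8, 16), nn.Tanh(), nn.Linear(16, 2))
inputs = torch.randn(4, 8)
targets = torch.randint(0, 2, (4,))

loss = nn.CrossEntropyLoss()(model(inputs), targets)
loss.backward()

# Per-parameter L2 norms; parameters that received no gradient are skipped.
per_param_norms = {
    name: p.grad.norm(2).item()
    for name, p in model.named_parameters()
    if p.grad is not None
}
for name, value in per_param_norms.items():
    print(f"{name:10s} grad norm = {value:.4f}")

# With max_norm=inf nothing exceeds the threshold, so the gradients are left
# unchanged and the call only reports the total norm.
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=float("inf"))
print(f"total grad norm = {total_norm.item():.4f}")
```

The total norm is the L2 norm of the per-parameter norms, which is why one very large layer is enough to dominate it.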
Choose Your Clipping Method: Clip by Norm or Clip by Value
Select clip-by-norm unless you have a specific reason not to. It is the universally recommended approach because it preserves the direction of the gradient vector — only the step magnitude is reduced, not the relative contribution of individual parameters. The formula is: if the global L2 norm ||g|| exceeds the threshold c, multiply every gradient element by c / ||g||. This scales the entire gradient vector uniformly, keeping all parameter updates proportionally consistent. Use clip-by-value only for specific experimental scenarios where you want hard element-wise bounds regardless of direction — for example, certain reinforcement learning policy gradient methods. For standard supervised learning with RNNs or transformers, clip-by-norm is the correct default.
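The difference between the two methods can be seen on a three-element gradient. This is a minimal pure-Python sketch; the function names and the example vector are made up for illustration and are not framework APIs.

```python
import math


def l2_norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def clip_by_norm(grad, max_norm):
    """Scale the whole vector by max_norm / ||g|| when ||g|| exceeds max_norm."""
    norm = l2_norm(grad)
    if norm <= max_norm:
        return list(grad)
    scale = max_norm / norm
    return [x * scale for x in grad]


def clip_by_value(grad, clip_value):
    """Clamp each element independently to [-clip_value, clip_value]."""
    return [max(-clip_value, min(clip_value, x)) for x in grad]


def cosine(a, b):
    """Cosine similarity: 1.0 means the two vectors point the same way."""
    return sum(x * y for x, y in zip(a, b)) / (l2_norm(a) * l2_norm(b))


grad = [3.0, 4.0, 0.5]  # made-up gradient with one small component
by_norm = clip_by_norm(grad, 1.0)
by_value = clip_by_value(grad, 1.0)

print("clip-by-norm :", by_norm, "cosine with original =", cosine(grad, by_norm))
print("clip-by-value:", by_value, "cosine with original =", cosine(grad, by_value))
```

Clip-by-norm leaves the cosine similarity at 1, while clip-by-value flattens the two large components to the same value and so tilts the update toward the small one.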
Set the Right Clipping Threshold for Your Architecture
The threshold (called max_norm in PyTorch and clipnorm or global_clipnorm in TensorFlow) controls when clipping activates. A starting value of 1.0 works well for the majority of architectures; it is the default in the Hugging Face Trainer and a common setting in PyTorch Lightning and DeepSpeed configurations. The practical range is 0.5 to 5.0. To calibrate for your specific model, run several training steps with clipping disabled and log the gradient norm at each step. Identify the typical (median) norm during stable training and the peak norm during any instability events. Set your threshold above the typical median, so that clipping does not slow down stable training, but below the problematic peak values. A threshold that is too high never activates and provides no protection. A threshold that is too low clips during every step, effectively reducing your learning rate and slowing convergence. For language model fine-tuning, 1.0 is a reliable choice. For custom RNN architectures trained from scratch, experiment in the 0.5 to 2.0 range.
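The calibration rule above (above the median, below the peak) can be turned into a quick check. The sketch below is illustrative only: the helper names and the logged norms are invented for the example.

```python
import statistics


def summarize_norms(norms):
    """Median and peak of gradient norms logged with clipping disabled."""
    return {"median": statistics.median(norms), "peak": max(norms)}


def threshold_is_reasonable(norms, threshold):
    """True when the threshold sits above the typical norm but below the spikes."""
    stats = summarize_norms(norms)
    return stats["median"] < threshold < stats["peak"]


# Made-up log: mostly stable norms around 0.4 with two instability spikes.
logged_norms = [0.42, 0.38, 0.51, 0.47, 0.44, 12.7, 0.40, 0.45, 38.2, 0.43]

print(summarize_norms(logged_norms))
for candidate in (0.3, 1.0, 50.0):
    print(candidate, threshold_is_reasonable(logged_norms, candidate))
```

A candidate of 0.3 would clip almost every stable step, and 50.0 would never activate, so only 1.0 passes the check for this log.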
Implement Gradient Clipping in PyTorch or TensorFlow
In PyTorch, insert torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) in your training loop after loss.backward() and before optimizer.step(). This single line computes the global gradient norm, applies the scaling if needed, and modifies gradients in-place. The function also returns the pre-clipping norm as a scalar tensor, which you can log for monitoring. In TensorFlow/Keras, pass global_clipnorm=1.0 (or clipnorm=1.0 for per-tensor clipping) as a keyword argument to your optimizer constructor; the clipping is then applied automatically to every gradient before every parameter update. No changes to the model definition or the training loop body are required. Both implementations are covered in detail with full code examples in the sections below.
Monitor Gradient Norms and Tune the Threshold
After adding clipping, log the pre-clip gradient norm at each training step. In PyTorch, clip_grad_norm_ returns the norm before clipping; store this in a list or pass it to TensorBoard using writer.add_scalar('grad_norm', grad_norm, step). In TensorFlow, you can compute the global norm in a custom training step with tf.linalg.global_norm(gradients) before applying the optimizer. What to look for in the logs: if the norm is consistently well below your threshold, training is healthy and clipping is acting as a safety net. If the norm is frequently at or above your threshold, clipping is active on most steps, so consider whether a lower learning rate might be more appropriate than relying on clipping alone. If you see NaN norms, the explosion has already occurred upstream, and you may also need better weight initialization (such as Xavier or He) or a lower learning rate.
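The three cases described above can be sketched as a small log check. Everything here is hypothetical: the function names, the messages, and in particular the `busy_fraction=0.5` cut-off for "most steps", which is an arbitrary assumption rather than a recommended value.

```python
import math


def clip_fraction(pre_clip_norms, max_norm):
    """Fraction of finite logged norms that exceeded the clipping threshold."""
    finite = [n for n in pre_clip_norms if math.isfinite(n)]
    if not finite:
        return 0.0
    return sum(n > max_norm for n in finite) / len(finite)


def diagnose(pre_clip_norms, max_norm, busy_fraction=0.5):
    """Map a list of logged pre-clip norms to one of three coarse verdicts."""
    if any(not math.isfinite(n) for n in pre_clip_norms):
        return "non-finite norm: the explosion happened upstream of clipping"
    if clip_fraction(pre_clip_norms, max_norm) >= busy_fraction:
        return "clipping on most steps: consider a lower learning rate"
    return "healthy: clipping is acting as a safety net"


# Made-up logs for the three situations.
healthy = [0.3, 0.4, 0.5, 1.4, 0.35]
busy = [1.5, 2.0, 0.8, 3.1]
broken = [0.4, float("nan")]

for name, log in (("healthy", healthy), ("busy", busy), ("broken", broken)):
    print(name, "->", diagnose(log, max_norm=1.0))
```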
PyTorch Implementation: clip_grad_norm_
PyTorch provides torch.nn.utils.clip_grad_norm_ as the standard implementation of clip-by-norm. The function accepts an iterable of parameters (typically model.parameters()), a max_norm threshold, and an optional norm_type (default 2.0 for the standard L2 norm). It computes the total gradient norm across all parameter tensors, rescales all gradients in-place if the total exceeds max_norm, and returns the pre-clipping norm as a scalar tensor (call .item() to get a Python float). This return value is useful for monitoring and can be logged directly to TensorBoard or Weights and Biases.
```python
import torch
import torch.nn as nn

# ── Minimal placement in any PyTorch training loop ──────────────────────
# Assumes model, criterion, optimizer, inputs, and targets are already defined.

# Step 1: forward pass
outputs = model(inputs)
loss = criterion(outputs, targets)

# Step 2: backward pass, computes gradients for all parameters
loss.backward()

# Step 3: GRADIENT CLIPPING, insert here, after backward, before step
#   - model.parameters() : all parameter tensors to clip
#   - max_norm=1.0       : L2 norm threshold (starting point for most models)
#   - norm_type=2.0      : use L2 norm (default; set to float('inf') for max-norm)
# Returns: pre-clipping total norm as a scalar tensor (useful for logging / monitoring)
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Step 4: optimizer update, uses clipped gradients
optimizer.step()
optimizer.zero_grad()

# Optional: log the pre-clip norm to track gradient health over time
print(f"Gradient norm before clipping: {grad_norm.item():.4f}")
```
The key argument details for clip_grad_norm_ are worth knowing precisely. The parameters argument accepts any iterable of tensors: you can pass model.parameters() to clip the entire network, or a filtered list such as a single layer's parameters if you only want to clip part of the network. The max_norm argument is a float and sets the threshold on the total norm. The norm_type argument defaults to 2.0 (standard Euclidean norm) but can be set to any positive float; for example, float('inf') clips based on the maximum absolute gradient value across all parameters. The function operates in-place, which means the gradient tensors themselves are modified. The return value is a scalar tensor holding the total gradient norm computed across all parameters before any clipping was applied; call .item() on it when you need a Python float.
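To show what "total norm across all parameter tensors" means, here is a pure-Python sketch of the same computation on nested lists. It is not PyTorch's implementation: the function names are hypothetical, and the `1e-6` term is an assumed guard against division by zero.

```python
import math


def global_norm(grads, norm_type=2.0):
    """Norm of all gradient elements taken together, across every tensor."""
    flat = [abs(x) for g in grads for x in g]
    if math.isinf(norm_type):
        return max(flat)  # infinity norm: largest absolute element
    return sum(v ** norm_type for v in flat) ** (1.0 / norm_type)


def clip_grads_by_global_norm(grads, max_norm, norm_type=2.0):
    """Return (rescaled gradients, pre-clip total norm)."""
    total = global_norm(grads, norm_type)
    # The scale is capped at 1.0 so gradients below the threshold are unchanged.
    scale = min(1.0, max_norm / (total + 1e-6))
    return [[x * scale for x in g] for g in grads], total


# Two made-up "parameter tensors".
grads = [[3.0, 0.0], [0.0, 4.0]]

clipped, pre_clip = clip_grads_by_global_norm(grads, max_norm=1.0)
print("pre-clip L2 norm  :", pre_clip)
print("post-clip L2 norm :", global_norm(clipped))
print("pre-clip inf norm :", global_norm(grads, float("inf")))
```

Returning the pre-clip norm alongside the rescaled gradients is what makes a single call usable for both clipping and monitoring.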
```python
import torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter

# MyRNNModel, train_loader, and num_epochs are placeholders for your own
# model class, DataLoader, and epoch count.
model = MyRNNModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
writer = SummaryWriter()

MAX_GRAD_NORM = 1.0  # Threshold: start here, tune based on logged norms
global_step = 0

for epoch in range(num_epochs):
    for batch_idx, (inputs, targets) in enumerate(train_loader):
        optimizer.zero_grad()

        # ── Forward pass ──
        outputs = model(inputs)
        loss = criterion(outputs, targets)

        # ── Backward pass ──
        loss.backward()

        # ── Gradient clipping (MUST be after backward, before step) ──
        # .item() converts the returned scalar tensor to a Python float
        grad_norm = torch.nn.utils.clip_grad_norm_(
            model.parameters(),
            max_norm=MAX_GRAD_NORM,
            norm_type=2.0,  # L2 norm (standard)
        ).item()

        # ── Log norms for monitoring ──
        writer.add_scalar('train/grad_norm_pre_clip', grad_norm, global_step)
        writer.add_scalar('train/loss', loss.item(), global_step)
        if grad_norm > MAX_GRAD_NORM:
            print(f"[Step {global_step}] Clipped: norm={grad_norm:.2f} > {MAX_GRAD_NORM}")

        # ── Optimizer step uses clipped gradients ──
        optimizer.step()
        global_step += 1

writer.close()
```
TensorFlow and Keras Implementation
TensorFlow's Keras API integrates gradient clipping directly into the optimizer constructor, making it the simplest possible integration — you pass the clipping parameter once and it is applied automatically on every call to optimizer.apply_gradients() or through model.compile() and model.fit(). There are three clipping keyword arguments available, each with a different scope and behaviour.
clipnorm applies clip-by-norm to each individual gradient tensor separately. Each parameter's own gradient is clipped if its L2 norm exceeds the threshold, independently of other parameters. global_clipnorm computes the global L2 norm across all gradients (exactly matching PyTorch's clip_grad_norm_ behaviour) and rescales all gradients proportionally if the global norm exceeds the threshold. For most use cases, global_clipnorm is the closer match to the standard recommended approach. clipvalue implements clip-by-value and clamps every gradient element to the range [-clipvalue, clipvalue].
```python
import tensorflow as tf

# Pick ONE of the options below; each assignment replaces the previous optimizer.
# model, train_dataset, and val_dataset are placeholders defined elsewhere.

# ── Option A: global_clipnorm (recommended, matches PyTorch clip_grad_norm_)
# Clips the GLOBAL L2 norm across ALL gradients simultaneously.
# Preserves relative gradient directions. Standard for transformers and RNNs.
optimizer = tf.keras.optimizers.Adam(
    learning_rate=1e-3,
    global_clipnorm=1.0,  # Global norm threshold
)

# ── Option B: clipnorm (per-tensor norm clipping)
# Each gradient tensor is clipped independently by its own L2 norm.
# Slightly different semantics from global_clipnorm.
optimizer = tf.keras.optimizers.Adam(
    learning_rate=1e-3,
    clipnorm=1.0,  # Per-tensor norm threshold
)

# ── Option C: clipvalue (clip-by-value, element-wise)
# Clamps every gradient element to the range [-1.0, 1.0].
# Simple but changes gradient direction. Use sparingly.
optimizer = tf.keras.optimizers.Adam(
    learning_rate=1e-3,
    clipvalue=1.0,  # Element-wise clamp range
)

# ── Usage: compile and train as normal; clipping is applied automatically
model.compile(optimizer=optimizer, loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.fit(train_dataset, epochs=10, validation_data=val_dataset)
```
```python
import tensorflow as tf

# model, train_dataset, and num_epochs are placeholders defined elsewhere.
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
MAX_GRAD_NORM = 1.0
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

@tf.function
def train_step(x_batch, y_batch):
    with tf.GradientTape() as tape:
        logits = model(x_batch, training=True)
        loss = loss_fn(y_batch, logits)
    # Compute raw gradients
    gradients = tape.gradient(loss, model.trainable_variables)
    # Record the global norm BEFORE clipping for monitoring
    global_norm = tf.linalg.global_norm(gradients)
    # Apply clip-by-norm manually (equivalent to global_clipnorm in optimizer)
    gradients, _ = tf.clip_by_global_norm(gradients, MAX_GRAD_NORM)
    # Apply clipped gradients
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss, global_norm

# Training loop
for epoch in range(num_epochs):
    for step, (x_batch, y_batch) in enumerate(train_dataset):
        loss, grad_norm = train_step(x_batch, y_batch)
        if step % 100 == 0:
            # Convert the returned tensors to Python floats before formatting
            print(f"Step {step}: loss={float(loss):.4f}, grad_norm={float(grad_norm):.4f}")
```
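The practical difference between per-tensor clipping (the clipnorm behaviour) and global clipping (the global_clipnorm behaviour) shows up when one layer's gradient is much larger than another's. The following pure-Python sketch mimics both on made-up numbers; the function names are hypothetical and no TensorFlow call is involved.

```python
import math


def l2(vec):
    return math.sqrt(sum(x * x for x in vec))


def clip_per_tensor(grads, threshold):
    """Clip each gradient tensor by its own L2 norm (clipnorm-style)."""
    out = []
    for g in grads:
        norm = l2(g)
        scale = threshold / norm if norm > threshold else 1.0
        out.append([x * scale for x in g])
    return out


def clip_global(grads, threshold):
    """Clip all tensors by one shared factor from the global L2 norm (global_clipnorm-style)."""
    total = l2([x for g in grads for x in g])
    scale = threshold / total if total > threshold else 1.0
    return [[x * scale for x in g] for g in grads]


# Made-up gradients: layer A has norm 10.0, layer B has norm 0.5.
grads = [[6.0, 8.0], [0.3, 0.4]]

per_tensor = clip_per_tensor(grads, 1.0)
shared = clip_global(grads, 1.0)

print("original ratio A/B  :", l2(grads[0]) / l2(grads[1]))
print("per-tensor ratio A/B:", l2(per_tensor[0]) / l2(per_tensor[1]))
print("global ratio A/B    :", l2(shared[0]) / l2(shared[1]))
```

Per-tensor clipping shrinks only the large layer and so changes the balance between layers, whereas global clipping keeps the original ratio.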
Clip by Value vs Clip by Norm: A Detailed Comparison
Understanding the technical difference between these two methods prevents a common mistake: choosing clip-by-value because it seems simpler, without realizing that it changes the gradient direction and can produce inconsistent parameter updates. The table below summarizes the key distinctions.
| Property | Clip by Value | Clip by Norm |
|---|---|---|
| Scope of operation | Each individual gradient element independently | The entire gradient vector as a whole unit |
| Effect on direction | Changes direction when elements are clipped unevenly | Preserves direction — uniform scaling of all components |
| Effect on magnitude | Caps the absolute value of each element; the total norm can still reach threshold × sqrt(n) | Caps the total L2 norm at the threshold; gradients already below it are unchanged |
| Computation cost | O(n) — simple min/max on each element | O(n) — norm computation + scalar multiply |
| PyTorch function | torch.nn.utils.clip_grad_value_(params, clip_value) | torch.nn.utils.clip_grad_norm_(params, max_norm) |
| TensorFlow keyword | clipvalue=1.0 on optimizer constructor | global_clipnorm=1.0 on optimizer constructor |
| Recommended use | Specific RL applications; niche experimental scenarios | Universal standard — all supervised learning, RNNs, transformers |
| Industry default | Uncommon | Yes — Hugging Face, Lightning, DeepSpeed all use norm clipping |
Exploding Gradients vs Vanishing Gradients: Critical Differences
Gradient clipping is the correct solution for exploding gradients, but it has no effect on vanishing gradients. These are two distinct problems with different causes, different symptoms, and entirely different remedies. Confusing them leads to applying the wrong fix and wasting time debugging training that will never improve regardless of how gradient clipping is tuned.
| Dimension | Exploding Gradients | Vanishing Gradients |
|---|---|---|
| What happens | Gradient magnitudes grow exponentially through layers | Gradient magnitudes shrink exponentially toward zero |
| Primary cause | Eigenvalues greater than 1 in weight matrices; BPTT with long sequences | Eigenvalues less than 1; saturating activations (sigmoid, tanh in deep nets) |
| Observable symptoms | Loss spikes to NaN or Inf; loss oscillates wildly; model diverges | Loss plateaus very early; early layers learn slowly or not at all |
| Most affected layers | Long-range dependencies in RNNs; potentially every layer once an update diverges | Early (shallow) layers in deep networks |
| Gradient norm values | Very large (hundreds, thousands, or NaN) | Very small (approaching machine epsilon or exactly zero) |
| Primary fix | Gradient clipping (clip-by-norm) | Residual connections, LSTM/GRU cells, skip connections |
| Supporting fixes | Lower learning rate; gradient-friendly initialization | Xavier/He initialization; Batch Normalization; ReLU activations |
| Does clipping help? | Yes — directly prevents the oversized parameter update | No — clipping near-zero gradients does nothing useful |
Gradient Clipping Fixes Exploding Gradients ONLY — Not Vanishing
A very common mistake is applying gradient clipping when training stalls and seeing no improvement, then concluding that clipping does not work. If the real problem is vanishing gradients — gradients that are near zero rather than explosively large — then clipping cannot help because there is nothing to clip. Check your gradient norms first: if they are large and the loss is spiking, use clipping. If they are very small and the loss is stuck, use residual connections, LSTM/GRU architectures, BatchNorm, or better weight initialization instead. Gradient clipping and vanishing gradient solutions are complementary, not interchangeable.
Gradient Clipping in RNNs and Transformers
Recurrent neural networks remain the most vulnerable architecture for exploding gradients because backpropagation through time (BPTT) involves multiplying by the same weight matrix at every time step. For a sequence of length T, the gradient of the loss at time step T with respect to the hidden state at time step 1 involves T - 1 multiplications by the recurrent weight matrix. When the spectral radius (largest absolute eigenvalue) of that matrix exceeds one, these repeated multiplications cause exponential growth. Even the gating mechanisms of LSTMs and GRUs do not completely eliminate this risk: they reduce the probability of explosions but do not prevent them entirely under adverse initialization or long sequences.
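The repeated multiplication can be written out with the chain rule. The sketch below assumes a vanilla RNN with recurrence $h_t = \phi(W h_{t-1} + U x_t)$; the symbols $U$, $\phi$, $a_t$, and $L_T$ are introduced here for illustration and are not defined elsewhere in this article. The product is ordered from $t = T$ down to $t = 2$.

```latex
\frac{\partial L_T}{\partial h_1}
  = \frac{\partial L_T}{\partial h_T}
    \prod_{t=2}^{T} \frac{\partial h_t}{\partial h_{t-1}},
\qquad
\frac{\partial h_t}{\partial h_{t-1}}
  = \operatorname{diag}\!\bigl(\phi'(a_t)\bigr)\, W,
\qquad
a_t = W h_{t-1} + U x_t .
```

In the linear case, where $\phi$ is the identity, the product collapses to $W^{T-1}$, whose norm grows roughly like $\rho(W)^{T-1}$ when the spectral radius $\rho(W)$ is greater than one.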
The paper that popularized gradient norm clipping, Pascanu et al. (2013), identified RNNs as the primary use case and suggested choosing the threshold from the gradient norms observed during training. The technique has since become a standard option in all major training frameworks. In Hugging Face Transformers, TrainingArguments sets max_grad_norm=1.0 by default, so the Trainer class clips by norm unless you change it. PyTorch Lightning does not clip by default; you enable it by passing gradient_clip_val (for example 0.5 or 1.0) to its Trainer. DeepSpeed exposes gradient clipping through the gradient_clipping key in its configuration.
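For reference, these are the settings that control norm clipping in each framework, written as plain Python dictionaries so the snippet runs without any of the libraries installed. The option names (`max_grad_norm`, `gradient_clip_val`, `gradient_clip_algorithm`, `gradient_clipping`) are the frameworks' own; the output directory and batch size are hypothetical placeholder values.

```python
# Keyword arguments and config keys only; nothing here imports the frameworks.

# Hugging Face Transformers: pass these to transformers.TrainingArguments(...)
hf_training_args = {
    "output_dir": "out",  # hypothetical output directory
    "max_grad_norm": 1.0,  # global-norm clipping threshold
}

# PyTorch Lightning: pass these to the Lightning Trainer(...)
lightning_trainer_kwargs = {
    "gradient_clip_val": 1.0,
    "gradient_clip_algorithm": "norm",  # "value" selects clip-by-value instead
}

# DeepSpeed: an entry in the JSON or dict configuration
deepspeed_config = {
    "train_batch_size": 32,  # hypothetical value
    "gradient_clipping": 1.0,
}

for name, cfg in (
    ("Hugging Face", hf_training_args),
    ("Lightning", lightning_trainer_kwargs),
    ("DeepSpeed", deepspeed_config),
):
    print(name, cfg)
```

Check the version of each library you are using before relying on these names, since defaults and accepted values can change between releases.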
Transformer models trained on long contexts also benefit from gradient clipping. The attention mechanism can produce large gradient magnitudes in the early training stages before the model has learned to distribute attention smoothly. Clipping with a threshold of 1.0 is standard practice for GPT-style and BERT-style pretraining runs and appears in many published large language model training recipes.
Key Insight: One Line That Prevents Training Crashes
Gradient clipping is one of the highest-value-to-effort additions you can make to any RNN or transformer training loop. A single call to torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0) — inserted between loss.backward() and optimizer.step() — prevents the most common cause of training divergence in recurrent and attention-based architectures. It costs virtually nothing in terms of compute (one norm computation and one scalar multiply per step) and does not change model architecture, loss function, or optimizer. For any new RNN or transformer training run, adding this line as a baseline precaution is best practice even before you observe any gradient instability.
Complementary Techniques and What They Do Not Replace
Several other techniques are often mentioned alongside gradient clipping, and it is important to understand exactly what each one does and does not address. None of the following techniques replace gradient clipping for preventing exploding gradients — they address related but distinct problems.
Batch Normalization normalizes the activations within each layer to have zero mean and unit variance during the forward pass. This produces more predictable gradient magnitudes during backpropagation and reduces internal covariate shift, making training faster and more stable overall. However, BatchNorm does not directly prevent gradient explosions — it stabilizes the forward pass, and the backward pass through a batch-normalized layer still involves a Jacobian that can in principle produce large gradients. Use BatchNorm for general training stability, but do not rely on it as a substitute for gradient clipping in RNNs or transformers.
Residual Connections (Skip Connections) provide an additive shortcut path from earlier layers to later layers, ensuring that gradients can flow back through the network without passing through every intermediate transformation. This directly addresses vanishing gradients in deep networks: because a residual block computes x + F(x), its Jacobian is the identity plus the Jacobian of F, so the shortcut always passes the upstream gradient through unchanged. Residual connections do not address exploding gradients, because the F branch can still contribute large Jacobians on top of the identity term.
Xavier and He Initialization scale the initial weight values by functions of the layer dimensions (for normally distributed weights, a standard deviation of sqrt(2/fan_in) for He and sqrt(2/(fan_in + fan_out)) for Xavier). This keeps activations and gradients at approximately unit variance at the start of training, reducing the risk of both exploding and vanishing gradients in the first few steps. Good initialization dramatically reduces the probability of seeing gradient instability early in training but cannot prevent it as the weights update away from their initial values over the course of training. Use proper initialization as your first line of defence and gradient clipping as a persistent safety net throughout the entire training run.
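The two formulas above are simple enough to evaluate directly. This sketch computes the standard deviations and draws a sample of He-initialized weights to confirm their spread; the function names, layer sizes, and sample count are hypothetical, and only the Python standard library is used.

```python
import math
import random


def he_std(fan_in):
    """Standard deviation for He (normal) initialization."""
    return math.sqrt(2.0 / fan_in)


def xavier_std(fan_in, fan_out):
    """Standard deviation for Xavier (normal) initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))


print("He std, fan_in=512             :", he_std(512))
print("Xavier std, fan_in=fan_out=256 :", xavier_std(256, 256))

# Draw zero-mean weights with the He scale and measure their empirical spread.
random.seed(0)
weights = [random.gauss(0.0, he_std(512)) for _ in range(10_000)]
empirical_std = math.sqrt(sum(w * w for w in weights) / len(weights))
print("empirical std of sampled weights:", round(empirical_std, 4))
```

Wider layers get smaller initial weights, which is what keeps the variance of each layer's output roughly constant at the start of training.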
Frequently Asked Questions
What is the best max_norm value to start with for gradient clipping?
A threshold of 1.0 is the standard starting point and works well for the majority of architectures including RNNs, LSTMs, and transformers. It is the default used by Hugging Face Transformers and DeepSpeed. The practical tuning range is 0.5 to 5.0. To calibrate precisely, run a few hundred training steps with clipping disabled, log the gradient norm at each step, then set your threshold above the typical median norm but below the peak spike values. If training is stable without clipping, a threshold of 1.0 still serves as a useful safety net with negligible cost.
Does gradient clipping slow down training or hurt model accuracy?
When the threshold is set appropriately — above the typical gradient norm during stable training — clipping activates rarely and has no meaningful effect on final accuracy. The compute overhead is negligible: one norm calculation and one scalar multiply per training step. If clipping is active on most training steps (indicating the threshold is too low), it effectively reduces the learning rate and can slow convergence, but this is a tuning issue rather than an inherent flaw in the technique. Setting the threshold based on observed gradient norms ensures clipping only activates during genuine instability events.
Where exactly in a PyTorch training loop should clip_grad_norm_ be called?
The call must be placed after loss.backward() (which computes the gradients) and before optimizer.step() (which applies them to the parameters). Calling it before backward() has no effect because the gradients do not exist yet. Calling it after optimizer.step() clips gradients that have already been applied, which has no effect on the current step. The correct sequence is always: loss.backward() then clip_grad_norm_ then optimizer.step() then optimizer.zero_grad().
Should I use clipnorm or global_clipnorm in TensorFlow?
global_clipnorm is the closer match to PyTorch's clip_grad_norm_ and to the standard gradient clipping literature. It computes the L2 norm across all gradient tensors simultaneously and scales them all down proportionally if the global norm exceeds the threshold — preserving the relative magnitudes between different layers. clipnorm clips each individual gradient tensor independently by its own norm, which can result in different scaling factors for different layers. For most supervised learning tasks, either works, but global_clipnorm is semantically more consistent with the standard algorithm.
Can gradient clipping fix NaN losses, or is it too late once NaN appears?
Gradient clipping prevents the oversized parameter update that would cause a future NaN, but once parameters contain NaN values, all subsequent forward passes will also produce NaN. If your loss has already gone NaN, adding clipping to the current run will not recover it — you need to restart training with clipping enabled from the beginning. When debugging a NaN loss, check whether it occurred at a specific step and look at the gradient norm at that step. If the norm was in the hundreds or thousands just before the NaN appeared, exploding gradients are the confirmed cause and clipping will prevent it in the next run. Also ensure your learning rate is not excessively high, as a high learning rate is a common co-contributor to gradient explosions.
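Clipping cannot repair a gradient norm that is already NaN or Inf, because rescaling a non-finite value leaves it non-finite. One common precaution, which goes slightly beyond what is described above, is to skip the optimizer update whenever the pre-clip norm is not finite. The sketch below simulates that decision on made-up norms; `should_apply_update` is a hypothetical helper, not a framework function.

```python
import math


def should_apply_update(grad_norm):
    """Return True only when the pre-clip gradient norm is a finite number."""
    return math.isfinite(grad_norm)


# Made-up sequence of pre-clip norms from five training steps.
simulated_norms = [0.8, 1.2, float("nan"), 0.9, float("inf")]

applied_steps, skipped_steps = [], []
for step, norm in enumerate(simulated_norms):
    if should_apply_update(norm):
        applied_steps.append(step)  # in a real loop: optimizer.step()
    else:
        skipped_steps.append(step)  # in a real loop: zero the gradients and log the event

print("applied:", applied_steps)
print("skipped:", skipped_steps)
```

In PyTorch, clip_grad_norm_ also accepts error_if_nonfinite=True, which raises an error instead of silently continuing when the total norm is NaN or Inf.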
About the author
Co-founder & AI Practice Lead, Braincuber Technologies
Co-founder at Braincuber. Builds production AI agents (Anthropic Claude, OpenAI, AWS Bedrock) for US fintech, healthcare, and retail clients with SOC 2 Type II / HIPAA-scope deployments. Joins every architecture review personally.
