ResNet Architecture: How Skip Connections Revolutionized Deep Learning
By Braincuber Team
Published on April 17, 2026
ResNet, short for Residual Network, is a groundbreaking deep learning architecture introduced by Microsoft Research in 2015 that fundamentally changed how we train deep neural networks. By introducing the innovative concept of skip connections, ResNet made it possible to reliably train networks with 50, 100, or even 150+ layers without performance degradation.
What You'll Learn:
- Why deep neural networks are difficult to train
- How residual learning differs from traditional approaches
- The mechanism of skip connections and their benefits
- ResNet block architectures (basic vs. bottleneck)
- How to implement ResNet in PyTorch
- Real-world applications and transfer learning strategies
Why Deep Networks Are Hard to Train
Intuitively, more layers should give a neural network more capacity to learn complex patterns. In practice, past a certain depth, training starts to break down. Two fundamental problems make deep networks hard to train.
The Vanishing Gradient Problem
Neural networks learn through backpropagation, sending error signals backward through the network. Each layer adjusts its weights based on that signal. However, as the signal travels back through many layers, it gets multiplied by small numbers repeatedly and shrinks exponentially. By the time it reaches the early layers, there is almost nothing left to guide their learning.
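A toy calculation makes the effect concrete. Suppose (purely for illustration) that each layer scales the gradient by a factor of 0.5:

```python
# If each of 50 layers scales the gradient by 0.5, the signal
# that reaches the first layer is vanishingly small.
signal = 1.0
for _ in range(50):
    signal *= 0.5
print(signal)  # about 8.9e-16 -- effectively zero
```

Real networks do not shrink the gradient by a fixed factor, but repeated multiplication by values below 1 produces the same exponential decay.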
The Degradation Problem
This problem is counterintuitive. You would expect a 56-layer network to perform at least as well as a 20-layer one since it has more capacity. However, researchers discovered the opposite: deeper networks performed worse, even on training data. This rules out overfitting as the cause. The model is not memorizing too much; it is simply unable to find good weights during optimization.
Key Insight
These are optimization problems, not generalization problems. You cannot fix them with dropout or regularization alone. ResNet was specifically designed to solve these training difficulties.
The Core Idea: Residual Learning
Traditional neural networks try to learn a direct mapping from input to output. Each layer examines what came in and attempts to determine what should come out. This approach works fine for shallow networks but breaks down as networks go deeper.
With ResNet, instead of asking each block to learn the full mapping, the architecture asks a simpler question: what do I need to add to the input to get the right output?
That difference is called the residual. Instead of learning H(x) directly, the network learns F(x) = H(x) - x, which represents the small correction needed. If the layer does not need to change anything, it can push F(x) toward zero and pass the input through unchanged.
| Approach | What It Learns | Difficulty |
|---|---|---|
| Traditional CNN | H(x), the full transformation | Hard for deep networks |
| ResNet | F(x) = H(x) - x, the residual | Easier optimization |
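The idea can be sketched in a few lines of PyTorch. The `Residual` wrapper below is illustrative, not the paper's exact block: the wrapped layers learn the correction F(x), and the block outputs F(x) + x. If the wrapped layers output zeros, the block is a perfect identity map.

```python
import torch
import torch.nn as nn

class Residual(nn.Module):
    """Toy residual wrapper: the wrapped layers learn F(x)."""
    def __init__(self, layers: nn.Module):
        super().__init__()
        self.layers = layers  # learns the correction F(x)

    def forward(self, x):
        return self.layers(x) + x  # H(x) = F(x) + x

# With zeroed weights, F(x) = 0 and the block passes x through unchanged.
block = Residual(nn.Linear(4, 4))
nn.init.zeros_(block.layers.weight)
nn.init.zeros_(block.layers.bias)

x = torch.randn(2, 4)
assert torch.equal(block(x), x)  # identity recovered when F(x) = 0
```

This is why "do nothing" is easy for a residual block: pushing F(x) toward zero is much simpler than learning an identity transformation through a stack of nonlinear layers.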
What Are Skip Connections in ResNet?
A skip connection is exactly what it sounds like: a direct path that bypasses one or more layers and feeds the input directly to a later point in the network. In a traditional network, data flows through each layer in sequence. Skip connections take the original input and add it directly to the output of a layer further down the block.
Direct Gradient Path
During backpropagation, gradients can travel back through skip connections without passing through intermediate layers, keeping the signal strong.
Easier Optimization
Learning a small correction is much easier than learning a full transformation from scratch, making deeper networks trainable.
Structure of a ResNet Block
A residual block is the repeating unit that makes up a ResNet. Understanding one block gives you a clear picture of how the entire network operates.
Input Entry
The input x enters the block and splits into two paths: the main path through convolutions and the skip connection path.
Convolutional Layers
One path goes through two convolutional layers, each followed by batch normalization and a ReLU activation function.
Skip Connection
The other path skips those layers entirely: this is the identity mapping where the input passes through unchanged.
Addition Step
Both paths meet at an addition operation where F(x) + x is computed. A final ReLU activation is applied.
Projection Shortcuts
For the addition to work, both paths must produce tensors of the same shape. When dimensions change, ResNet applies a 1x1 convolution on the skip path to reshape x to match.
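A minimal sketch of a projection shortcut (layer names here are illustrative): when the main path halves the spatial size and doubles the channels, a 1x1 convolution with the same stride brings x to the matching shape before the addition.

```python
import torch
import torch.nn as nn

# Main path changes shape: 64 -> 128 channels, spatial size halved by stride 2.
main_path = nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1)
# Projection shortcut: a 1x1 convolution with the same stride reshapes x to match.
projection = nn.Conv2d(64, 128, kernel_size=1, stride=2)

x = torch.randn(1, 64, 32, 32)
out = main_path(x) + projection(x)  # shapes now agree: (1, 128, 16, 16)
print(out.shape)
```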
Types of ResNet Architectures
ResNet comes in several standard variants, each named after its total number of layers. The right choice depends on your priorities: speed, accuracy, or a balance between both.
| Variant | Block Type | Best For |
|---|---|---|
| ResNet-18 / ResNet-34 | Basic block | Limited hardware, fast prototyping |
| ResNet-50 | Bottleneck block | Most practical work (default) |
| ResNet-101 / ResNet-152 | Bottleneck block | Research, accuracy priority |
ResNet Bottleneck Architecture
Deeper ResNets (starting from ResNet-50) use a different block design called the bottleneck block. This three-layer design keeps computation manageable as depth increases.
Bottleneck Block Structure
The first and last 1x1 convolutions act as a bottleneck: they compress the data before the expensive 3x3 convolution runs, then restore it afterward. A 3x3 convolution on a high-channel input is computationally heavy. By reducing channels first, the bottleneck block achieves greater depth without a proportional jump in compute cost.
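A sketch of a bottleneck block following the standard 1x1-3x3-1x1 pattern. This is illustrative and simplified; torchvision's production implementation differs in minor details.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Sketch of the three-layer bottleneck block used from ResNet-50 onward."""
    expansion = 4  # the final 1x1 conv expands channels by this factor

    def __init__(self, in_channels, mid_channels, stride=1):
        super().__init__()
        out_channels = mid_channels * self.expansion
        self.conv1 = nn.Conv2d(in_channels, mid_channels, kernel_size=1, bias=False)   # compress
        self.bn1 = nn.BatchNorm2d(mid_channels)
        self.conv2 = nn.Conv2d(mid_channels, mid_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)                   # cheap 3x3
        self.bn2 = nn.BatchNorm2d(mid_channels)
        self.conv3 = nn.Conv2d(mid_channels, out_channels, kernel_size=1, bias=False)  # restore
        self.bn3 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.shortcut = nn.Sequential()  # identity unless shapes change
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.relu(self.bn2(self.conv2(out)))
        out = self.bn3(self.conv3(out))
        return self.relu(out + self.shortcut(x))

block = Bottleneck(256, 64)  # 256 -> 64 -> 64 -> 256 channels
x = torch.randn(2, 256, 14, 14)
print(block(x).shape)        # shape preserved: (2, 256, 14, 14)
```

The expensive 3x3 convolution runs on 64 channels instead of 256, which is where the compute savings come from.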
Implementing ResNet in PyTorch
Here is a step-by-step guide to implementing a basic ResNet block and the full ResNet-18 architecture in PyTorch.
```python
import torch
import torch.nn as nn


class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, in_channels, out_channels, stride=1):
        super(BasicBlock, self).__init__()
        # Main path: two 3x3 convolutions, each followed by batch normalization
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)

        # Skip path: identity by default, 1x1 projection when shapes change
        self.shortcut = nn.Sequential()
        if stride != 1 or in_channels != out_channels * self.expansion:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels * self.expansion,
                          kernel_size=1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels * self.expansion)
            )

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out += self.shortcut(x)  # the addition step: F(x) + x
        out = self.relu(out)
        return out


class ResNet18(nn.Module):
    def __init__(self, num_classes=1000):
        super(ResNet18, self).__init__()
        self.in_channels = 64
        # Stem: 7x7 convolution followed by max pooling
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        # Four stages of two basic blocks each (2+2+2+2 blocks -> 18 layers total)
        self.layer1 = self._make_layer(BasicBlock, 64, 2, stride=1)
        self.layer2 = self._make_layer(BasicBlock, 128, 2, stride=2)
        self.layer3 = self._make_layer(BasicBlock, 256, 2, stride=2)
        self.layer4 = self._make_layer(BasicBlock, 512, 2, stride=2)
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512, num_classes)

    def _make_layer(self, block, out_channels, num_blocks, stride):
        # Only the first block in a stage downsamples
        strides = [stride] + [1] * (num_blocks - 1)
        layers = []
        for s in strides:
            layers.append(block(self.in_channels, out_channels, s))
            self.in_channels = out_channels * block.expansion
        return nn.Sequential(*layers)

    def forward(self, x):
        x = self.relu(self.bn1(self.conv1(x)))
        x = self.maxpool(x)
        x = self.layer1(x)
        x = self.layer2(x)
        x = self.layer3(x)
        x = self.layer4(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)
        return x


model = ResNet18(num_classes=1000)
print(model)
```
For transfer learning, load a pretrained ResNet-50 from torchvision, freeze the backbone, and replace the classification head:

```python
import torch
import torch.nn as nn
import torchvision.models as models

num_classes = 10  # replace with the number of classes in your dataset

# Load ImageNet-pretrained weights
# (on older torchvision versions, use models.resnet50(pretrained=True))
model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Freeze the backbone so pretrained features are preserved
for param in model.parameters():
    param.requires_grad = False

# Replace the classification head with one sized for your dataset
model.fc = nn.Linear(model.fc.in_features, num_classes)
for param in model.fc.parameters():
    param.requires_grad = True

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.fc.parameters(), lr=0.001)
```
How ResNet Solves the Vanishing Gradient Problem
The vanishing gradient problem stems from distance. The further a gradient has to travel through a network, the more it shrinks during backpropagation. By the time it reaches the early layers, there is not enough signal to drive meaningful learning.
Skip connections solve this by providing gradients with a shorter path. During backpropagation, gradients do not have to pass through every layer in sequence. They can travel back through the skip connection directly, completely bypassing the convolutional layers. This shortcut keeps the gradient large enough to actually update the early layers.
Gradient Flow Comparison
Traditional CNN
Gradient: Layer 1 -> Layer 2 -> Layer 3 -> ... -> Layer N
Result: Signal diminishes exponentially
ResNet with Skip Connections
Gradient: Layer N --skip--> Layer 1 (direct path)
Result: Strong gradient signal preserved
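The difference is easy to demonstrate numerically. The sketch below (illustrative; simplified linear layers with tanh, not a full CNN) stacks the same 50 layers with and without skip connections and compares the gradient that reaches the input:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

depth = 50
# Small weights make the vanishing effect visible at this depth.
layers = [nn.Linear(16, 16) for _ in range(depth)]
for layer in layers:
    nn.init.normal_(layer.weight, std=0.05)
    nn.init.zeros_(layer.bias)

def input_grad_norm(use_skip: bool) -> float:
    x = torch.ones(1, 16, requires_grad=True)
    out = x
    for layer in layers:
        h = torch.tanh(layer(out))
        out = out + h if use_skip else h  # the skip connection toggles here
    out.sum().backward()
    return x.grad.norm().item()

plain = input_grad_norm(use_skip=False)
residual = input_grad_norm(use_skip=True)
print(f"plain: {plain:.2e}, residual: {residual:.2e}")
# With skips, the gradient reaching the input is orders of magnitude larger.
```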
Real-World Applications
ResNet architecture powers a wide range of real-world computer vision applications. Its pretrained weights and proven effectiveness make it a go-to choice for many production systems.
Image Classification
ResNet won the ImageNet challenge in 2015 and remains a top choice for classifying images into categories. Used for medical scans, satellite imagery, and product photos.
Object Detection
Frameworks like Faster R-CNN and Mask R-CNN use ResNet for feature extraction. ResNet handles the feature extraction while detection heads identify and localize objects.
Transfer Learning
Instead of training from scratch, load pretrained ResNet weights and fine-tune on your dataset. Pretrained weights encode useful low-level features like edges and textures.
Feature Extraction
Run images through pretrained ResNet and extract outputs from later layers. These dense representations feed into simpler classifiers or clustering algorithms.
ResNet vs Traditional CNN Architectures
Traditional CNNs and ResNets both learn features from images, but they differ significantly in their approach to information flow and training dynamics.
| Aspect | Traditional CNN | ResNet |
|---|---|---|
| Data Flow | Sequential through all layers | Parallel paths with skip connections |
| Gradient Flow | Through every layer (vanishes) | Direct path via skip connections |
| Training Depth | Limited (degradation problem) | 50-152+ layers reliably |
| Optimization | Difficult for deep networks | Smoother, more reliable |
Advantages and Limitations
Advantages
- Train networks with 100+ layers reliably
- Stable training with predictable convergence
- Strong performance on image benchmarks
- Excellent transfer learning capabilities
- Well-supported across all major frameworks
Limitations
- Computationally heavy for deeper variants
- High memory usage during training
- May be overkill for simpler tasks
- Superseded in some areas by EfficientNet and Vision Transformers
ResNet in Modern Deep Learning
More than a decade after its introduction, ResNet architecture remains a cornerstone of modern deep learning. Most practitioners still reach for ResNet when they need a reliable baseline for computer vision tasks. It is well-understood, well-supported, and pretrained weights are available in every major deep learning library.
ResNet's influence extends beyond its own variants. The core idea of adding shortcuts between layers to help information and gradients flow became a standard building block. DenseNet improved on this by connecting every layer to every other layer. Vision transformers incorporate residual connections inside each transformer block, following the same principle ResNet introduced.
Newer architectures like EfficientNet, ConvNeXt, and vision transformers have pushed performance further in specific areas. But they built on top of what ResNet established rather than replacing it entirely. When you need something that works without extensive experimentation, ResNet is still a solid choice.
Frequently Asked Questions
What is ResNet and why was it important?
ResNet (Residual Network) is a deep learning architecture introduced by Microsoft Research in 2015. It solved two problems that made training deep networks difficult: vanishing gradients and the degradation problem. Skip connections made it possible to reliably train networks with 50, 100, or even 150+ layers for the first time.
What are skip connections in a neural network?
A skip connection is a direct path that bypasses one or more layers and adds the input straight to the output of a later layer. This gives both data and gradients a shortcut through the network, keeping the gradient signal strong enough to update early layers during training.
What is the vanishing gradient problem?
The vanishing gradient problem occurs when gradients shrink exponentially as they travel backward through a deep network. By the time the signal reaches early layers, it is too weak to update them effectively. ResNet addresses this through skip connections that allow gradients to flow back directly without passing through intermediate layers.
What is the difference between ResNet's basic block and bottleneck block?
The basic block uses two 3x3 convolutional layers and is found in ResNet-18 and ResNet-34. The bottleneck block, used in ResNet-50 and deeper variants, uses a 1x1-3x3-1x1 convolution sequence. This design reduces computation by compressing channels before the expensive 3x3 convolution and restoring them afterward.
How do I choose the right ResNet variant for my project?
For most practical work, ResNet-50 is the recommended default as it balances depth, accuracy, and compute cost. ResNet-18 and ResNet-34 are faster options for limited hardware or prototyping. ResNet-101 and ResNet-152 make sense when accuracy is the priority and compute resources are not constrained.
Need Help Implementing ResNet in Your Project?
Our deep learning experts can help you implement ResNet architectures, set up transfer learning pipelines, and optimize your computer vision models for production.
