How to Diagnose Overfitting vs Underfitting in Machine Learning
Overfitting and underfitting are the two ways a machine learning model fails to generalize. This step-by-step guide for beginners shows how to tell the two apart and how to fix each one. You will learn how to read the gap between training and test performance, how to plot and interpret learning curves, how the bias-variance tradeoff explains both failure modes, and which concrete fixes (regularization, early stopping, more data, simpler or more complex models) actually help. By the end you will be able to look at a model's training and validation scores, name the problem, and apply the right correction instead of guessing.
What You'll Learn:
- What overfitting and underfitting actually mean and how they differ
- How to diagnose the problem from the gap between training and test error
- How to plot and read learning curves to identify high bias vs high variance
- How the bias-variance tradeoff ties both failure modes together
- Fixes such as L1/L2 regularization, early stopping, and dropout, plus cross-validation to confirm that a fix worked
- How overfitting and underfitting show up in linear models, decision trees, and neural networks
- A practical scikit-learn diagnostic that scores a model on train vs test data
- The common diagnostic mistakes that hide overfitting from beginners
What Are Overfitting and Underfitting?
Every supervised machine learning model is trying to learn a pattern from training data that also holds on data it has never seen. Generalization is the goal: a model that performs well not just on the examples it was trained on, but on new, real-world inputs. Overfitting and underfitting are the two opposite ways a model fails to generalize, and almost every model-quality problem you will encounter is a version of one of them.
Underfitting happens when a model lacks sufficient complexity to capture the patterns in the data. The model is too simple for the job. The telltale signature is that training accuracy is low and test accuracy is low — both are bad together. An underfit model has not even learned the training set well, so there is no chance it generalizes. It is like trying to fit a straight line through data that clearly curves: the line is wrong everywhere.
Overfitting happens when a model memorizes the training data instead of learning generalizable patterns. The model is too complex relative to the signal in the data, so it starts fitting the noise. The telltale signature is that training accuracy looks great, but test accuracy is horrible. The model has essentially memorized the answers to the training examples — including their random quirks — and falls apart the moment it sees something new.
Train-Test Gap Analysis
The fastest diagnostic is comparing training error to test error. A small gap with high errors on both means underfitting. A large gap — low training error but high test error — means overfitting. A small gap with low errors on both is the optimal model you are aiming for. This single comparison tells you which problem you have before you run any other analysis.
Learning Curve Diagnosis
Learning curves plot training and validation performance as training progresses or as data grows. Underfit curves both flatten quickly at a high error level. Overfit curves show the training curve dropping toward zero error while the validation curve stays elevated. Plotted against training iterations, that gap typically widens as the model memorizes; plotted against training set size, it starts large and narrows only slowly as data is added. The shape of the two curves is one of the clearest visual diagnostics available.
Bias-Variance Tradeoff
Underfitting is high bias: rigid assumptions produce consistent but wrong predictions. Overfitting is high variance: the model is so sensitive to its particular training sample that retraining on slightly different data produces very different predictions. The bias-variance tradeoff is the lens that unifies both failure modes, and the key insight is that you cannot eliminate either one; you can only shift the balance between them.
Regularization & Early Stopping
When you confirm overfitting, regularization and early stopping are usually the first things to try. L1/L2 regularization penalizes large weights to keep the model simple; early stopping halts training once validation performance has stopped improving, in practice after a few consecutive checks without improvement. Both are cheap, fast, and often effective. For neural networks, dropout adds another defense against memorization.
The Key Difference: How Each One Behaves
The cleanest way to internalize overfitting vs underfitting is to watch how training and validation performance behave together. They move differently in each case, and that difference is your primary diagnostic.
Underfitting looks flat and mediocre. Performance is poor and roughly equal across both the training and validation sets. The model never learns substantial patterns, so adding more training time or more data does not help much — both curves plateau at a disappointing level. The model simply is not capable of representing the underlying relationship.
Overfitting looks like a growing gap. Training scores keep improving, sometimes approaching perfection, while test or validation scores stall or actively decline. The widening distance between the two curves is the fingerprint of a model that is memorizing rather than learning: past that point, further gains on the training data no longer carry over to new data.
How to Diagnose: A Step by Step Guide
Diagnosing overfitting vs underfitting is a repeatable process. Follow these six steps in order and you will be able to name the problem and pick the right fix every time. Each step builds on the previous one, moving from raw measurement to a concrete remediation decision.
Split Your Data
You cannot diagnose generalization with one dataset. Use train_test_split to hold out a portion of your data — typically 20-30% — that the model never sees during training. This test set is your ground truth for generalization. For more rigorous work, also reserve a separate validation set for tuning so the test set stays untouched until the very end. Without a held-out set, every accuracy number you see is contaminated by the data the model already memorized, and overfitting becomes invisible.
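As a minimal sketch of the split described above, the snippet below calls train_test_split twice to produce train, validation, and test sets. The synthetic arrays and the 60/20/20 proportions are assumptions chosen for illustration, not a recommendation for every dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data (assumption): 1000 samples with 5 features
rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 5))
y = rng.normal(size=1000)

# First carve off the final test set: 20% of all data
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Then split the remainder into train and validation.
# 25% of the remaining 80% equals 20% of the original data.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # prints: 600 200 200
```

Tune hyperparameters against the validation set only, and score on the test set once at the end.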
Compare Train vs Test Error
Score the model on both sets and compare. A small gap with high errors on both means underfitting — the model is too weak. A large gap with low training error but high test error means overfitting — the model memorized the training set. A small gap with low errors on both is the optimal result you want. This three-way comparison is the single most informative diagnostic in machine learning, and it takes two lines of code with model.score.
Plot Learning Curves
Numbers tell you what; learning curves tell you why. Plot training and validation error against either training set size or training iterations. Underfit curves both flatten quickly at high error levels and converge close together, so adding data will not help. Overfit curves show the training curve dropping toward near-zero error while the validation curve stays stubbornly elevated; the large, persistent gap is direct evidence of memorization. scikit-learn's learning_curve computes training and validation scores at increasing training set sizes, while validation_curve computes them across values of a single hyperparameter; both return score arrays that you then plot yourself.
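The following sketch shows one way to get the numbers behind a learning curve with scikit-learn's learning_curve. The noisy sine data, the degree-3 model, and the chosen training sizes are illustrative assumptions. The scores are printed rather than plotted to keep the example dependency-free.

```python
import numpy as np
from sklearn.model_selection import KFold, learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Synthetic noisy sine data (assumption for illustration)
rng = np.random.RandomState(0)
X = rng.uniform(0, 1, 200).reshape(-1, 1)
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.25, X.shape[0])

model = make_pipeline(PolynomialFeatures(3), LinearRegression())
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Absolute training sizes; each 5-fold training split has 160 samples
sizes, train_scores, val_scores = learning_curve(
    model, X, y,
    train_sizes=[16, 50, 100, 160],
    cv=cv,
    scoring="neg_mean_squared_error",
)

# Scores are negated MSE, so flip the sign and average over folds
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)

for n, tr, va in zip(sizes, train_mse, val_mse):
    print(f"n={n:>4d} | train MSE={tr:.3f} | val MSE={va:.3f}")
```

Two curves that meet at a high error point to underfitting. A training curve far below the validation curve points to overfitting.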
Identify Bias or Variance
Translate the symptom into its root cause. Underfitting is high bias: the model makes rigid assumptions and produces consistent but incorrect predictions across different samples. Overfitting is high variance: the model is so sensitive to the specifics of its training data that retraining on a slightly different sample changes its predictions substantially. Naming the problem as bias or variance points you directly to the correct family of fixes: you increase capacity to cut bias, and you constrain capacity or add data to cut variance.
Apply the Right Fix
For underfitting: increase model complexity or flexibility, add meaningful features, train longer, and reduce regularization constraints. For overfitting: expand the dataset, apply L1/L2 regularization to penalize large weights, reduce model complexity, use dropout for neural networks, and apply early stopping when validation performance plateaus. Regularization and early stopping are usually the first things to try because they are cheap and effective. Change one thing at a time so you can attribute the improvement.
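Early stopping can be sketched with any iteratively trained model. The example below uses GradientBoostingRegressor as a stand-in (an assumption, since the text does not name a model) and replays its boosting rounds with staged_predict, stopping once validation error has not improved for a set number of rounds. The patience value and the data are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic noisy sine data (assumption for illustration)
rng = np.random.RandomState(0)
X = rng.uniform(0, 1, 300).reshape(-1, 1)
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.25, X.shape[0])
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Fit more boosting rounds than needed, then replay them one at a time
model = GradientBoostingRegressor(n_estimators=300, random_state=0)
model.fit(X_train, y_train)

patience = 10  # hypothetical: rounds to wait without improvement
best_mse, best_round, waited = np.inf, 0, 0
for n_rounds, pred in enumerate(model.staged_predict(X_val), start=1):
    mse = mean_squared_error(y_val, pred)
    if mse < best_mse:
        best_mse, best_round, waited = mse, n_rounds, 0
    else:
        waited += 1
        if waited >= patience:
            break  # validation error has stopped improving

print(f"Stop at round {best_round} (validation MSE {best_mse:.3f})")
```

Many libraries offer this behavior as a built-in option; the explicit loop is shown only to make the stopping rule visible.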
Validate with Cross-Validation
A single train-test split can be misleading if the split happened to be lucky or unlucky. K-fold cross-validation rotates the held-out fold across the whole dataset and averages the results, giving a more reliable estimate of generalization. Use cross_val_score to confirm that your fix actually improved performance and was not just noise. Low variance across the folds suggests the estimate is stable and not driven by one particular slice of the data, although it does not by itself rule out overfitting; keep comparing against the training score.
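One way to check whether a fix helped is to score the model before and after on the same folds and compare fold by fold. The sketch below assumes the "fix" is limiting a decision tree's depth; the data, the depth value, and the choice of R2 are illustrative.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy sine data (assumption for illustration)
rng = np.random.RandomState(0)
X = rng.uniform(0, 1, 200).reshape(-1, 1)
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.25, X.shape[0])

# A fixed random_state gives both models identical folds
cv = KFold(n_splits=5, shuffle=True, random_state=0)

before = cross_val_score(
    DecisionTreeRegressor(random_state=0), X, y, cv=cv, scoring="r2"
)
after = cross_val_score(
    DecisionTreeRegressor(max_depth=4, random_state=0), X, y, cv=cv, scoring="r2"
)

# Paired per-fold differences: positive means the constrained tree did better
diff = after - before
print(f"before: {before.mean():.3f} +/- {before.std():.3f}")
print(f"after:  {after.mean():.3f} +/- {after.std():.3f}")
print("per-fold change:", np.round(diff, 3))
```

If the per-fold changes are mixed in sign or small relative to the spread, the improvement may be noise.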
Identifying the Problem: Two Reliable Methods
Method 1: Training vs Test Error Analysis
The most direct method is to measure error on the training set and the test set and read the relationship between them. There are exactly three outcomes worth distinguishing, and each maps to a clear diagnosis:
- Small gap and high errors on both means underfitting. The model performs about equally badly on data it has seen and data it has not. It never learned the pattern in the first place.
- Large gap with low training error and high test error means overfitting. The model nailed the training data but cannot transfer that to new inputs.
- Small gap with low errors on both means you have found a well-fit, generalizing model, which is the goal.
Method 2: Learning Curves
Learning curves add a temporal and visual dimension that single numbers miss. They plot how training and validation performance evolve as you add more data or run more training iterations.
Underfit learning curves show both the training and validation curves flattening quickly at a high error level, sitting close to each other. The model has hit its capacity ceiling; more data will not rescue it. Overfit learning curves show the training curve dropping to near-zero error while the validation curve stays elevated. The large, persistent gap between them is the clearest visual evidence that the model is memorizing the training set rather than learning a generalizable rule.
| Aspect | Underfitting | Overfitting |
|---|---|---|
| Definition | Model lacks the complexity to capture the data's patterns | Model memorizes training data instead of learning general patterns |
| Training accuracy | Low | Very high (looks great) |
| Test accuracy | Low | Low (horrible) |
| Train-test gap | Small (both bad) | Large and growing |
| Bias-variance | High bias | High variance |
| Learning curve shape | Both curves flatten quickly at high error | Training drops near zero; validation stays high |
| First fixes to try | More complexity, more features, train longer | Regularization, early stopping, more data |
Root Causes of Each Problem
Knowing the symptom is only half the battle. To fix the problem permanently you need to understand what causes it. The root causes of underfitting and overfitting are essentially mirror images of each other.
What Causes Underfitting
Model oversimplification is the most common cause — applying a linear model to a relationship that is genuinely curved, for example. Insufficient or wrong features starve the model of the information it needs; if the predictive signal is not in the inputs, no amount of training will find it. Inadequate training duration can also cause underfitting: a model stopped before it converged has not yet learned the patterns that are within its reach.
What Causes Overfitting
Excessive model complexity relative to the data gives the model enough capacity to memorize. Too many features create spurious correlations that the model latches onto as if they were real signal. Limited training data makes memorization easy — with few examples, the model can simply store them. And prolonged training beyond the point of pattern convergence pushes the model to keep optimizing on the training set long after it has learned everything generalizable, at which point it starts fitting noise.
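The effect of too many features on too little data can be seen even when there is no signal at all. In the sketch below, both the features and the target are pure random noise (a deliberately artificial assumption), yet an ordinary linear regression still reports a high training score.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
n_train, n_test, n_features = 40, 1000, 35

# Features and targets are independent noise: there is nothing to learn
X_train = rng.normal(size=(n_train, n_features))
X_test = rng.normal(size=(n_test, n_features))
y_train = rng.normal(size=n_train)
y_test = rng.normal(size=n_test)

model = LinearRegression().fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)  # high: 35 weights fit 40 points
test_r2 = model.score(X_test, y_test)     # negative: worse than predicting the mean

print(f"train R2 = {train_r2:.2f}, test R2 = {test_r2:.2f}")
```

With nearly as many weights as training points, the model can explain noise; the held-out score exposes that.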
Relying Solely on Training Accuracy Hides Overfitting
A 99% training accuracy tells you nothing about generalization on its own — an overfit model that memorized the training set will report exactly that number while failing completely on new data. Always evaluate on a separate validation or test set the model never saw during training. Judging a model by its training score alone is the single most common beginner mistake, and it makes overfitting completely invisible until the model fails in production.
The Bias-Variance Tradeoff
The bias-variance tradeoff is the theoretical framework that unifies overfitting and underfitting. Under squared-error loss, a model's expected error decomposes into squared bias, variance, and irreducible noise, and the first two are exactly the quantities that underfitting and overfitting push around.
High bias corresponds to underfitting. A high-bias model makes rigid, oversimplified assumptions about the data. It produces predictions that are consistent across different training samples but consistently wrong — it misses the real relationship the same way every time. A straight line fit to curved data is the canonical high-bias model.
High variance corresponds to overfitting. A high-variance model is extremely sensitive to the specific training data it saw. Train it on two slightly different samples and you get two very different models, producing highly variable predictions on similar inputs. It chases the noise instead of the signal.
Key Insight: The Tradeoff Is a Shift, Not an Elimination
You cannot fully eliminate either bias or variance — you can only shift the balance between them. Reducing bias by adding model complexity tends to increase variance, and reducing variance by constraining the model tends to increase bias. The goal of diagnosing overfitting vs underfitting is not to drive one term to zero, but to find the sweet spot where their combined contribution to total error is minimized. Understanding that you are trading, not eliminating, is what turns model tuning from random guessing into a principled search.
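Bias and variance can be estimated empirically when the true function is known. The sketch below assumes a sine target with Gaussian noise, refits polynomials of several degrees on many freshly drawn training sets, and measures how far the average prediction is from the truth (squared bias) and how much predictions vary between fits (variance). All constants are illustrative.

```python
import numpy as np
from numpy.polynomial import polynomial as P

rng = np.random.default_rng(0)

def true_f(x):
    # Assumed ground-truth function for this simulation
    return np.sin(2 * np.pi * x)

x_grid = np.linspace(0.05, 0.95, 50)  # points where predictions are compared
noise_sd, n_train, n_repeats = 0.25, 30, 200

def bias_variance(degree):
    preds = np.empty((n_repeats, x_grid.size))
    for i in range(n_repeats):
        # Draw a fresh training set and refit the polynomial
        x = rng.uniform(0, 1, n_train)
        y = true_f(x) + rng.normal(0, noise_sd, n_train)
        coefs = P.polyfit(x, y, degree)
        preds[i] = P.polyval(x_grid, coefs)
    mean_pred = preds.mean(axis=0)
    bias_sq = np.mean((mean_pred - true_f(x_grid)) ** 2)
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

results = {d: bias_variance(d) for d in (1, 3, 9)}
for d, (b, v) in results.items():
    print(f"degree {d}: bias^2 = {b:.4f}, variance = {v:.4f}")
```

Low degrees should show the larger squared bias and high degrees the larger variance, which is the shift the section describes.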
Fixes by Problem
Once you have diagnosed the problem and identified whether it is bias or variance, the remediation strategy follows directly. The table below maps each problem to its causes and the fixes that address them, in roughly the order you should try them.
| Problem | Cause | Recommended Fixes |
|---|---|---|
| Underfitting (high bias) | Model too simple; missing or weak features; training stopped too early | Increase model complexity or flexibility; add meaningful features; train longer; reduce regularization constraints |
| Overfitting (high variance) | Model too complex; too many features; too little data; trained too long | Apply L1/L2 regularization and early stopping first; expand dataset; reduce complexity; add dropout (neural nets); use cross-validation |
| Spurious correlations | Too many features create patterns that are noise, not signal | Feature selection; L1 regularization (drives some weights to exactly zero); dimensionality reduction |
| Unstable across data slices | High variance; results depend heavily on the particular split | K-fold cross-validation; more training data; ensemble methods |
Remediation Strategies in Depth
Addressing Underfitting
Underfitting means the model needs more capacity or better information. Increase model complexity or flexibility — move from a linear model to a polynomial or tree-based model, or add layers and units to a neural network. Add meaningful features that carry more of the predictive signal, including engineered interaction or polynomial terms. Extend training duration so the model has time to converge. And reduce regularization constraints if you over-penalized the model into rigidity.
Addressing Overfitting
Overfitting means the model needs constraint or more data. Expand the dataset — more examples make memorization harder and generalization easier, and data augmentation can multiply effective dataset size. Apply L1/L2 regularization to penalize large weights and keep the model simple; L1 also performs implicit feature selection by driving some weights to exactly zero. Reduce model complexity or parameters. Implement cross-validation to get honest performance estimates. Use dropout for neural networks to prevent co-adaptation of neurons. And apply early stopping the moment validation performance plateaus. Regularization and early stopping are usually the first things to try because they require almost no extra data or compute.
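The difference between L1 and L2 penalties can be illustrated with scikit-learn's Lasso and Ridge. In this sketch only 3 of 20 synthetic features carry signal; the data and the alpha values are assumptions picked for illustration, not tuned settings.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.normal(size=(100, 20))

# Only the first three features influence the target
true_w = np.zeros(20)
true_w[:3] = [3.0, -2.0, 1.5]
y = X @ true_w + rng.normal(0, 0.5, 100)

# Penalties act on weight size, so put features on a common scale first
X = StandardScaler().fit_transform(X)

lasso = Lasso(alpha=0.1).fit(X, y)  # L1 penalty
ridge = Ridge(alpha=1.0).fit(X, y)  # L2 penalty

n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
print(f"Lasso zeroed {n_zero_lasso} of 20 weights")
print(f"Ridge zeroed {n_zero_ridge} of 20 weights")
```

Lasso sets many of the uninformative weights to exactly zero, while Ridge shrinks weights without removing them.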
Model-Specific Manifestations
Overfitting and underfitting wear different clothes depending on the model. Recognizing the model-specific signature helps you spot the problem faster and choose a fix tailored to the algorithm you are using.
| Model Type | How It Underfits | How It Overfits |
|---|---|---|
| Linear Models | Linear assumptions applied to curved or nonlinear data | Too many polynomial or interaction terms fitting the noise |
| Decision Trees | Shallow splits that stop before capturing the structure | Excessive depth creating leaves that hold a single training sample |
| Neural Networks | Network too small or training stopped prematurely | Millions of parameters memorize the training set without dropout or regularization |
Two Classic Practical Examples
Polynomial Regression on a Noisy Sine Wave
Imagine data sampled from a sine wave with a little random noise added. A degree-1 polynomial (a straight line) underfits badly — it cannot bend to follow the curve at all, producing high error everywhere. A degree-3 polynomial captures the actual underlying shape of the sine wave beautifully, with low error on both training and test data. A degree-15 polynomial overfits dramatically: it oscillates wildly to pass through every noisy training point, achieving near-zero training error while making absurd predictions between the points. This one example contains the entire spectrum from underfitting through optimal fit to overfitting.
Decision Tree Depth
Plotting error against tree depth produces the textbook bias-variance picture. As depth increases, training error decreases continuously — a deep enough tree can memorize the training set perfectly. Test error initially drops then rises again: shallow trees underfit, very deep trees overfit, and the optimal depth sits at the bottom of the U-shaped test-error curve where the two competing effects balance. Tuning max_depth is one of the most concrete, visual ways to experience the tradeoff in practice.
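The depth sweep described above can be sketched in a few lines. The noisy sine data and the list of depths are illustrative assumptions; the errors are printed rather than plotted.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic noisy sine data (assumption for illustration)
rng = np.random.RandomState(0)
X = rng.uniform(0, 1, 300).reshape(-1, 1)
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.25, X.shape[0])
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

results = []
for depth in [1, 2, 3, 5, 8, None]:  # None lets the tree grow until leaves are pure
    tree = DecisionTreeRegressor(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, tree.predict(X_train))
    test_mse = mean_squared_error(y_test, tree.predict(X_test))
    results.append((depth, train_mse, test_mse))
    print(f"max_depth={str(depth):>4} | train MSE={train_mse:.3f} | test MSE={test_mse:.3f}")
```

Training error can only fall as depth grows; the depth with the lowest test error is the one to keep.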
Python Diagnostic with scikit-learn: Step by Step Guide
The following code shows a compact diagnostic workflow. We generate noisy nonlinear data, fit models of varying complexity, and score each on both training and test sets so the train-test gap is visible directly. Install the dependencies first with pip install scikit-learn numpy.
# pip install scikit-learn numpy
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# 1. Generate a noisy sine wave (true pattern + random noise)
rng = np.random.RandomState(42)
X = np.sort(rng.uniform(0, 1, 80)).reshape(-1, 1)
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.25, X.shape[0])

# 2. Hold out a test set the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 3. Fit models of increasing complexity and score train vs test
print(f"{'Degree':>6} | {'Train R2':>9} | {'Test R2':>9} | Diagnosis")
print("-" * 55)
for degree in [1, 3, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_score = model.score(X_train, y_train)  # R2 on training data
    test_score = model.score(X_test, y_test)     # R2 on held-out data
    gap = train_score - test_score
    # Rule-of-thumb thresholds; adjust them to your metric and problem
    if train_score < 0.5 and test_score < 0.5:
        diagnosis = "UNDERFIT (high bias)"
    elif gap > 0.2:
        diagnosis = "OVERFIT (high variance)"
    else:
        diagnosis = "GOOD FIT"
    print(f"{degree:>6} | {train_score:>9.3f} | {test_score:>9.3f} | {diagnosis}")

# Illustrative output (schematic values; exact numbers depend on the data and seed):
# Degree | Train R2 | Test R2 | Diagnosis
# -------------------------------------------------------
#      1 |     0.28 |    0.21 | UNDERFIT (high bias)
#      3 |     0.92 |    0.88 | GOOD FIT
#     15 |     0.99 |    0.31 | OVERFIT (high variance)
Read the output as a story. In a run like the illustrative one above, degree 1 has low scores on both sets and a tiny gap, which is classic underfitting. Degree 3 has high scores on both with a small gap, so the model generalizes. Degree 15 has a near-perfect training score but a collapsed test score and a huge gap, which is textbook overfitting. Exact scores vary with the data and random seed, so apply the same reading to your own numbers. The same model.score call on train versus test, repeated across complexity, exposes the full diagnosis.
The next snippet uses scikit-learn's validation_curve to sweep a regularization strength and cross_val_score to check the chosen setting, combining a regularization sweep with cross-validation in one diagnostic.
# pip install scikit-learn numpy
import numpy as np
from sklearn.model_selection import KFold, validation_curve, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge

# Same noisy sine data
rng = np.random.RandomState(42)
X = np.sort(rng.uniform(0, 1, 120)).reshape(-1, 1)
y = np.sin(2 * np.pi * X.ravel()) + rng.normal(0, 0.25, X.shape[0])

# A flexible (degree-12) model that WILL overfit without regularization.
# Ridge's alpha is the L2 penalty strength: higher alpha = more regularization.
pipe = make_pipeline(
    PolynomialFeatures(12), StandardScaler(), Ridge()
)
alphas = np.logspace(-5, 2, 8)  # from almost no penalty to strong penalty

# X is sorted, so shuffle the folds. Unshuffled folds would each hold out a
# contiguous range of X and force the polynomial to extrapolate.
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# validation_curve scores train vs validation across each alpha (5-fold CV)
train_scores, val_scores = validation_curve(
    pipe, X, y,
    param_name="ridge__alpha", param_range=alphas,
    cv=cv, scoring="r2"
)
print(f"{'alpha':>10} | {'Train R2':>9} | {'Val R2':>9}")
print("-" * 36)
for a, tr, va in zip(alphas, train_scores.mean(1), val_scores.mean(1)):
    print(f"{a:>10.5f} | {tr:>9.3f} | {va:>9.3f}")

# Low alpha -> high train R2, low val R2 (OVERFIT).
# Too-high alpha -> both drop (UNDERFIT).
# The best alpha maximizes validation R2: the regularization sweet spot.
best_alpha = alphas[np.argmax(val_scores.mean(1))]
print(f"\nBest alpha (max val R2): {best_alpha:.5f}")

# Check the chosen setting with k-fold cross-validation
pipe.set_params(ridge__alpha=best_alpha)
cv_scores = cross_val_score(pipe, X, y, cv=cv, scoring="r2")
print(f"5-fold CV R2: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}")
# Low std across folds = scores are consistent from one split to the next.
This second diagnostic shows regularization in action. With a tiny alpha, the degree-12 model tends to overfit: high training score, weaker validation score. With too large an alpha, it underfits and both scores drop. The best alpha is the one that maximizes validation performance, and a low cross-validation standard deviation suggests the chosen model is stable across data slices rather than tuned to a single lucky split. Because alpha was selected on this same data, treat the final cross-validation score as slightly optimistic and keep a separate test set for the last check.
Common Diagnostic Mistakes to Avoid
Most overfitting that ships to production was not detected because of a handful of avoidable diagnostic mistakes. Watch for these:
- Relying solely on training accuracy. A high training score is meaningless without a held-out comparison; it is exactly what an overfit model produces.
- Neglecting separate validation datasets. Tuning on the test set leaks information and inflates your estimate of generalization.
- Assuming greater complexity ensures better performance. Beyond a point, more capacity hurts by inviting overfitting; bigger is not automatically better.
- Treating random fluctuations as signal. Small differences in validation score between runs are often noise, not real improvement, so confirm with cross-validation before acting on them.
Frequently Asked Questions
How do I know if my model is overfitting or underfitting?
Compare training and test performance. If both scores are low and close together, the model is underfitting (high bias). If the training score is high but the test score is low — a large, growing gap — the model is overfitting (high variance). If both scores are high and close, the model fits well. The train-test gap is the single most reliable signal, and learning curves make the pattern even clearer.
What is the difference between overfitting and underfitting?
Underfitting means the model is too simple to capture the data's patterns, so both training and test accuracy are low. Overfitting means the model memorized the training data including its noise, so training accuracy looks great but test accuracy is poor. Underfitting is high bias; overfitting is high variance. They are opposite failures of generalization, and the fix for one is roughly the opposite of the fix for the other.
What is the fastest way to fix an overfitting model?
Try regularization and early stopping first — they are cheap, fast, and usually effective. Add an L1 or L2 penalty to discourage large weights, and stop training the moment validation performance stops improving. If the gap persists, gather more data, reduce model complexity, or add dropout for neural networks. Cross-validation helps confirm that your fix actually improved generalization rather than just shifting noise around.
Can a model overfit and underfit at the same time?
Not on the same data in the same way — they are opposite ends of the bias-variance spectrum. A single model is either too simple (underfit), too complex (overfit), or balanced. However, a model can underfit some regions of the input space while overfitting others, especially with uneven data. The bias-variance tradeoff means reducing one tends to increase the other, so you aim for the balance point, not the elimination of either.
Does more training data always reduce overfitting?
More data usually reduces overfitting because it makes memorization harder and forces the model to learn real patterns. But it does not help underfitting — if the model is too simple, more data just gives it more examples it still cannot fit. More data also will not save a model whose features lack the predictive signal. Match the remedy to the diagnosis: more data for high variance, more capacity or better features for high bias.
About the author
Founder & CEO, Braincuber Technologies
Founder and CEO of Braincuber. Has scoped and shipped 500+ Odoo, AI, and cloud projects for US mid-market and global brands. Takes every founder call personally — no SDR layer between buyers and the people building the system.
