How to Use Markov Chain Monte Carlo for Bayesian Inference
Markov Chain Monte Carlo (MCMC) is the engine that makes practical Bayesian inference possible. This complete tutorial is a step by step guide and beginner guide to understanding how MCMC works, why it is needed, and how to implement it in Python using PyMC. Whether you are encountering the term for the first time or want a deeper working understanding before applying it to real models, you will learn how MCMC generates samples from posterior distributions that are impossible to compute exactly, how to run four foundational algorithms — Metropolis, Metropolis-Hastings, Gibbs Sampling, and Hamiltonian Monte Carlo — and how to use convergence diagnostics including R-hat, effective sample size, and trace plots to verify that your sampler has actually found the right distribution. By the end of this tutorial you will be able to build, sample, and diagnose a Bayesian model end to end in Python.
What You'll Learn:
- What Markov Chain Monte Carlo is and why it is required for Bayesian posterior inference
- How Markov Chains and Monte Carlo methods combine to generate posterior samples
- The four core MCMC algorithms: Metropolis, Metropolis-Hastings, Gibbs Sampling, and HMC/NUTS
- How to define a Bayesian model and run the MCMC sampler in PyMC
- How to diagnose convergence using R-hat, trace plots, autocorrelation, and effective sample size
- Common MCMC mistakes and how to avoid them
- When to use MCMC versus variational inference and other alternatives
What Is Markov Chain Monte Carlo?
Markov Chain Monte Carlo is a family of algorithms that generate samples from complex probability distributions through simulation rather than exact mathematical computation. The name combines two foundational ideas that work together to make this possible: Markov Chains and Monte Carlo methods. Understanding each component separately makes the combined technique far more intuitive.
In the context of Bayesian statistics, MCMC is used specifically to sample from a posterior distribution — the updated belief about model parameters after observing data. The posterior is what Bayesian inference is all about, but computing it exactly requires solving an integral that is almost always intractable. MCMC sidesteps this entirely by constructing a random process that, if run long enough, produces samples whose histogram matches the posterior distribution you want.
Markov Chain Sampling
A Markov Chain is a sequence of random states where each state depends only on the current state, not the entire history — a property called memorylessness. In MCMC, the chain wanders through parameter space according to transition rules specifically designed so that it spends more time in high-probability regions. After enough steps the chain forgets its starting point and its distribution converges to the target posterior.
Monte Carlo Estimation
Monte Carlo methods use random sampling to estimate quantities that are hard to compute directly. Instead of solving an integral analytically, you draw many samples from a distribution and compute the average of a function over those samples. Applied to Bayesian posteriors, this means you can estimate posterior means, standard deviations, and credible intervals from thousands of MCMC samples without ever solving the underlying integral.
Bayesian Posterior Inference
Bayesian inference updates a prior belief about parameters using observed data to produce a posterior distribution. The posterior encodes everything known about parameter values after seeing the data. Computing it exactly requires integrating over all possible parameter values — an integral that grows exponentially harder with model dimensionality. MCMC evaluates only the unnormalized posterior at any given point and lets the chain converge to do the rest.
Convergence Diagnostics
Convergence diagnostics verify that the MCMC chain has stopped being influenced by its starting point and is genuinely sampling from the target posterior. Key tools include the R-hat statistic (comparing multiple chains), trace plots (visual inspection of parameter paths), autocorrelation analysis, and effective sample size. Never trust MCMC results without checking these metrics — a chain can look productive while still exploring the wrong region of parameter space.
Why MCMC Is Needed: The Intractable Integral Problem
Bayes' theorem gives us the posterior distribution as proportional to the prior times the likelihood. The problem is the denominator — the marginal likelihood or evidence — which requires integrating the product of prior and likelihood over every possible parameter value. For a model with two parameters this is a two-dimensional integral. With ten parameters it is ten-dimensional, and with a neural network it can have millions of dimensions.
High-dimensional integration is computationally intractable. Traditional numerical methods like grid approximation require evaluating the function at exponentially many grid points as dimensions increase. For even moderately complex models, exact computation is simply not feasible on any current or foreseeable hardware.
MCMC elegantly avoids this problem. It does not need to compute the normalizing constant. The acceptance decisions in algorithms like Metropolis-Hastings use only the ratio of unnormalized posteriors — the normalizing constant cancels out. This means MCMC can work with any model where you can evaluate the prior and likelihood pointwise, regardless of how complicated the posterior surface is or how many dimensions the parameter space has.
Never Trust MCMC Results Without Checking R-hat and Trace Plots
A chain that has not converged will produce samples from the wrong distribution — potentially far from the true posterior. Always verify R-hat values below 1.01 across all parameters and inspect trace plots for stationarity before drawing any conclusions from MCMC output. Skipping diagnostics is the single most dangerous mistake in applied Bayesian inference.
How the Core MCMC Algorithm Works: 5 Steps
At the heart of every MCMC algorithm is the same five-step loop. Different algorithms differ in how they implement the proposal mechanism in step two, but the overall structure is universal. Understanding this loop is the foundation for understanding any specific MCMC algorithm.
Understand Markov Chains and Monte Carlo
Before writing any code, internalize the two components. A Markov Chain is a random process where each step depends only on the current position, not the full history. This memoryless property is what allows the chain to eventually become independent of its starting point and reflect only the target distribution. Monte Carlo integration estimates integrals by averaging function values over many random samples. MCMC combines them: a Markov Chain generates the samples, and Monte Carlo integration uses those samples to compute posterior summaries. The key insight is that you can construct a chain whose stationary distribution is exactly the posterior you want — without ever computing the normalizing constant.
Set Up PyMC Model
Install PyMC with pip install pymc. Define your model inside a with pm.Model() context manager. Every distribution you declare inside this block automatically becomes part of the model graph. Specify your prior — your belief about parameter values before seeing data — using distributions like pm.Beta, pm.Normal, or pm.Exponential. Then declare your likelihood — how the data was generated given those parameters — using an observed argument that ties the distribution to your actual data array. PyMC handles the automatic differentiation and model structure automatically from this context.
Run the MCMC Sampler
Call pm.sample(draws, tune=tune_steps, return_inferencedata=True) inside the model context. The draws parameter controls how many posterior samples are collected per chain. The tune parameter controls the burn-in period — the early samples discarded while the sampler adapts its step size and learns the geometry of your posterior. PyMC defaults to the NUTS (No-U-Turn Sampler) algorithm, which is Hamiltonian Monte Carlo with automatic tuning. Running multiple chains from different starting points is standard practice: it lets you detect convergence failures that a single chain would miss. Use at least 2 chains; 4 is standard for production models.
Diagnose Convergence
Before trusting any results, run diagnostics using the ArviZ library. Call az.plot_trace(trace) to generate trace plots — visual inspection of parameter values across iterations. A healthy trace looks like white noise around a stable mean with no trends, flat stretches, or slow drift. Call az.summary(trace) to get numeric diagnostics: check that R-hat is at or below 1.01 for every parameter, and that ESS (effective sample size) is above 200. High autocorrelation or R-hat values above 1.05 indicate convergence failure. Investigate and re-run before proceeding.
Interpret Posterior Results
Once convergence is confirmed, extract posterior summaries from the trace object. The posterior mean is the single-number estimate of each parameter. The standard deviation quantifies uncertainty. The 94% highest density interval (HDI) is the Bayesian analog of a confidence interval — the shortest interval containing 94% of the posterior probability mass. Call az.plot_posterior(trace) for a visual summary of the posterior distribution. For the coin-flip example with 7 heads in 10 flips, a well-run sampler produces a posterior mean of approximately 0.642 with SD of 0.124 and R-hat of 1.00, confirming that the chain converged and the posterior correctly encodes the observed evidence.
The Four Core MCMC Algorithms
Different MCMC algorithms differ primarily in how they propose new states (step 2 of the core loop). The choice of algorithm has a massive impact on efficiency, especially as model complexity grows. Understanding the strengths and weaknesses of each algorithm helps you pick the right tool for a given problem.
Metropolis Algorithm
The Metropolis algorithm is the simplest MCMC algorithm and the historical foundation of the field. It proposes a new state by adding random noise (typically from a symmetric distribution like a Gaussian) to the current parameter values. If the proposed state has higher posterior probability than the current state, it is always accepted. If the proposed state has lower probability, it is accepted with probability equal to the ratio of the proposed to current probability — meaning even downhill moves are sometimes accepted. This stochastic acceptance is what allows the chain to escape local modes and properly explore the full posterior.
The Metropolis algorithm works well for low-dimensional models but breaks down as models grow. Its symmetric proposal distribution means it cannot adapt to the shape of the posterior, and in high dimensions almost every proposal lands in a low-probability region. The rejection rate climbs toward 100%, and the effective sample rate collapses. For any model with more than a handful of parameters, more sophisticated algorithms are needed.
Metropolis-Hastings Algorithm
Metropolis-Hastings generalizes Metropolis by allowing asymmetric proposal distributions. The acceptance probability is adjusted by the ratio of proposal probabilities in both directions, which corrects for the asymmetry and ensures the chain still converges to the correct stationary distribution. This correction term is the Hastings correction factor.
Metropolis-Hastings is the foundation for most modern MCMC extensions. The proposal distribution can now be any valid probability distribution, not just a symmetric Gaussian. This flexibility enables adaptive proposals that learn the local geometry of the posterior during sampling. However, manually tuning the proposal distribution for a complex model remains challenging, which is why gradient-based methods have become dominant in practice.
Gibbs Sampling
Gibbs Sampling takes a fundamentally different approach: instead of proposing a new joint state for all parameters, it updates each parameter one at a time by sampling from its full conditional distribution — the distribution of that parameter given all other parameters at their current values. Crucially, Gibbs sampling has no accept/reject step. Every proposed value is accepted because it is drawn directly from the correct conditional distribution.
Gibbs sampling is highly effective when the full joint posterior is intractable but the conditional distributions for each variable are tractable and easy to sample from — a property that holds for many conjugate models. It is the algorithm underlying many classic Bayesian hierarchical models. Its weakness is that it updates parameters sequentially, which can lead to slow mixing when parameters are strongly correlated. The chain may get stuck moving in the correlated directions and take a very long time to explore the full posterior.
Hamiltonian Monte Carlo and NUTS
Hamiltonian Monte Carlo (HMC) is a quantum leap in MCMC efficiency for continuous parameter spaces. Instead of proposing nearby states through random perturbations, HMC uses the gradient of the log-posterior to simulate Hamiltonian dynamics — the physics of a particle rolling along a potential energy surface defined by the negative log-posterior. By following the gradient, HMC can propose states that are far from the current position but still have high probability, dramatically reducing the number of rejected proposals and the correlation between successive samples.
HMC requires computing gradients, which means the model must be differentiable. This is not a limitation in practice for most statistical models, and modern probabilistic programming frameworks like PyMC and Stan use automatic differentiation to compute these gradients automatically. HMC requires setting two tuning parameters: the step size and the number of leapfrog steps. Choosing these manually is difficult, which historically limited adoption.
The No-U-Turn Sampler (NUTS) solves the tuning problem automatically. NUTS detects when the simulated trajectory starts to double back on itself (making a U-turn) and stops the simulation at that point, dynamically choosing the optimal number of leapfrog steps for each sample. Combined with dual-averaging for step size adaptation during warm-up, NUTS makes HMC fully automatic. PyMC uses NUTS as its default sampler. For any complex, high-dimensional model, NUTS is almost always the best choice.
| Algorithm | Method | Best For | Limitation |
|---|---|---|---|
| Metropolis | Random walk with symmetric proposal; accept/reject based on probability ratio | Learning MCMC fundamentals; very simple low-dimensional models | Inefficient in high dimensions; high rejection rates; cannot adapt to posterior geometry |
| Metropolis-Hastings | Generalizes Metropolis with asymmetric proposals corrected by Hastings factor | Custom proposal distributions; foundation for many modern extensions | Still requires manual proposal tuning; no automatic adaptation to posterior shape |
| Gibbs Sampling | Updates each variable sequentially from its full conditional distribution; no accept/reject | Conjugate models where full conditionals are analytically tractable | Slow mixing with correlated parameters; sequential updates cause dependency |
| HMC / NUTS | Gradient-based Hamiltonian dynamics propose distant high-probability states; NUTS auto-tunes | Complex, high-dimensional continuous models; production Bayesian inference | Requires differentiable model; computationally expensive per step (but far fewer steps needed) |
Three Critical Diagnostic Concepts: Burn-In, Convergence, and Mixing
Understanding what can go wrong with MCMC is just as important as understanding how it works. Three concepts capture the most common failure modes: burn-in contamination, failure to converge, and poor mixing. Each has a distinct cause and a distinct diagnostic signature.
Burn-In
When the MCMC chain starts, its initial position in parameter space is typically chosen arbitrarily — often by random initialization or an optimization-based starting point. The early samples from the chain reflect this arbitrary starting position rather than the target posterior distribution. This initial transient phase, where the chain is moving toward but has not yet reached the high-probability region of the posterior, is called burn-in.
Burn-in samples must be discarded before computing any posterior summaries. Including them would bias your estimates toward the arbitrary starting point. In PyMC, the tune parameter to pm.sample() controls the number of tuning steps that are automatically discarded. The default is 1000. If your trace plots show the chain starting in an unusual region and taking many iterations to find the bulk of the posterior, you may need to increase this value.
The visual diagnostic for burn-in is straightforward: in a trace plot, look for an initial phase where the parameter values are moving systematically in one direction before settling into a stationary regime. If you see this, the burn-in period was too short and you should re-run with a larger tune value.
Convergence
Convergence means the chain has reached a state where it is sampling from the target posterior distribution rather than being influenced by its starting point. Technically, a chain converges when its distribution matches the stationary distribution of the Markov chain — which is designed to be the posterior.
Convergence cannot be proven definitively from a finite chain. It can only be assessed diagnostically. The standard approach is to run multiple chains from different, widely separated starting points and compare them. If all chains are exploring the same region of parameter space and producing similar posterior estimates, this provides strong evidence of convergence. If different chains are exploring different regions, convergence has failed.
The Gelman-Rubin R-hat statistic formalizes this multi-chain comparison. It compares the variance within each chain to the variance across chains. When all chains have converged to the same distribution, within-chain variance and across-chain variance are equal, and R-hat equals 1.0. Values above 1.01 indicate that the chains are still exploring different regions. Values above 1.05 indicate a serious convergence problem that must be resolved before using the results.
Mixing
Mixing describes how effectively the chain explores the target distribution from step to step. A well-mixing chain moves freely throughout the high-probability region of the posterior, taking large steps that are relatively uncorrelated with each other. A poorly-mixing chain gets stuck in local regions, moves in small steps, and generates highly autocorrelated samples.
Poor mixing is dangerous because it produces samples that look like many observations but are actually providing little new information — they are all near the same point. The effective sample size (ESS) measures this: it is the number of approximately independent samples in your chain, accounting for autocorrelation. A chain of 4000 samples with high autocorrelation might have an ESS of only 50, meaning it is equivalent to just 50 independent draws from the posterior.
Target at least 200 effective samples per parameter. If ESS is below this threshold, either increase the number of draws, switch to a more efficient algorithm (NUTS if not already using it), or reparameterize the model to reduce posterior correlations. High-correlation posteriors are the most common cause of poor mixing.
Evaluating MCMC Results: Diagnostic Metrics in Detail
Trace Plots
A trace plot displays the sampled value of a parameter on the vertical axis and the iteration number on the horizontal axis. In ArviZ, az.plot_trace(trace) shows the trace alongside the estimated posterior density for each parameter.
A healthy trace looks like white noise — random fluctuations around a stable mean with no discernible trend, no flat stretches, and no slow drift in one direction. Multiple chains should overlap heavily and be indistinguishable from each other in the trace plot.
Red flags in trace plots: a visible trend (chain still moving toward the posterior), flat stretches (chain is stuck, likely due to rejected proposals), slow drift (autocorrelated samples indicating poor mixing), or chains that occupy different regions (convergence failure). Any of these signals requires investigation before using the results.
Autocorrelation
Autocorrelation measures the correlation between samples separated by a given lag. At lag 1, it measures how similar consecutive samples are. At lag 10, how similar samples ten steps apart are. Ideal MCMC samples have zero autocorrelation at all lags greater than zero. In practice, some autocorrelation is always present, but high autocorrelation at large lags indicates poor mixing.
View autocorrelation with az.plot_autocorr(trace). Bars should drop to near zero quickly — ideally within the first few lags. Bars that remain elevated at lag 10 or beyond are a warning sign. High autocorrelation directly translates to a lower effective sample size.
Effective Sample Size
The effective sample size (ESS) is the number of independent samples that would provide the same amount of information as your actual autocorrelated chain. It is always less than or equal to the actual number of draws. ArviZ reports two ESS variants: bulk ESS (for the main body of the distribution) and tail ESS (for the tails, relevant for credible interval estimation).
The recommended threshold is at least 200 effective samples per parameter, and ideally 400 or more if you need accurate tail estimates. If bulk ESS is high but tail ESS is low, your chain is exploring the center of the posterior well but not the extremes — credible intervals may be unreliable. This is common in heavy-tailed distributions.
R-hat (Gelman-Rubin Statistic)
R-hat (also written as Rhat or the potential scale reduction factor) is the primary diagnostic for convergence when running multiple chains. It is defined as the ratio of the estimated total variance (combining between-chain and within-chain variance) to the mean within-chain variance. When chains have converged to the same distribution, this ratio equals 1.0.
The standard threshold is R-hat below 1.01 for all parameters. Values between 1.01 and 1.05 are a warning — the chains may have converged but the evidence is weak, and more draws are recommended. Values above 1.05 indicate definite convergence failure. The root cause is typically a multimodal posterior, strong posterior correlations, a pathological model specification, or an insufficient tuning period. Address the underlying cause rather than simply running longer chains.
| Metric | What It Measures | Good Values |
|---|---|---|
| Trace Plot | Visual path of parameter values across iterations | White noise around stable mean; multiple chains overlapping; no trends or flat stretches |
| Autocorrelation | Correlation between samples separated by a given number of steps | Drops close to zero within the first few lags (lag 5-10) |
| Effective Sample Size (ESS) | Number of equivalent independent draws, accounting for autocorrelation | 200+ per parameter (400+ for tail estimates) |
| R-hat (Gelman-Rubin) | Ratio of total variance to within-chain variance across multiple chains | At or below 1.01; warning at 1.01-1.05; failure above 1.05 |
Python Implementation with PyMC: Step by Step Guide
The following three code examples walk through a complete Bayesian inference workflow in Python using PyMC. We use a coin-flip model as the running example: given ten coin flips with seven heads, we want to infer the posterior distribution over the coin's bias (probability of heads). This is a simple model with a known analytical solution, which makes it ideal for verifying that the MCMC sampler is working correctly.
Install the required libraries before running any of the code below. PyMC requires Python 3.9 or later and installs ArviZ automatically as a dependency:
# Install first: pip install pymc
import pymc as pm
import numpy as np
# Our observed data: 1 = heads, 0 = tails (7 heads in 10 flips)
observed_flips = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])
# Define the model inside a context manager
with pm.Model() as coin_model:
# Prior: Beta(2, 2) is a weakly informative prior centered at 0.5
# It expresses a belief that the coin is probably fair but allows for bias
bias = pm.Beta("bias", alpha=2, beta=2)
# Likelihood: each flip is a Bernoulli trial with success probability = bias
# The observed= argument ties this distribution to our actual data
flips = pm.Bernoulli("flips", p=bias, observed=observed_flips)
# At this point the model graph is defined but no sampling has occurred.
# PyMC has automatically constructed the joint log-probability function
# and is ready to compute gradients for NUTS sampling.
The model block defines the complete generative story: the prior on the coin bias and the likelihood of the observed flips given that bias. PyMC builds a computational graph from this specification that supports automatic differentiation, enabling NUTS to compute the gradient of the log-posterior at any point in parameter space.
import arviz as az
# Run NUTS sampler: 2000 posterior draws + 1000 tuning/burn-in steps per chain
# PyMC defaults to 4 chains; return_inferencedata=True returns an ArviZ InferenceData object
with coin_model:
trace = pm.sample(
draws=2000, # Posterior samples to keep per chain
tune=1000, # Tuning steps (discarded burn-in)
return_inferencedata=True,
random_seed=42 # For reproducibility
)
# ── Convergence Diagnostics ──────────────────────────────────────────────────
# Trace plot: inspect parameter paths and marginal posterior density
az.plot_trace(trace, var_names=["bias"])
# Healthy: white noise around stable mean, chains overlapping
# Numeric summary: mean, SD, HDI, ESS, R-hat
summary = az.summary(trace, var_names=["bias"])
print(summary)
# Expected output (approximate):
# mean sd hdi_3% hdi_97% mcse_mean mcse_sd ess_bulk ess_tail r_hat
# bias 0.642 0.124 0.406 0.868 0.001 0.001 2847.0 2913.0 1.0
# R-hat = 1.00: converged
# ESS > 2000: excellent effective sample size
# Posterior mean 0.642: slightly above 0.5, consistent with 7/10 heads
The summary output for 7 heads in 10 flips confirms a well-behaved sampler: posterior mean of 0.642 (SD = 0.124), R-hat of 1.00, and effective sample size above 2000 per chain. The posterior is correctly pulled toward 0.7 (the observed frequency) but tempered by the prior, which provides mild regularization toward 0.5. The 94% HDI runs from approximately 0.41 to 0.87, reflecting the genuine uncertainty about a coin's bias from only ten flips.
# ── Complete MCMC Bayesian Inference Pipeline ────────────────────────────────
# pip install pymc arviz
import pymc as pm
import numpy as np
import arviz as az
# ── 1. Data ──────────────────────────────────────────────────────────────────
observed_flips = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1]) # 7 heads / 10 flips
# ── 2. Model Definition ──────────────────────────────────────────────────────
with pm.Model() as coin_model:
bias = pm.Beta("bias", alpha=2, beta=2) # Prior
flips = pm.Bernoulli("flips", p=bias, observed=observed_flips) # Likelihood
# ── 3. MCMC Sampling (NUTS by default) ──────────────────────────────────────
with coin_model:
trace = pm.sample(draws=2000, tune=1000, return_inferencedata=True, random_seed=42)
# ── 4. Convergence Diagnostics ───────────────────────────────────────────────
summary = az.summary(trace, var_names=["bias"])
print("Posterior summary:")
print(summary)
# Check R-hat programmatically
rhat = summary["r_hat"]["bias"]
ess = summary["ess_bulk"]["bias"]
print(f"R-hat: {rhat:.3f} (pass if <= 1.01)")
print(f"ESS: {ess:.0f} (pass if >= 200)")
if rhat > 1.05:
print("WARNING: Convergence failure — do not use these samples.")
elif rhat > 1.01:
print("WARNING: Marginal convergence — consider more draws.")
else:
print("Diagnostics passed: results are reliable.")
# ── 5. Visual Diagnostics ────────────────────────────────────────────────────
az.plot_trace(trace, var_names=["bias"]) # Trace + marginal density
az.plot_posterior(trace, var_names=["bias"]) # Posterior with HDI
az.plot_autocorr(trace, var_names=["bias"]) # Autocorrelation by lag
# ── 6. Extract Posterior Summaries ──────────────────────────────────────────
posterior_samples = trace.posterior["bias"].values.flatten()
print(f"Posterior mean: {posterior_samples.mean():.3f}")
print(f"Posterior std: {posterior_samples.std():.3f}")
print(f"94% HDI: {az.hdi(posterior_samples, hdi_prob=0.94)}")
# Posterior mean: 0.642 | Std: 0.124 | HDI: [0.406, 0.868]
Key Insight: MCMC Enables Practical Bayesian Inference
MCMC does not compute the posterior distribution — it samples from it. This distinction is what makes Bayesian inference computationally tractable for real-world models. By evaluating only the unnormalized posterior (prior times likelihood) at each proposed point, MCMC avoids the intractable normalizing integral entirely. The chain's stationary distribution is mathematically guaranteed to be the correct posterior, provided it has converged. Modern algorithms like NUTS make this guarantee practically achievable for models with hundreds or thousands of parameters, enabling Bayesian inference to scale from textbook coin-flip examples to complex hierarchical models in epidemiology, finance, and machine learning.
Common MCMC Mistakes and How to Fix Them
Most failures in applied Bayesian inference with MCMC come from the same small set of mistakes. Being aware of them before you encounter them saves significant debugging time.
Mistake 1: Premature Convergence Assumption
The most dangerous mistake is assuming that a chain has converged simply because the sampling completed without errors or because the results look plausible. PyMC always runs to completion regardless of whether convergence occurred. The only way to know whether convergence has been achieved is to check R-hat and examine trace plots. Make these checks an automatic part of every MCMC workflow, not an optional afterthought.
Specific warning signs: divergent transitions reported by PyMC during sampling (printed as "There were N divergences after tuning"), R-hat above 1.01, trace plots showing different chains exploring different regions, or bimodal posterior densities when the model should be unimodal.
Mistake 2: Ignoring Burn-In
Including burn-in samples in posterior summaries biases estimates toward the arbitrary starting point. Always verify that the tune parameter is large enough to cover the full warm-up phase. The default of 1000 tuning steps is usually sufficient for simple models but may be inadequate for complex hierarchical models or strongly correlated posteriors. Inspect the trace plot: if the beginning of the chain shows systematic movement toward the bulk of the posterior before settling into stationary behavior, increase tune.
Mistake 3: Poor Mixing from Wrong Algorithm
Using a random-walk algorithm (Metropolis or Metropolis-Hastings) for a high-dimensional continuous model will almost always result in poor mixing. The rejection rate climbs toward 100% and effective sample sizes collapse. Switch to NUTS. PyMC uses NUTS by default, but users can accidentally override this with step=pm.Metropolis(). For any continuous model with more than two or three parameters, NUTS is the correct default choice.
If poor mixing persists even with NUTS, the problem is usually a highly correlated posterior. Reparameterizing the model to decorrelate parameters — for example, using non-centered parameterizations for hierarchical models — is the most effective remedy. Increasing the number of draws is not a substitute for addressing the underlying correlation structure.
Mistake 4: Insufficient Samples and Low ESS
A common error is running 500 draws per chain and then trusting 95% credible intervals computed from those samples. With high autocorrelation, 500 nominal samples might represent fewer than 50 effective samples — far too few for reliable tail estimates. Always check ESS before reporting credible intervals. For tail-based quantities like 95% HDI, tail ESS is the relevant metric. Aim for at least 400 tail ESS. If ESS is low, first try reparameterization; if that does not help, increase draws.
MCMC vs Alternative Inference Methods
MCMC is not the only approach to Bayesian inference. Understanding when alternative methods are preferable helps you make informed architectural decisions for your modeling pipeline.
| Method | Speed | Accuracy | Best Use Case |
|---|---|---|---|
| Rejection Sampling | Slow; exponentially worse with dimensions | Exact (given enough samples) | Low-dimensional distributions with known upper bound; not practical for real models |
| Importance Sampling | Fast in low dimensions | Good in low-D; variance explodes with dimensionality | Low-dimensional posteriors; reweighting existing sample sets |
| Variational Inference (VI) | Very fast; scales to large datasets | Approximate; can miss posterior structure and underestimate variance | Large-scale models where speed matters more than exactness; real-time inference |
| MCMC (NUTS) | Slow; does not scale well to large N | Asymptotically exact; captures full posterior structure | Research, scientific inference, small-to-medium datasets where accuracy is critical |
Use MCMC when accuracy matters more than speed: scientific inference, clinical trials, small-to-medium datasets, hierarchical models, or any context where underestimating uncertainty has serious consequences. Use Variational Inference when scaling matters more than exactness: production machine learning systems, real-time applications, large datasets with millions of observations, or exploratory work where approximate results are acceptable as a first pass. For many practical workflows, the right approach is to develop and validate the model with MCMC and then switch to VI for production deployment if inference latency becomes a bottleneck.
Python Libraries for MCMC
Three libraries cover the full range of MCMC use cases in Python, from educational implementations to high-performance production systems.
PyMC is the most accessible option for most users. It provides a high-level model specification API, uses NUTS by default, integrates natively with ArviZ for diagnostics, and supports a wide range of distributions and model types. The context-manager syntax makes model definition intuitive, and automatic differentiation via Aesara (and PyTensor in newer versions) handles gradient computation transparently. PyMC is the best starting point for anyone learning Bayesian inference or building moderately complex models.
CmdStanPy and PyStan provide Python interfaces to Stan, which compiles models to high-performance C++. Stan's HMC/NUTS implementation is arguably the most mature and battle-tested available, with extensive diagnostics and robust handling of difficult posteriors. The compilation step adds overhead for development but results in faster sampling for complex models. Stan's modeling language has a steeper learning curve than PyMC but offers precise control over model specification.
NumPy from scratch is the right choice for learning. Implementing the Metropolis algorithm in pure NumPy — a proposal step, an acceptance test, a loop — provides an understanding of MCMC mechanics that no higher-level library can substitute. If you have never implemented an MCMC sampler manually, doing so once with a simple model is strongly recommended before using PyMC or Stan for production work.
Frequently Asked Questions
What is the difference between MCMC and variational inference for Bayesian inference?
MCMC generates samples from the true posterior distribution (asymptotically exact), while variational inference approximates the posterior with a simpler family of distributions by solving an optimization problem. MCMC is more accurate and captures complex posterior shapes including multimodality and heavy tails, but is slow and does not scale to large datasets. Variational inference is fast and scalable but introduces approximation error and can miss posterior structure. Use MCMC when accuracy is critical; use VI when speed and scale matter more.
How many MCMC samples do I need for reliable Bayesian inference?
The relevant metric is effective sample size (ESS), not the raw number of draws. Target at least 200 bulk ESS per parameter for posterior means and standard deviations, and at least 400 tail ESS for credible interval estimates. With NUTS and a well-specified model, 2000 draws across 4 chains typically yields ESS well above these thresholds. Always check ESS in the az.summary() output — high autocorrelation can reduce a 4000-sample chain to fewer than 50 effective samples.
What does an R-hat value above 1.01 mean in MCMC diagnostics?
R-hat (the Gelman-Rubin statistic) above 1.01 means your multiple MCMC chains are not exploring the same region of parameter space — a sign that convergence has not been achieved. Values between 1.01 and 1.05 are a warning requiring more draws or model reparameterization. Values above 1.05 indicate definite convergence failure. Root causes include multimodal posteriors, strong parameter correlations, an insufficient tuning period, or a misspecified model. Never use MCMC results when any parameter has R-hat above 1.05.
Why does PyMC use NUTS instead of Metropolis-Hastings by default?
NUTS (No-U-Turn Sampler) uses gradient information to propose distant but high-probability states, achieving effective sample sizes orders of magnitude higher than Metropolis-Hastings on the same compute budget for models with more than a few parameters. Metropolis-Hastings uses random-walk proposals that become increasingly inefficient as dimensionality grows. NUTS also automatically tunes its step size and trajectory length during warm-up, eliminating the need for manual tuning that made HMC difficult to use in practice. The additional cost per step is more than offset by needing far fewer total steps.
What is burn-in in MCMC and how long should it be?
Burn-in is the initial phase of MCMC sampling where the chain is still being influenced by its arbitrary starting position rather than reflecting the target posterior. These early samples must be discarded. In PyMC, the tune parameter controls the number of tuning steps that are automatically removed. The default of 1000 tuning steps is sufficient for most simple models. For complex hierarchical models or poorly-mixing posteriors, 2000 to 5000 tuning steps may be needed. Inspect trace plots: if the beginning shows systematic movement before settling, increase the tune parameter.
Need Expert Help with AI and Machine Learning?
Our AI and ML consultants can help you design Bayesian models, implement MCMC pipelines in Python, interpret posterior results, and integrate probabilistic inference into your data science workflows and production systems.
About the author
Co-founder & AI Practice Lead, Braincuber Technologies
Co-founder at Braincuber. Builds production AI agents (Anthropic Claude, OpenAI, AWS Bedrock) for US fintech, healthcare, and retail clients with SOC 2 Type II / HIPAA-scope deployments. Joins every architecture review personally.
