How to Predict FIFA World Cup 2026 Winners with Machine Learning
Predicting the FIFA World Cup 2026 winner with machine learning is a compelling real-world MLOps challenge: a tournament with 48 teams, 104 matches, limited historical data, and a bi-hourly retraining requirement during the competition itself. This tutorial is a step-by-step guide to building the full pipeline from scratch: ingesting raw match statistics and Elo ratings, engineering features, evaluating 10 candidate models across six families, simulating the tournament with Monte Carlo, and deploying automatically to Google Cloud Run with MLflow experiment tracking. Whether you are a data scientist curious about sports analytics or an ML engineer looking for a production-grade MLOps reference architecture, the guide covers every layer: data versioning with DVC, leakage-safe feature design, Ranked Probability Score evaluation, and the dual-mode frozen vs. per-round retraining experiment that tests whether continuous learning actually improves tournament prediction accuracy.
What You'll Learn:
- How to design a Medallion data architecture (Bronze, Silver, Gold) for sports match data using DVC
- How to engineer leakage-safe Elo features that dominate model importance by a factor of 100x
- How to evaluate 10 candidate models across six families, including XGBoost, Bayesian Poisson, and deep learning, using Ranked Probability Score (RPS)
- How to build a Monte Carlo simulation engine that runs 10,000 full-tournament simulations encoding the 2026 FIFA group format
- How to deploy an automated retraining pipeline on Google Cloud Run that refreshes bi-hourly during the tournament
- How to track experiments, model artifacts, and promotion decisions with MLflow and DagsHub
- Why deep learning underperforms with limited match data and why the top 5 models are statistically indistinguishable
- How to interpret the June 2026 predictions: Spain and Argentina at ~16%, USA as a coin-flip qualifier at 54.6%
System Overview: A Real-World MLOps Pipeline
This is not a toy notebook. The system described in this tutorial is a production MLOps pipeline designed to remain accurate throughout a live tournament. It ingests data from two external APIs, versions every transformation with DVC, evaluates models on a carefully designed holdout set, and auto-retriggers retraining whenever new match results become available. During the 2026 FIFA World Cup itself, the pipeline executes on a bi-hourly Cloud Scheduler trigger — automatically pulling new results, rebuilding the data pipeline, retraining three model variants, generating fresh predictions, running 10,000 Monte Carlo simulations, and scoring completed predictions against a baseline, all without human intervention.
The architecture is built on three foundational design choices: the Medallion pattern for data organization, Elo ratings as the dominant feature signal, and predicting goals scored rather than match outcomes. Predicting a scoreline distribution rather than a win/draw/loss label enables full probabilistic tournament simulation — you can simulate any knockout bracket by sampling from the predicted scoreline distributions for every possible matchup.
Medallion Data Pipeline
A three-layer Bronze-Silver-Gold architecture organizes raw API ingestion, cleaning, and model-ready feature extraction into fully reproducible stages managed by DVC. Every data snapshot is versioned so any historical model run can be reproduced exactly. The pipeline covers approximately 6,900 international matches since 2018 from two sources: API-Football for match statistics and eloratings.net for team Elo ratings.
XGBoost Champion Model
XGBoost achieves the best Ranked Probability Score (0.18289) on the 347-match holdout spanning the 2022 World Cup and six major tournaments. It is trained with time-decay weighting so recent matches carry more influence. The top five models are within 0.0011 RPS of each other, which suggests that data quality and feature engineering, not algorithm choice, determine the performance ceiling.
Monte Carlo Simulation
The simulation engine pre-generates predictions for all possible team matchups, then runs 10,000 full tournament simulations encoding the 2026 format: 12 groups of 4 teams, top-2 advance automatically, 8 best third-placed teams advance via the 495-combination FIFA mapping. Each run plays from group stage through the final, applying home-advantage adjustments for USA, Canada, and Mexico matches.
Automated MLOps Retraining
A Google Cloud Run service triggered by Cloud Scheduler checks for new match results and, if found, executes the full pipeline: DVC data rebuild, retraining of XGBoost, Bivariate Poisson, and Bayesian Poisson models, prediction generation, Monte Carlo simulation, and baseline scoring. Runs daily pre-tournament and bi-hourly during the competition. All runs are tracked in MLflow on DagsHub with automated model promotion.
Step 1: Design the Data Architecture — Medallion Pattern
The Medallion pattern organizes data into three named layers — Bronze, Silver, and Gold — each with a clearly defined responsibility. This separation is critical for reproducibility: you can always trace a model artifact back to the exact data snapshot and transformation code that produced it. DVC manages every layer as a versioned stage in a pipeline DAG, so dvc repro rebuilds only the stages that have changed since the last run.
Bronze Layer — Raw Ingestion
The Bronze layer stores raw data exactly as received from external sources, with no modifications. Two sources feed the pipeline. API-Football provides structured match-level statistics: scores, competition names, match dates, venues, and team identifiers for every recorded international fixture. eloratings.net provides team Elo ratings — a numerical strength metric derived from historical results — updated after each match. The pipeline covers approximately 6,900 international matches since 2018, giving sufficient recent history while avoiding the structural shifts from pre-modern football. Raw files are stored as JSON and CSV respectively, versioned via DVC remote storage so any collaborator can reproduce the exact dataset used for a given model run.
Silver Layer — Cleaning and Standardization
The Silver layer applies team name mapping, schema validation, and data cleaning. Team names are inconsistent across sources — "United States" vs. "USA" vs. "United States of America" all refer to the same national team. A canonical name mapping resolves these discrepancies. Schema validation enforces expected data types and ranges, catching API response changes before they corrupt downstream stages. The Silver layer outputs a single cleaned, standardized table of matches with consistent team identifiers, competition classifications, and Elo ratings joined from the ratings source at the match date.
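The name-resolution step can be sketched as a small lookup function. This is a minimal illustration, not the project's actual code: the alias table below is hypothetical (the real mapping lives in src/data/team_name_mapping.json), and raising on unknown names is an assumed design choice.

```python
# Hypothetical alias table for illustration only; the pipeline's real mapping
# is stored in src/data/team_name_mapping.json.
TEAM_ALIASES = {
    "usa": "United States",
    "united states": "United States",
    "united states of america": "United States",
}


def canonical_team_name(raw_name: str) -> str:
    """Map a source-specific team name to its canonical form.

    Unknown names raise KeyError so that a new alias is added deliberately
    instead of slipping through as a separate team.
    """
    # Normalize case and collapse repeated whitespace before the lookup.
    key = " ".join(raw_name.strip().lower().split())
    if key not in TEAM_ALIASES:
        raise KeyError(f"Unmapped team name: {raw_name!r}")
    return TEAM_ALIASES[key]
```

Failing loudly on an unmapped name fits the schema-validation goal described above: a renamed team in an API response stops the Silver stage instead of silently creating a second identity for the same side.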
Gold Layer — Model-Ready Features
The Gold layer transforms the cleaned Silver data into a one-row-per-match feature table ready for model training. Every feature in this layer is designed to be computable from information available before the match kicks off — a strict leakage-safety requirement. Rolling statistics use only pre-match history. Elo ratings used are the ratings as of the match date, before the outcome updates them. The Gold stage is the most computationally expensive layer and benefits most from DVC caching: if the Silver data and feature engineering code have not changed, dvc repro skips this stage entirely.
stages:
  bronze_matches:
    cmd: python src/data/ingest_matches.py
    deps: [src/data/ingest_matches.py, params.yaml]
    outs: [data/bronze/matches_raw.json]
  bronze_elo:
    cmd: python src/data/ingest_elo.py
    deps: [src/data/ingest_elo.py, params.yaml]
    outs: [data/bronze/elo_ratings_raw.csv]
  silver:
    cmd: python src/data/build_silver.py
    deps:
      - src/data/build_silver.py
      - data/bronze/matches_raw.json
      - data/bronze/elo_ratings_raw.csv
      - src/data/team_name_mapping.json
    outs: [data/silver/matches_clean.parquet]
  gold:
    cmd: python src/features/build_gold.py
    deps:
      - src/features/build_gold.py
      - data/silver/matches_clean.parquet
    outs: [data/gold/features.parquet]
  train:
    cmd: python src/models/train_all.py
    deps:
      - src/models/train_all.py
      - data/gold/features.parquet
    outs:
      - models/xgboost_home.pkl
      - models/xgboost_away.pkl
      - models/bivariate_poisson.pkl
      - models/bayesian_poisson_trace.nc
Step 2: Engineer Features — 6 Categories Designed for Leakage Safety
Feature engineering is where the domain knowledge lives in this system. Six feature categories feed the model, and every single one is computed exclusively from information available before the match begins. This leakage-safety requirement is non-negotiable: if you accidentally include any post-match information in a training feature, your model will appear to perform extraordinarily well on the training set but will fail entirely at inference time when that information is unavailable.
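One way to make the leakage-safety rule concrete is to read each rolling statistic before the current match is added to a team's history. The sketch below uses only the standard library and assumes a hypothetical input format (a list of dicts with `date`, `home`, `away`, `home_goals`, `away_goals`); the project's Gold build script may well use pandas instead.

```python
from collections import defaultdict, deque


def add_rolling_goals_for(matches, window=5):
    """Attach each team's pre-match average goals scored over its last `window` games.

    The average is read BEFORE the current match updates the history, so a
    match's own result can never appear in its own features.
    """
    history = defaultdict(lambda: deque(maxlen=window))
    rows = []
    # Strict temporal ordering: process matches oldest first.
    for match in sorted(matches, key=lambda r: r["date"]):
        features = dict(match)
        for side in ("home", "away"):
            past = history[match[side]]
            features[f"rolling_goals_for_{side}_{window}"] = (
                sum(past) / len(past) if past else None
            )
        rows.append(features)
        # Only now does the result enter the history, so it can influence
        # later matches only.
        history[match["home"]].append(match["home_goals"])
        history[match["away"]].append(match["away_goals"])
    return rows
```

A team's first recorded match gets `None` rather than zero, which keeps "no history" distinguishable from "scored no goals".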
Design Data Architecture
Set up the Medallion pattern with DVC. Initialize a DVC repository, configure remote storage (GCS bucket), and create the Bronze-Silver-Gold directory structure. Write ingestion scripts for API-Football and eloratings.net, a cleaning script for the Silver layer, and a feature engineering script for the Gold layer. Define the full pipeline in dvc.yaml and run dvc repro to build all stages. Verify reproducibility by checking that a second dvc repro run completes instantly with no stages re-executed. This confirms that every stage is properly tracked and cached.
Engineer Features
Implement the six feature categories in the Gold build script. Compute Elo Difference (home team Elo minus away team Elo at match date), Elo Sum (combined rating indicating fixture quality), and Rolling Elo Change (form over the last 5 matches for each team). Add Rolling Goals For and Against (recent offensive and defensive output over the last 5 matches per team). Encode Match Context features: competition tier (World Cup qualifier, continental championship, friendly), knockout status (group stage, round of 16, final), and venue type (home, away, neutral). Enforce strict temporal ordering in all rolling calculations — no future data can leak backward across the sort index.
Select and Evaluate Models
Train all 10 model candidates on pre-2022 World Cup data with time-decay weighting so recent matches carry proportionally more influence during training. Evaluate every model on the 347-match holdout set covering the 2022 World Cup plus six major tournaments. Use Ranked Probability Score (RPS) as the evaluation metric — lower is better. RPS is specifically designed for ordinal outcomes like goals scored, penalizing confident wrong predictions more severely than uncertain ones. Log all runs to MLflow with data snapshots, training durations, holdout RPS, and model artifacts. Promote the lowest-RPS model to production automatically via code-based model registration.
Build Monte Carlo Simulation
Pre-generate predicted scoreline distributions for every possible matchup among all 48 qualified nations: 48×47/2 = 1,128 unordered pairs, or 48×47 = 2,256 directed matchups once the home/away designation is taken into account. Store these as a lookup table. Encode the 2026 format: 12 groups of 4 teams with top-2 automatic advancement plus 8 best third-placed teams advancing via the 495 possible third-place combination mappings defined by FIFA's official bracket allocation rules. Apply home-advantage adjustments for matches played in USA, Canada, and Mexico venues. Execute 10,000 simulation runs, each playing the full tournament from group stage through the final by sampling from the predicted scoreline distributions. Aggregate results to compute championship probabilities for each of the 48 teams.
Deploy Production Pipeline
Package the full pipeline as a Google Cloud Run container. Configure two Cloud Scheduler triggers: a daily trigger for the pre-tournament period and a bi-hourly trigger for the active competition window. The container startup logic checks for new match results first — if no new data is available, the run exits immediately without retraining, keeping costs minimal. When new results are found, it executes the full sequence: dvc repro to rebuild Silver and Gold layers, retrain the three production models, generate fresh fixture predictions, run Monte Carlo, and score completed predictions against the mean-rate Poisson baseline. All execution metadata is logged to MLflow.
Track with MLflow
Configure MLflow tracking on DagsHub for full experiment auditability. Log exact data snapshots (row counts and timestamps for Bronze, Silver, and Gold stages), run mode tags (frozen vs. per-round), lifecycle stage tags, holdout RPS as the primary selection metric, and training duration for each model — particularly important because Bayesian MCMC is approximately 100 times slower than XGBoost and Random Forest. Store model artifacts and prediction CSV files as MLflow artifacts so any previous run's outputs can be reproduced. Use code-based model promotion: the script automatically registers the lowest-RPS model from the current run as the production version in the MLflow Model Registry.
Model Selection: Evaluating 10 Candidates
Ten model candidates across six families were evaluated on the same training and holdout split. The goal is not just to find the best model — it is to understand the performance ceiling imposed by the data itself and to identify where algorithmic improvements can no longer help. The evaluation protocol uses time-based splitting (training on pre-2022 World Cup data) to prevent temporal leakage, and time-decay weighting so that the model reflects the current strength of teams rather than their historical average.
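The time-based split and time-decay weighting can be sketched as follows. The source does not state the decay function, so the exponential form and the 730-day half-life here are assumptions chosen for illustration; the cutoff argument would be the start of the 2022 World Cup.

```python
from datetime import date


def time_decay_weight(match_date: date, reference_date: date,
                      half_life_days: float = 730.0) -> float:
    """Exponential decay: a match `half_life_days` old gets weight 0.5.

    The half-life value is an illustrative assumption, not taken from the project.
    """
    age_days = (reference_date - match_date).days
    if age_days < 0:
        raise ValueError("match_date is after reference_date (temporal leakage)")
    return 0.5 ** (age_days / half_life_days)


def split_by_cutoff(matches, cutoff: date):
    """Time-based split: matches strictly before `cutoff` train, the rest are held out."""
    train = [m for m in matches if m["date"] < cutoff]
    holdout = [m for m in matches if m["date"] >= cutoff]
    return train, holdout
```

Weights computed this way would typically be passed to the learner as per-row sample weights, so older matches still contribute but count for less.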
Each model predicts a goals-scored distribution for both teams in a match, not just a win/draw/loss outcome. This is the key design decision that enables tournament simulation: if you only predict match outcomes, you cannot resolve group standings, because tiebreakers such as goal difference and goals scored depend on actual scorelines. Predicting goals (typically modeled as Poisson or negative binomial counts) gives you a full probability distribution over all possible score combinations, from which win/draw/loss probabilities and tournament advancement probabilities naturally follow.
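To show how outcome probabilities follow from goal distributions, here is a minimal sketch that sums a scoreline grid. It assumes the two teams' goals are independent Poisson counts, which is a simplification (the Bivariate Poisson candidate exists precisely to relax it), and the `max_goals` truncation is an illustrative choice.

```python
import math


def poisson_pmf(k: int, lam: float) -> float:
    """Probability of exactly k goals given an expected goal rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def outcome_probabilities(lam_home: float, lam_away: float, max_goals: int = 10):
    """Return (P(home win), P(draw), P(away win)) from two expected-goal rates.

    Scorelines above `max_goals` per team are dropped and the remaining grid
    is renormalized so the three probabilities sum to 1.
    """
    p_home = p_draw = p_away = 0.0
    for h in range(max_goals + 1):
        for a in range(max_goals + 1):
            p = poisson_pmf(h, lam_home) * poisson_pmf(a, lam_away)
            if h > a:
                p_home += p
            elif h == a:
                p_draw += p
            else:
                p_away += p
    total = p_home + p_draw + p_away
    return p_home / total, p_draw / total, p_away / total
```

The same grid also yields any scoreline-dependent quantity, such as the expected goal difference a group-stage tiebreaker would need.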
| Model | Family | RPS Score | Notes |
|---|---|---|---|
| XGBoost | ML | 0.18289 | Champion model; handles non-linear Elo interactions; fast retraining (~seconds) |
| Bayesian Poisson (MCMC) | Bayesian | 0.18316 | Full posterior uncertainty; ~100x slower than XGBoost; 0.00027 RPS gap from champion |
| Negative Binomial | Statistical | 0.18373 | Handles overdispersion in goals data better than standard Poisson |
| Bivariate Poisson | Statistical | 0.18389 | Models home/away goals jointly with correlation term; interpretable parameters |
| Random Forest | ML | 0.18392 | Good baseline; all top-5 within 0.0011 RPS — statistically indistinguishable |
| SARIMAX | Time Series | 0.18541 | Captures seasonality; limited by irregular international match schedule |
| Ridge Regression | ML | 0.18607 | Strong linear baseline; regularization prevents overfitting with limited features |
| Mean-Rate Poisson | Baseline | 0.18891 | Competition baseline; predicts average goals for all teams regardless of opponent |
| LSTM | Deep Learning | 0.19234 | Underperforms baseline; insufficient data — needs 100k+ rows for effective sequence learning |
| 1D CNN | Deep Learning | 0.19418 | Worst performer; convolutional patterns require dense sequences not present in match schedules |
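For reference, the Ranked Probability Score for a single match can be computed as below. This sketch assumes the three-outcome form commonly used in football forecasting, with outcomes ordered (home win, draw, away win); the project may compute RPS over a different set of ordered categories.

```python
def ranked_probability_score(probs, outcome_index):
    """RPS for one forecast over ordered outcomes; lower is better.

    `probs` is the predicted distribution in outcome order and
    `outcome_index` is the position of the outcome that occurred.
    """
    n = len(probs)
    cum_pred = 0.0
    cum_obs = 0.0
    total = 0.0
    # Compare cumulative predicted and observed distributions; the final
    # cumulative term is always 1 vs 1, so it is skipped.
    for i in range(n - 1):
        cum_pred += probs[i]
        cum_obs += 1.0 if i == outcome_index else 0.0
        total += (cum_pred - cum_obs) ** 2
    return total / (n - 1)
```

Because the score works on cumulative probabilities, predicting a draw when the home side wins is penalized less than predicting an away win, which is the ordinal sensitivity the metric is chosen for.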
Deep Learning Underperforms — and the Top 5 Models Are Statistically Indistinguishable
Both deep learning architectures (LSTM and 1D CNN) perform worse than the mean-rate Poisson baseline. With approximately 6,900 matches in total, far fewer than the 100,000+ rows that sequence models typically need to generalize reliably, the neural networks overfit aggressively. The lesson for datasets of this size is to prefer simpler statistical and tree-based models over deep learning. Additionally, the top 5 models span only 0.0011 RPS. The performance ceiling is set by the data and features, not the algorithm, so switching from XGBoost to any other top-5 model provides no meaningful real-world improvement.
Key Insight: Elo Difference Dominates at 100x — Data Quality Beats Algorithm Choice
XGBoost feature importance analysis reveals that Elo Difference is approximately 100 times more important than the next most important feature. This single number — the gap between the two teams' Elo ratings before the match — captures the accumulated historical evidence of which team is stronger. Rolling goals, competition tier, and venue type all contribute, but they are secondary signals that refine the Elo-driven prediction rather than driving it. The practical implication is clear: investing effort in high-quality, current Elo ratings yields far more prediction improvement than experimenting with more complex algorithms. Data quality beats model complexity every time at this scale.
Feature Importance: What Actually Drives Predictions
| Feature | Category | Importance Level | Description |
|---|---|---|---|
| elo_diff | Elo Rating | Dominant (~100x) | Home team Elo minus away team Elo at match date; single most informative signal in the dataset |
| elo_sum | Elo Rating | High | Combined rating indicating overall fixture quality — high-profile elite matchups vs. mismatches |
| elo_change_home_5 | Form | Medium | Rolling Elo point change over last 5 home-team matches — measures recent form trajectory |
| elo_change_away_5 | Form | Medium | Rolling Elo point change over last 5 away-team matches — captures hot/cold streaks |
| rolling_goals_for_home_5 | Goals | Medium | Average goals scored by home team in last 5 matches — recent offensive output |
| rolling_goals_against_away_5 | Goals | Medium | Average goals conceded by away team in last 5 matches — recent defensive vulnerability |
| competition_tier | Match Context | Low-Medium | World Cup, major continental championship, qualifier, or friendly — stakes affect team motivation |
| is_knockout | Match Context | Low | Binary knockout stage flag — elimination matches tend toward lower-scoring, more conservative play |
| venue_type | Match Context | Low | Home, away, or neutral venue; home advantage adjustment applied separately in Monte Carlo for 2026 hosts |
The Monte Carlo Simulation Engine in Detail
The Monte Carlo simulation is what transforms a set of match predictions into tournament winner probabilities. A single match prediction tells you the probability distribution over scorelines for one game. A tournament simulation aggregates thousands of complete tournament paths to compute how often each team lifts the trophy.
Pre-Computing the Match Prediction Lookup Table
Before running any simulations, the system pre-generates predictions for every possible pair among the 48 qualified nations. With 48 teams, there are 48 x 47 = 2,256 directed matchups (home/away designation matters for the home-advantage adjustment applied to USA, Canada, and Mexico matches). Each prediction is a probability distribution over goals scored, stored as a dictionary keyed by the team pair. Pre-computing avoids redundant model inference during the 10,000 simulation runs and reduces total runtime from hours to minutes.
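The size of the lookup table can be checked directly by enumerating its keys. The team names below are placeholders rather than the actual qualified field.

```python
from itertools import combinations, permutations


def directed_matchups(teams):
    """Every ordered (home, away) pair; order matters for home-advantage adjustments."""
    return list(permutations(teams, 2))


def unordered_pairs(teams):
    """Every pairing regardless of which side is listed first."""
    return list(combinations(teams, 2))


# Placeholder names standing in for the 48 qualified nations.
teams = [f"team_{i:02d}" for i in range(48)]
n_directed = len(directed_matchups(teams))   # 48 * 47
n_unordered = len(unordered_pairs(teams))    # 48 * 47 / 2
```

Keying the table by ordered pair doubles its size relative to unordered pairs, but it lets a lookup return the rates for a specific home/away designation without any further bookkeeping.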
Encoding the 2026 FIFA Format
The 2026 FIFA World Cup introduces a new format: 12 groups of 4 teams, with the top 2 from each group advancing automatically (24 teams total) plus the 8 best third-placed finishers from the 12 groups. The 8 third-placed teams that advance are chosen by ranking all 12, and the bracket position each one receives is determined by the group letters of those 8 teams, following an official FIFA mapping table. There are C(12,8) = 495 possible combinations of third-place qualifiers, each mapping to a specific knockout bracket assignment. The simulation hardcodes the complete FIFA mapping for all 495 combinations, exactly replicating the official bracket construction procedure.
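A hardcoded table with 495 entries is easy to get wrong, so it helps to enumerate the expected keys and check the table against them. The sketch below assumes groups are labeled A through L and that the mapping is keyed by a sorted tuple of group letters; `missing_third_place_keys` is a hypothetical helper, and the bracket assignments themselves must come from FIFA's published table.

```python
from itertools import combinations
from math import comb

GROUP_LETTERS = "ABCDEFGHIJKL"  # 12 groups, assumed labeling


def third_place_keys():
    """All sets of 8 groups whose third-placed team could advance, as sorted tuples."""
    return [tuple(c) for c in combinations(GROUP_LETTERS, 8)]


def missing_third_place_keys(mapping):
    """Return the combinations that a third-place mapping fails to cover."""
    return [key for key in third_place_keys() if key not in mapping]


assert len(third_place_keys()) == comb(12, 8)  # C(12, 8) combinations
```

Running the completeness check once at start-up means an incomplete table fails immediately instead of raising a lookup error partway through 10,000 simulations.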
Home Advantage Adjustments
The 2026 World Cup is co-hosted by the United States, Canada, and Mexico. These three nations receive a home-advantage adjustment in their simulated matches. The adjustment modifies the expected goals parameter for the host team upward and the visiting team downward, reflecting the documented empirical boost that playing on home soil provides in international football. The adjustment magnitude is calibrated from historical home vs. neutral venue performance data in the training dataset.
import numpy as np

# Pre-computed scoreline distributions for all team pairs
# prediction_table[(team_a, team_b)] = (lambda_a, lambda_b)
# where lambda = expected goals from XGBoost model

def simulate_match(lambda_home, lambda_away):
    """Sample a single knockout match result from Poisson goal distributions."""
    goals_home = np.random.poisson(lambda_home)
    goals_away = np.random.poisson(lambda_away)
    if goals_home > goals_away:
        return 'home'
    elif goals_away > goals_home:
        return 'away'
    else:
        # Knockout stage: simulate penalty shootout (50/50 after extra time)
        return 'home' if np.random.random() < 0.5 else 'away'

def simulate_group_stage(groups, prediction_table, home_advantage_teams):
    """Simulate all group stage matches and return group standings."""
    standings = {}
    for group_name, teams in groups.items():
        points = {t: 0 for t in teams}
        goals_for = {t: 0 for t in teams}
        goals_against = {t: 0 for t in teams}
        # Each team plays 3 matches (round robin within group of 4)
        matchups = [(teams[i], teams[j]) for i in range(4) for j in range(i + 1, 4)]
        for team_a, team_b in matchups:
            lam_a, lam_b = prediction_table[(team_a, team_b)]
            # Apply home advantage to whichever side is a host nation
            # (this sketch treats every host-nation match as a home match)
            if team_a in home_advantage_teams:
                lam_a *= 1.12
                lam_b *= 0.92
            elif team_b in home_advantage_teams:
                lam_b *= 1.12
                lam_a *= 0.92
            goals_a = np.random.poisson(lam_a)
            goals_b = np.random.poisson(lam_b)
            goals_for[team_a] += goals_a
            goals_against[team_a] += goals_b
            goals_for[team_b] += goals_b
            goals_against[team_b] += goals_a
            if goals_a > goals_b:
                points[team_a] += 3
            elif goals_b > goals_a:
                points[team_b] += 3
            else:
                points[team_a] += 1
                points[team_b] += 1
        # Sort by points, then goal difference, then goals for
        ranked = sorted(teams, key=lambda t: (
            points[t], goals_for[t] - goals_against[t], goals_for[t]
        ), reverse=True)
        standings[group_name] = {
            'ranked': ranked,
            'points': points,
            'gd': {t: goals_for[t] - goals_against[t] for t in teams},
            'gf': goals_for
        }
    return standings

def run_tournament(groups, prediction_table, home_advantage_teams, fifa_third_place_mapping):
    """Run one complete tournament simulation. Returns the champion."""
    standings = simulate_group_stage(groups, prediction_table, home_advantage_teams)
    # Collect top-2 from each group (24 teams) + 8 best third-placed
    top2 = []
    third_placed = []
    for g, data in standings.items():
        top2.extend(data['ranked'][:2])
        third = data['ranked'][2]
        third_placed.append((third, g, data['points'][third],
                             data['gd'][third], data['gf'][third]))
    # Rank all third-placed teams; select best 8
    third_sorted = sorted(third_placed, key=lambda x: (x[2], x[3], x[4]), reverse=True)
    best_8_third = third_sorted[:8]
    # Lookup bracket positions from FIFA official third-place combination mapping
    groups_of_best_8 = tuple(sorted([t[1] for t in best_8_third]))
    bracket_slots = fifa_third_place_mapping[groups_of_best_8]
    # Simulate knockout rounds (Round of 32 through Final)
    knockout_teams = top2 + [t[0] for t in best_8_third]
    # ... knockout bracket simulation continues to Final
    # (simulate_knockout is defined elsewhere and not shown here)
    return simulate_knockout(knockout_teams, bracket_slots, prediction_table, home_advantage_teams)

# Run 10,000 simulations
# (all_teams, groups, prediction_table, HOST_NATIONS, and FIFA_THIRD_PLACE_MAP
# are loaded elsewhere and not shown here)
N_SIMS = 10_000
championship_counts = {team: 0 for team in all_teams}
for _ in range(N_SIMS):
    winner = run_tournament(groups, prediction_table, HOST_NATIONS, FIFA_THIRD_PLACE_MAP)
    championship_counts[winner] += 1

# Convert to probabilities
championship_probs = {t: c / N_SIMS for t, c in championship_counts.items()}

# Top results (June 10, 2026 run):
# Spain: ~16.1%    Argentina: ~15.8%    France: ~12.4%
# England: ~9.7%   Brazil: ~8.9%        United States: ~2.1%
Production Deployment: Google Cloud Run and Cloud Scheduler
The production deployment uses Google Cloud Run for serverless execution and Cloud Scheduler for automated triggering. Cloud Run scales to zero when not in use, which is ideal for a pipeline that runs once per day pre-tournament and bi-hourly during the competition — you pay only for actual execution time. The container image is built from a multi-stage Dockerfile that separates the Python dependency installation from the pipeline code, keeping image size manageable despite the heavy scientific Python stack.
Dual-Mode Experiment Design
The pipeline runs in two modes simultaneously throughout the tournament, implementing a controlled experiment to measure whether continuous retraining improves prediction accuracy in the live competition context. The Frozen mode locks model parameters at tournament start and never retrains during the competition — it uses the same model weights throughout all 104 matches. The Per-Round mode keeps hyperparameters fixed but retrains model parameters after each matchday in the group stage and after each round in the knockout stage, incorporating new match results as they happen.
By running both modes in parallel and comparing their RPS on upcoming fixtures throughout the tournament, the experiment directly answers the question: does access to more recent match data — including the results of matches already played in the 2026 World Cup itself — improve prediction accuracy for future matches? This has practical implications for all sports prediction systems: if continuous retraining does not help, significant engineering complexity can be eliminated.
"""
Cloud Run pipeline entry point.
Triggered by Cloud Scheduler (daily pre-tournament; bi-hourly during competition).
Exits early if no new match results found — keeps execution cost minimal.
"""
import subprocess
import sys
import time
from datetime import datetime, timezone

import mlflow

from src.data.check_new_results import has_new_results
from src.models.predict_fixtures import predict_upcoming
from src.models.train_all import train_all_models
from src.simulation.monte_carlo import run_monte_carlo
from src.evaluation.score_predictions import score_completed_predictions

RUN_MODES = ['frozen', 'per_round']

def run_pipeline(run_mode: str):
    start_time = time.time()
    # Timezone-aware UTC timestamp (datetime.utcnow() is deprecated since Python 3.12)
    now = datetime.now(timezone.utc)
    mlflow.set_tracking_uri("https://dagshub.com/your-org/wc2026.mlflow")
    mlflow.set_experiment(f"wc2026_{run_mode}")
    with mlflow.start_run(tags={"run_mode": run_mode, "timestamp": now.isoformat()}):
        # Step 1: Check for new match results — exit early if none
        if not has_new_results():
            print(f"[{run_mode}] No new results found. Exiting.")
            mlflow.log_param("exit_reason", "no_new_results")
            return

        # Step 2: Rebuild Silver and Gold layers via DVC
        # Note: --downstream runs the silver stage and every stage after it
        # in dvc.yaml (gold and train) whose dependencies have changed
        print(f"[{run_mode}] Rebuilding data pipeline...")
        result = subprocess.run(["dvc", "repro", "--downstream", "silver"], capture_output=True)
        if result.returncode != 0:
            raise RuntimeError(f"dvc repro failed: {result.stderr.decode()}")

        # Step 3: Retrain models (skip in frozen mode after tournament start)
        if run_mode == 'per_round':
            print(f"[{run_mode}] Retraining models...")
            metrics = train_all_models(log_to_mlflow=True)
            mlflow.log_metrics(metrics)  # Logs holdout RPS for each model

        # Step 4: Generate predictions for upcoming fixtures
        # (predict_upcoming is expected to write the CSV that is logged below)
        print(f"[{run_mode}] Generating fixture predictions...")
        predictions = predict_upcoming(run_mode=run_mode)
        mlflow.log_artifact(f"predictions_{run_mode}_{now.date()}.csv")

        # Step 5: Run Monte Carlo simulation
        print(f"[{run_mode}] Running Monte Carlo (10,000 runs)...")
        championship_probs = run_monte_carlo(n_sims=10_000, run_mode=run_mode)
        mlflow.log_dict(championship_probs, "championship_probabilities.json")

        # Step 6: Score completed predictions against baseline
        print(f"[{run_mode}] Scoring completed predictions...")
        scoring_results = score_completed_predictions(run_mode=run_mode)
        mlflow.log_metrics(scoring_results)  # model_rps, baseline_rps, delta_rps

        elapsed = time.time() - start_time
        mlflow.log_metric("pipeline_duration_seconds", elapsed)
        print(f"[{run_mode}] Pipeline complete in {elapsed:.1f}s")

if __name__ == "__main__":
    for mode in RUN_MODES:
        run_pipeline(mode)
    sys.exit(0)
MLflow and DagsHub Experiment Tracking
MLflow tracking is central to the system's auditability. Every pipeline execution — whether a pre-tournament baseline run, a per-round retraining, or a frozen-mode update — creates an MLflow run with a complete record of the inputs, outputs, and decisions made during that execution. This makes the system fully auditable: you can always trace a published prediction back to the specific data snapshot, model version, and code commit that produced it.
What Gets Logged in Every Run
Each MLflow run records the following:
- Exact data snapshot information, including row counts and last-updated timestamps for the Bronze, Silver, and Gold layers
- Run mode tag (frozen or per-round) and lifecycle stage tag (pre-tournament, group-stage, round-of-16, etc.)
- Holdout RPS for each of the three production models (XGBoost, Bivariate Poisson, Bayesian Poisson) as the primary model selection metric
- Training duration in seconds for each model, which is critical because Bayesian MCMC with PyMC takes approximately 100 times longer than XGBoost and Random Forest and therefore affects scheduling decisions
- Model artifacts stored as MLflow model objects registered in the Model Registry
- Prediction CSV files for upcoming fixtures
- Championship probability JSON for the 48 teams
- Total pipeline duration
Automated Model Promotion
Model promotion is handled by code, not by manual review. At the end of each training run, the script queries the MLflow Model Registry for the currently active Production model and compares its holdout RPS against the newly trained model's RPS. If the new model achieves lower RPS (better performance), it is automatically promoted to Production stage and the previous version is archived. This automated, auditable promotion process eliminates subjective human decisions about model selection and creates a clear record of every production deployment decision and its justification.
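The comparison at the heart of the promotion step can be isolated as a pure function, independent of any registry API. This is a sketch of the decision rule only: the `min_improvement` margin is a hypothetical addition, not something the source describes, and the registry calls that would act on the result are omitted.

```python
from typing import Optional


def should_promote(candidate_rps: float,
                   production_rps: Optional[float],
                   min_improvement: float = 0.0) -> bool:
    """Decide whether a newly trained model should replace the production model.

    Lower RPS is better. With no production model yet, the candidate is
    promoted. `min_improvement` (hypothetical) requires the candidate to win
    by at least that margin.
    """
    if production_rps is None:
        return True
    return candidate_rps < production_rps - min_improvement
```

With the default margin of zero, this matches the rule described above: any lower RPS triggers promotion. Given how closely the top models score, a non-zero margin would be one way to avoid switching models on differences that are effectively noise.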
Tournament Predictions: June 10, 2026 Results
The Monte Carlo simulation run on June 10, 2026 — shortly before the tournament's opening matches — produces the following championship probability distribution. These numbers reflect the current Elo ratings and recent form of all 48 qualified nations, the 2026 group draw, and the home-advantage adjustments for the three host nations.
| Team | Championship % | Group Escape % | Tier |
|---|---|---|---|
| Spain | ~16.1% | ~91% | Top Favorite |
| Argentina | ~15.8% | ~90% | Top Favorite |
| France | ~12.4% | ~88% | Strong Contender |
| England | ~9.7% | ~86% | Strong Contender |
| Brazil | ~8.9% | ~84% | Strong Contender |
| Colombia | ~5.2% | ~78% | Dark Horse |
| United States | ~2.1% | 54.6% | Coin-Flip Team |
The United States entry in the predictions is the most analytically interesting result. With a 54.6% group escape probability — the 13th-lowest among all 48 teams — and a ~2.1% championship probability — the 13th-highest — the USA presents a striking asymmetry. The model describes them as a "coin-flip team": they are more likely to exit in the group stage than any traditional powerhouse, yet if they do advance, the host nation effect and the favorable potential bracket paths mean they slightly outperform their Elo rating in deep-tournament probability. This is a genuine prediction, not a home bias — it emerges directly from the Elo-driven simulation with the home-advantage adjustment calibrated from historical data.
Spain and Argentina at approximately 16% each reflects both their elite Elo ratings and the inherent uncertainty of a 48-team tournament that ends in five straight knockout rounds. Even the world's strongest team wins the World Cup in only about 1 in 6 simulations. The format introduces enormous variance, which is exactly what makes it compelling to watch and challenging to predict.
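When reading these percentages, it is worth knowing how much of the gap between two teams could be simulation noise alone. The sketch below applies the standard binomial standard error to a Monte Carlo proportion. It covers only sampling error from the finite number of runs, not uncertainty in the underlying model, and the 1.96 multiplier assumes a normal approximation.

```python
import math


def monte_carlo_standard_error(p_hat: float, n_sims: int) -> float:
    """Standard error of a probability estimated as a proportion of n_sims runs."""
    return math.sqrt(p_hat * (1.0 - p_hat) / n_sims)


def approx_95_interval(p_hat: float, n_sims: int):
    """Approximate 95% interval for the estimate (normal approximation)."""
    half_width = 1.96 * monte_carlo_standard_error(p_hat, n_sims)
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Spain's estimate from the table: ~16.1% over 10,000 simulations.
spain_se = monte_carlo_standard_error(0.161, 10_000)
spain_interval = approx_95_interval(0.161, 10_000)
```

At 10,000 runs the standard error on a 16% estimate is roughly 0.4 percentage points, so the 0.3-point difference between Spain and Argentina is within simulation noise and the two are best read as co-favorites.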
Tech Stack Reference
| Library / Tool | Category | Purpose |
|---|---|---|
| XGBoost | ML | Champion prediction model; gradient boosted trees with time-decay sample weights |
| scikit-learn | ML | Ridge regression baseline, Random Forest model, preprocessing pipelines |
| PyMC | Bayesian | Bayesian Poisson model with MCMC posterior sampling via NUTS |
| statsmodels | Statistical | Bivariate Poisson, Negative Binomial, and SARIMAX model implementations |
| SciPy | Scientific | Poisson PMF computation, Ranked Probability Score calculation, statistical tests |
| Keras / TensorFlow | Deep Learning | LSTM and 1D CNN model implementations (evaluated; not selected for production) |
| pandas / NumPy | Data | Data manipulation, feature engineering, rolling window calculations across the Gold layer |
| DVC | MLOps | Data version control; Medallion pipeline DAG; reproducible dvc repro execution |
| MLflow | MLOps | Experiment tracking, model registry, automated promotion, artifact storage |
| DagsHub | MLOps | Remote MLflow tracking server; Git + DVC repository hosting; team collaboration |
| Google Cloud Run | Infrastructure | Serverless container execution; scales to zero between runs; cost-efficient for scheduled workloads |
| Cloud Scheduler | Infrastructure | Triggers Cloud Run container; daily pre-tournament, bi-hourly during competition window |
| Streamlit | Visualization | Public-facing dashboard displaying championship probabilities, fixture predictions, and bracket simulation |
Key Findings: What This Project Teaches About Practical ML
Beyond the World Cup predictions themselves, this project produces five research-quality findings that generalize to sports prediction and applied ML more broadly.
Finding 1: Algorithm Choice Is Secondary to Data Quality
The top five models (XGBoost, Bayesian Poisson, Negative Binomial, Bivariate Poisson, and Random Forest) span only 0.0011 RPS. On a 347-match holdout, a gap this small is not statistically distinguishable from noise. Algorithmic sophistication adds little when the training data contains approximately 6,900 rows and the dominant signal (Elo difference) is already a highly compressed, information-dense feature. The performance ceiling is data-imposed, not algorithm-imposed. The practical lesson: before investing in more complex models, invest in better data collection, higher-quality ground truth labels, and more informative feature sources.
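One way to check whether two models are distinguishable on a small holdout is a paired bootstrap over per-match RPS differences. The article does not say which test the project used, so treat the following as an illustrative sketch: `paired_bootstrap_ci` is a hypothetical helper, and `diffs` is assumed to hold model A's RPS minus model B's RPS for each holdout match.

```python
import random
from statistics import mean


def paired_bootstrap_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean paired difference.

    Matches are resampled with replacement, keeping each match's two model
    scores paired (only their difference is stored in `diffs`).
    """
    if not diffs:
        raise ValueError("diffs must contain at least one value")
    rng = random.Random(seed)
    n = len(diffs)
    boot_means = []
    for _ in range(n_boot):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        boot_means.append(mean(resample))
    boot_means.sort()
    lo_idx = int(round((alpha / 2) * n_boot))
    hi_idx = min(n_boot - 1, int(round((1 - alpha / 2) * n_boot)) - 1)
    return boot_means[lo_idx], boot_means[hi_idx]
```

If the resulting interval contains zero, the holdout cannot separate the two models at the chosen level, which is the situation described for the top five here.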
Finding 2: Deep Learning Requires Dense, Large-Scale Data
Both LSTM and 1D CNN models performed worse than the naive mean-rate Poisson baseline. International football produces a sparse, irregular time series: teams may play only 4-6 matches per year, with long gaps that leave very little sequential structure to learn. Neural networks learn hierarchical feature representations from large numbers of examples, an inductive bias that is poorly matched to this data regime. As a rough rule of thumb, deep learning starts to outperform well-engineered statistical and gradient boosting models on tabular data only at around 100,000+ examples. With fewer than 10,000 rows, as here, the deep models overfit and underperformed.
Finding 3: Tournament Matches Are More Predictable Than Regular Fixtures
When analyzing prediction accuracy by competition type, tournament matches (World Cup, Euros, Copa America) proved more predictable than regular friendlies and qualifiers. This counterintuitive result likely reflects that elite team quality is compressed at major tournaments — every team has qualified — eliminating the extreme mismatches that make qualifiers and friendlies harder to predict. Tournament stakes also reduce the influence of squad rotation and motivation differences, making Elo ratings and recent form more reliable predictors of actual performance.
Finding 4: Predicting Goals Enables Tournament Simulation; Predicting Outcomes Does Not
The decision to predict goals scored rather than match outcomes is architecturally foundational, not just a modeling preference. A model that outputs only win/draw/loss probabilities cannot by itself simulate the tournament: a drawn knockout match must still send one team through after extra time and penalties, and group tiebreakers depend on goals scored and conceded. A model that predicts scoreline distributions provides everything needed: you sample a scoreline, determine the winner, and proceed to the next round. This design decision is what makes the full Monte Carlo tournament simulation possible and is the core reason this system can produce champion probability estimates rather than just match-by-match predictions.
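The sample-then-advance step can be sketched in a few lines. This is a minimal illustration under stated assumptions: goals are drawn from independent Poisson distributions with hypothetical expected-goal rates, and a level score is settled by a 50/50 draw standing in for extra time and penalties, since the article does not describe the project's tie-break model. `sample_poisson` and `play_knockout` are illustrative names.

```python
import math
import random


def sample_poisson(lam: float, rng: random.Random) -> int:
    """Draw one Poisson(lam) variate using Knuth's multiplication method."""
    threshold = math.exp(-lam)
    k, product = 0, rng.random()
    while product > threshold:
        k += 1
        product *= rng.random()
    return k


def play_knockout(team_a: str, team_b: str, lam_a: float, lam_b: float,
                  rng: random.Random) -> str:
    """Sample a scoreline and return the team that advances.

    Assumption: a draw after sampling is resolved by a fair coin flip,
    a placeholder for extra time and a penalty shoot-out.
    """
    goals_a = sample_poisson(lam_a, rng)
    goals_b = sample_poisson(lam_b, rng)
    if goals_a != goals_b:
        return team_a if goals_a > goals_b else team_b
    return team_a if rng.random() < 0.5 else team_b


# Hypothetical expected-goal rates, chosen only to exercise the code.
rng = random.Random(42)
n_sims = 5000
a_advances = sum(play_knockout("A", "B", 1.8, 0.9, rng) == "A" for _ in range(n_sims))
share_a = a_advances / n_sims
print(f"Team A advances in {share_a:.1%} of simulated ties")
```

Chaining this function round by round over a bracket, and repeating the whole bracket many times, is the basic shape of a tournament-level Monte Carlo estimate.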
Finding 5: Elo Is 100x More Important Than Any Other Feature
XGBoost's feature importance analysis confirms that the Elo difference between the two teams is approximately 100 times more important than the next most important feature (Elo sum), which in turn is significantly more important than any rolling statistics or match context features. Elo ratings are a compressed representation of all historical match outcomes, updated after every game, and calibrated specifically to predict future performance. Their dominance reflects that the most predictive signal in football is team quality — who is the better team? — and Elo provides the most efficient encoding of that information available from public data sources.
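For readers unfamiliar with how Elo encodes match history, here is the textbook update rule alongside the two Elo features named above. This is a sketch of the standard formula only: the `k` value is a hypothetical default, and the rating source used in the project may apply extra adjustments (for example for goal margin or match importance) that are not shown here.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Standard Elo expected score for team A (win = 1, draw = 0.5, loss = 0)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 20.0):
    """Return both teams' post-match ratings. k is an assumed constant."""
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


def elo_features(rating_a: float, rating_b: float) -> dict:
    """Leakage-safe features: computed from PRE-match ratings only."""
    return {"elo_diff": rating_a - rating_b, "elo_sum": rating_a + rating_b}


# Order matters: build features first, then update ratings with the result.
pre_a, pre_b = 1900.0, 1500.0  # hypothetical ratings
features = elo_features(pre_a, pre_b)
post_a, post_b = update_elo(pre_a, pre_b, score_a=1.0)
print(features, round(post_a, 2), round(post_b, 2))
```

Computing the features before applying the update is what keeps the match result out of its own inputs.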
Frequently Asked Questions
Why does XGBoost outperform Bayesian Poisson for FIFA World Cup prediction?
XGBoost edges Bayesian Poisson by only 0.00027 RPS, a negligible margin, but does so while training roughly 100 times faster. The gap does not indicate meaningful algorithmic superiority: XGBoost captures non-linear interactions between Elo difference and match context features slightly better than the Poisson model's log-linear structure allows. Both models are essentially at the performance ceiling imposed by the approximately 6,900 available training matches. For production use, XGBoost wins on compute cost alone; for research use requiring full uncertainty quantification, Bayesian Poisson is equally valid.
What is Ranked Probability Score (RPS) and why use it for football prediction?
Ranked Probability Score measures the accuracy of probabilistic predictions over ordered outcomes, which makes it well suited to football goals, since goal counts follow a natural ordering (0, 1, 2, 3... goals). It compares the cumulative predicted distribution with the cumulative observed outcome, so unlike simple accuracy it penalizes confident wrong predictions more severely than uncertain ones and rewards predictions that place probability mass near the correct outcome even when they are not exactly right. In its normalized form, RPS is lower-is-better and ranges from 0 (perfect) to 1 (maximally wrong). It is a standard evaluation metric in football forecasting research because it respects the ordinal structure of scoreline distributions.
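A compact implementation makes the definition concrete. The function below computes the normalized RPS for a single forecast over ordered categories. How the project bins goal counts (for example, capping at some maximum number of goals) is not stated in the article, so the three-category example is purely illustrative.

```python
from itertools import accumulate


def ranked_probability_score(probs, observed_index):
    """Normalized RPS for one forecast over K ordered categories.

    probs: predicted probabilities for categories 0..K-1 (should sum to 1).
    observed_index: index of the category that actually occurred.
    """
    k = len(probs)
    if k < 2:
        raise ValueError("need at least two ordered categories")
    if not 0 <= observed_index < k:
        raise ValueError("observed_index out of range")
    cum_forecast = list(accumulate(probs))
    total = 0.0
    # The last cumulative term is always 1 vs 1, so it is skipped.
    for i in range(k - 1):
        cum_observed = 1.0 if i >= observed_index else 0.0
        total += (cum_forecast[i] - cum_observed) ** 2
    return total / (k - 1)


# The same confident forecast scored against a near miss and a far miss.
forecast = [0.8, 0.1, 0.1]
near_miss = ranked_probability_score(forecast, 1)
far_miss = ranked_probability_score(forecast, 2)
print(near_miss, far_miss)
```

The far miss scores worse than the near miss even though both are wrong, which is the ordinal sensitivity that plain accuracy or log loss on unordered classes would not capture.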
How does DVC help with MLOps reproducibility for this football prediction pipeline?
DVC versions every data file and pipeline stage using content-addressable hashing stored alongside your Git repository. Each dvc repro run checks whether inputs and code have changed before rerunning any stage — if nothing has changed, the cached output is used instantly. This means any past model run can be reproduced exactly by checking out the corresponding Git commit and running dvc checkout. For a tournament pipeline that retriggers bi-hourly, DVC also prevents unnecessary reprocessing when new match data has not yet arrived, keeping Cloud Run execution costs low.
How does the Monte Carlo simulation handle the 495 FIFA third-place qualification combinations?
The 2026 World Cup format has 12 groups, and 8 of the 12 third-placed finishers advance to the knockout round. There are C(12,8) = 495 possible sets of qualifying third-placed groups. FIFA defines a specific bracket position for each third-placed team depending on which combination of groups produced the 8 qualifiers. The simulation hardcodes a complete lookup dictionary mapping all 495 group-letter combinations to their official FIFA bracket slot assignments, exactly mirroring the official competition regulations. Each simulation run determines which 8 third-place groups qualified, looks up the combination in this mapping, and assigns teams to the correct knockout bracket positions.
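The count of 495 is easy to verify, and enumerating the combinations gives the exact set of keys such a lookup dictionary needs. The sketch below builds only the keys. The official bracket slot assignments are not reproduced here, and `lookup_key` is a hypothetical helper showing one way to normalize a set of qualifying groups into a dictionary key.

```python
from itertools import combinations
from math import comb

GROUPS = "ABCDEFGHIJKL"  # the 12 groups of the 2026 format

# Every possible set of 8 groups whose third-placed team advances.
third_place_sets = ["".join(c) for c in combinations(GROUPS, 8)]
print(len(third_place_sets), comb(12, 8))


def lookup_key(qualifying_groups) -> str:
    """Canonical key: the 8 qualifying group letters in alphabetical order."""
    letters = sorted(set(qualifying_groups))
    if len(letters) != 8 or any(g not in GROUPS for g in letters):
        raise ValueError("expected 8 distinct group letters from A to L")
    return "".join(letters)


# A simulation run would use this key to index the hardcoded slot mapping.
key = lookup_key(["L", "K", "J", "I", "H", "G", "F", "E"])
print(key, key in set(third_place_sets))
```

Sorting the letters before the lookup means the order in which a simulation run discovers the qualifying groups cannot affect which bracket mapping is selected.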
Does continuous retraining during the FIFA World Cup improve prediction accuracy?
The dual-mode experiment — Frozen (parameters locked at tournament start) vs. Per-Round (retrained after each matchday) — is specifically designed to answer this question with tournament data. The hypothesis is that Per-Round retraining should improve predictions in the knockout stages by incorporating the actual performance data teams showed in the group stage, potentially revealing form changes not captured in the pre-tournament Elo ratings. Whether this advantage materializes depends on whether within-tournament form signals are predictive enough to overcome the noise in limited within-tournament samples. Results will be updated as the 2026 tournament progresses.
About the author
Founder & CEO, Braincuber Technologies
Founder and CEO of Braincuber. Has scoped and shipped 500+ Odoo, AI, and cloud projects for US mid-market and global brands. Takes every founder call personally — no SDR layer between buyers and the people building the system.
