Model calibration and validation

Learning Objectives

  • Understand what calibration and validation mean in decision modeling
  • Recognize when and why calibration is needed
  • Identify the key components of model calibration
  • Describe the different types of model validation and their role in assessing credibility

Evidence for decision analysis

  • Three types of evidence are available to inform a decision analysis:
    • Evidence describing mechanisms that relate parameters to modeled outcomes → used to design the model structure
    • Evidence describing most‑likely values for model parameters → used to set model parameter values
    • Evidence describing most‑likely values for modeled outcomes → used for model calibration and validation

Terminology

  • Calibration: The process of fitting the model to evidence on modeled outcomes by adjusting input parameters

  • Validation: Assessing model quality by comparing model predictions to evidence on modeled outcomes

  • Calibration vs. Validation:
    “Calibrate until the model validates…”

Evidence on modeled outcomes (calibration / validation targets)

  • Known ranges of model outputs (e.g., probabilities ∈ [0,1])
  • Empirical data (e.g., survey data on disease prevalence)
  • Estimates from published studies (e.g., mean ± CI)
  • Results from other modeling studies
  • Expert opinion

Calibration

Why calibration?

  • Mathematical models of disease often involve a subset of parameters whose values are unknown/unobservable
    • Common reasons include data, feasibility, ethical, and biological considerations
    • Examples: diseases with a latent, unobservable stage (e.g., TB, COVID-19)
  • Values for these parameters can be estimated by matching model outputs to observed outcomes
    • Models can produce “data” that can be compared to observed epidemiological or clinical study data

Calibration – definition

  • Adjust input values such that model results match observed outcomes.

Key components of a calibration

  1. Calibration targets – empirical data to be matched by the model.
  2. Goodness‑of‑fit (GoF) – quantitative measure of how well model-predicted outcomes match observed targets.
  3. Search strategy / algorithm – fitting method used to search the parameter space to find the best-fitting parameters.

Goodness‑of‑fit measures

  • Qualitative: no explicit criteria; fit is judged by visual inspection (“eyeballing” the plot).
  • Quantitative: objective function used to evaluate fit to calibration data.
  • Approaches to scoring model fit:
    • Target windows
    • Distance-based functions (e.g., sum squared errors)
    • Likelihood-based functions

Qualitative (visual fit)

Target windows
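A minimal sketch of the target-window idea, assuming hypothetical model outputs and plausible ranges: a parameter set is accepted only if every model output falls inside its pre-specified (lower, upper) window.

```python
def within_target_windows(model_outputs, windows):
    """Return True if every model output falls inside its (lower, upper) target window."""
    return all(lo <= out <= hi for out, (lo, hi) in zip(model_outputs, windows))

# Hypothetical example: one prevalence target and one annual-incidence target
model_outputs = [0.11, 0.004]
windows = [(0.08, 0.14),      # plausible range for prevalence
           (0.002, 0.006)]    # plausible range for incidence

print(within_target_windows(model_outputs, windows))  # True -> accept this parameter set
```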

Distance‑based functions

Minimize sum of squared errors (SSE)

SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

Each error term is the distance between a model result and its corresponding target.

Where:
- n: number of calibration targets
- y_i: calibration target i
- \hat{y}_i: model outcome for target i

Potential issues
- How to weight different targets?
- Are some targets more important?
- Are some targets less uncertain?
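The weighting questions above are often handled by attaching a weight to each target, for example the inverse of its variance, so that more precisely measured targets count more. A minimal Python sketch, with hypothetical targets, model outputs, and standard deviations:

```python
import numpy as np

def weighted_sse(model_outputs, targets, weights=None):
    """Weighted sum of squared errors between model outputs and calibration targets.
    If weights is None, all targets count equally (plain SSE)."""
    model_outputs = np.asarray(model_outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    weights = np.ones_like(targets) if weights is None else np.asarray(weights, dtype=float)
    return float(np.sum(weights * (model_outputs - targets) ** 2))

# Hypothetical example: three targets with different uncertainty
targets = np.array([0.12, 0.05, 0.30])           # observed values to be matched
model_outputs = np.array([0.10, 0.06, 0.28])     # model-predicted counterparts
weights = 1.0 / np.array([0.01, 0.02, 0.05])**2  # inverse-variance weights (made-up SDs)

print(weighted_sse(model_outputs, targets, weights))
```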

Likelihood‑based functions

  • Calibration target created based on formal probability model for the data.
    • e.g., prevalence survey data drawn from Binomial distribution
      X \sim \text{Binom}(n, p)
    • e.g., case counts drawn from Poisson distribution
  • If data not available, can proxy likelihood with summary statistics reported in literature (mean, CI)
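As a concrete illustration, the sketch below scores a model-predicted prevalence against hypothetical prevalence-survey data using a Binomial likelihood; SciPy is assumed to be available and all numbers are invented.

```python
from scipy.stats import binom

def binomial_log_likelihood(predicted_prevalence, n_sampled, n_positive):
    """Log-likelihood of observing n_positive cases among n_sampled people,
    if the true prevalence equals the model-predicted prevalence."""
    return binom.logpmf(n_positive, n_sampled, predicted_prevalence)

# Hypothetical survey: 120 positives among 1,000 people sampled
print(binomial_log_likelihood(0.10, 1000, 120))  # model predicts 10% prevalence
print(binomial_log_likelihood(0.12, 1000, 120))  # 12% matches the data better -> higher log-likelihood
```

Summing these log-likelihood terms across independent targets gives a single goodness-of-fit score that can be maximized during calibration.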

Probability Mass/Density Function

Likelihood Function

Example: formulating a likelihood

Distance-/Likelihood-Based

Search strategies / optimization algorithms

  1. “Hand calibration” – adjust parameter(s) yourself to improve the calibration fit
    • If there is 1 outcome and only 1–2 parameters, a sensitivity-analysis approach can be used: change the parameters systematically to find the best fit (see the grid-search sketch below)
  2. Grid search
  3. Random search
  4. Automated search/Optimization routines – automated approaches to searching for parameter values that optimize a calibration target
  5. Probabilistic approaches (Bayesian calibration) – aim to identify sample of good fitting parameter sets, rather than single best fitting set

Hand calibration

  • Difficult to search for the exact optimal parameter set (prone to numerical errors)
  • Difficult where multiple parameters influence multiple outcomes of interest
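The systematic one- or two-parameter search mentioned above amounts to a small grid search: evaluate the goodness-of-fit over a grid of candidate values and keep the best. A minimal sketch with a toy stand-in model and hypothetical targets:

```python
import numpy as np

def run_model(progression_rate):
    """Toy stand-in for a disease model: returns two outcomes
    (e.g., prevalence and mortality) as simple functions of one parameter."""
    return np.array([0.5 * progression_rate, 0.1 + 2.0 * progression_rate])

targets = np.array([0.06, 0.34])  # hypothetical calibration targets

def sse(param):
    return float(np.sum((run_model(param) - targets) ** 2))

# Grid search over one uncertain parameter
grid = np.linspace(0.01, 0.5, 200)
fits = [sse(p) for p in grid]
best = grid[int(np.argmin(fits))]
print(f"Best-fitting progression rate: {best:.3f} (SSE = {min(fits):.6f})")
```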

Automated search/optimization routines

  • Use fits of past input values to determine which input values to try next
  • Can get stuck at a local optimum, so the search typically needs to be run multiple times from different starting points
  • Directed methods
    • Nelder-Mead (simplex method)
    • Gradient–descent and others
  • Meta-heuristic algorithms
    • Genetic algorithms
    • Simulated annealing

Nelder–Mead (simplex) – intuition

Three hikers search for the highest point using only elevation: reflect, expand, contract, shrink, stop when converged.

Nelder-Mead: Finding the Highest Point Without a Map

Imagine three hikers searching for the highest point on a hill (optimal solution) using only elevation (function values). They follow these steps:

  1. Start with a Triangle (Simplex): Each hiker stands at a different location and checks their elevation. The lowest hiker (worst point) needs to move.

  2. Reflect the Worst Hiker: Move them across the midpoint of the other two.

  3. Expand if Promising: If the new spot is higher, take a bigger step in that direction.

  4. Contract if Reflection Fails: If the move wasn’t helpful, take a smaller step closer to the midpoint.

  5. Shrink if Stuck: If there is no progress, all hikers move closer together.

  6. Converge & Stop: Once all hikers meet at the same spot, they’ve found the best location.
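In practice this search is rarely coded by hand; the sketch below uses SciPy's Nelder–Mead implementation to minimize a toy SSE objective over two uncertain parameters (the stand-in model, starting point, and targets are all hypothetical).

```python
import numpy as np
from scipy.optimize import minimize

targets = np.array([0.06, 0.34])  # hypothetical calibration targets

def run_model(params):
    """Toy stand-in for a disease model with two uncertain parameters."""
    force_of_infection, progression_rate = params
    return np.array([0.8 * force_of_infection, 0.1 + 0.6 * progression_rate])

def sse(params):
    return float(np.sum((run_model(params) - targets) ** 2))

result = minimize(sse, x0=[0.05, 0.2], method="Nelder-Mead")
print(result.x)    # best-fitting parameter values found
print(result.fun)  # SSE at that point
```

Because Nelder–Mead can stop at a local optimum, the search is usually repeated from several different starting points (x0).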

Probabilistic approaches (e.g. Bayesian calibration)

  • Specify function describing model fit criteria (target windows, SSE, likelihood)

  • Generate multiple parameter sets consistent with model fit criteria

    • Randomly sample from probability distributions for model parameters (one sample = parameter set)
    • Retain parameter sets that fall in target windows
    • OR Retain parameter sets probabilistically, with p(retain) proportional to fit statistic (e.g. likelihood) → Bayesian calibration
  • Use sample of fitted parameter sets to produce results

    • Mean estimate = average across parameter set sample
    • Measures of uncertainty calculated from distribution of results
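A minimal sketch of this idea (sample from the priors, weight by the likelihood, retain parameter sets in proportion to the weights, then summarize), using a toy one-parameter model and a made-up survey target. This is a simple sampling–importance–resampling illustration of the logic, not a full Bayesian calibration routine.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(1)

def run_model(prevalence_param):
    """Toy model: the predicted prevalence equals the sampled parameter."""
    return prevalence_param

# Hypothetical calibration target: 120 positives in a survey of 1,000 people
n_sampled, n_positive = 1000, 120

# 1. Sample parameter sets from the prior (here a single Uniform(0, 0.5) parameter)
prior_draws = rng.uniform(0.0, 0.5, size=20_000)

# 2. Weight each parameter set by its likelihood given the target data
weights = binom.pmf(n_positive, n_sampled, run_model(prior_draws))
weights /= weights.sum()

# 3. Retain parameter sets with probability proportional to the fit statistic
posterior_sample = rng.choice(prior_draws, size=2_000, replace=True, p=weights)

# 4. Use the fitted sample to report a mean estimate and an uncertainty interval
print(posterior_sample.mean())
print(np.percentile(posterior_sample, [2.5, 97.5]))
```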

Prior and posterior distributions for model parameters included in the calibration

Menzies NA, Soeteman DI, Pandya A, Kim JJ. Bayesian Methods for Calibrating Health Policy Models: A Tutorial. Pharmacoeconomics. 2017 Jun;35(6):613–24.

Bayesian calibration example

  • Isoniazid preventive therapy (IPT) can prevent TB among people receiving antiretroviral therapy (ART)
  • We investigated the cost-effectiveness of IPT by CD4 cell counts
  • Microsimulation model parameterized from a real-world IPT expansion program (MDH) in Tanzania

Simulation model

  • Individual-based TB/HIV coinfection model
  • The model simulates:
    • TB infection, progression and recovery
    • TB and HIV treatment the MDH cohort received
    • IPT provision and IPT-induced INH resistance
    • Health outcomes (TB incidence, mortality, life expectancy, and DALYs)
    • Economic outcomes (costs of care provision)

Parameter estimation and calibration

  • Some parameters directly estimated from MDH (CD4 trajectory, loss to follow-up rate, etc.)
  • Other parameters, such as the TB force of infection and TB progression rates, are uncertain or unobservable

This is exactly why we need calibration: these parameters were estimated using a Bayesian calibration approach – a method that allows us to match the model to real-world outcomes and quantify uncertainty in those estimates.

Bayesian calibration process

  • Example calibrated parameters: TB force of infection, prevalence of LTBI and undiagnosed active TB at baseline…
  • Prior distributions: from published literature
  • Targets: TB incidence and mortality, all-cause mortality from MDH (over time)
  • Algorithm to sample from posterior distribution: Incremental Mixture Importance Sampling (IMIS)
  • Posterior sample captures parameter uncertainty and was used for probabilistic sensitivity analysis to estimate uncertainty intervals in outcomes
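As a generic illustration of that last step (not the actual IMIS-calibrated microsimulation), the sketch below propagates a hypothetical posterior sample of parameter sets through a toy model and summarizes the resulting outcome distribution, which is how the posterior sample feeds the probabilistic sensitivity analysis.

```python
import numpy as np

rng = np.random.default_rng(7)

def run_model(force_of_infection, progression_rate):
    """Toy stand-in for the microsimulation: returns a single outcome
    (e.g., TB incidence per 100,000 person-years)."""
    return 100_000 * force_of_infection * progression_rate

# Hypothetical posterior sample of calibrated parameter sets
force_of_infection = rng.normal(0.02, 0.003, size=2_000)
progression_rate = rng.normal(0.10, 0.02, size=2_000)

# Run the model once per posterior parameter set (probabilistic sensitivity analysis)
outcomes = run_model(force_of_infection, progression_rate)

print(outcomes.mean())                       # mean estimate
print(np.percentile(outcomes, [2.5, 97.5]))  # 95% uncertainty interval
```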

Bayesian calibration does not stop at point estimates of parameter values

Calibration results – TB cases

Calibration results – All-cause deaths

Calibration results – TB deaths

Validation

Once a model is developed, how do we assess if the results are credible?

Transparency:
Clearly describing the model structure, equations, parameter values, and assumptions to enable interested parties to understand the model

Validation:
Comparing model results with events observed in reality

Eddy DM, Hollingworth W, Caro JJ, Tsevat J, McDonald KM, Wong JB. Model Transparency and Validation: A Report of the ISPOR-SMDM Modeling Good Research Practices Task Force–7. Med Decis Making. 2012 Sep;32(5):733–43.

Validation

Face Validity
The model reflects current scientific understanding, as judged by experts.

Internal Validation (verification)
The model is implemented correctly and behaves as expected (e.g., code verification).

External Validation
Model outputs are compared with empirical observations not used in model development.

Predictive Validation
Can the model reproduce data that wasn’t available during development?

Cross-model Validation
Do different models give similar results for the same question?

Eddy DM, Hollingworth W, Caro JJ, Tsevat J, McDonald KM, Wong JB. Model Transparency and Validation: A Report of the ISPOR-SMDM Modeling Good Research Practices Task Force–7. Med Decis Making. 2012 Sep;32(5):733–43.

Internal validation

External validation

Predictive validation

Modeled projections of HIV prevalence in South Africa for 2012
→ 10 models fit to earlier data, then compared with the 2012 prevalence survey

Eaton, Bacaër et al, Lancet Global Health, 2015

Cross-model validation/model corroboration

Eaton, Bacaër et al, Lancet Global Health, 2015

Take-Home Messages

  • Calibration aligns model predictions with real-world data by adjusting uncertain parameters
  • Key elements: well-defined calibration targets, goodness-of-fit measures, and a strategy to search the parameter space
  • Bayesian calibration provides a distribution of plausible parameter sets, helping to quantify uncertainty
  • Multiple forms of validation (internal, external, predictive, cross-model) build trust in the model’s outputs

Further Reading