Model calibration and validation

Learning Objectives

  • Understand what calibration and validation mean in decision modeling
  • Recognize when and why calibration is needed
  • Identify the key components of model calibration
  • Describe the different types of model validation and their role in assessing credibility

Evidence for decision analysis

  • Three types of evidence are available to inform a decision analysis:
    • Evidence describing mechanisms that relate parameters to modeled outcomes → used to design the model structure
    • Evidence describing most‑likely values for model parameters → used to set model parameter values
    • Evidence describing most‑likely values for modeled outcomes → used for model calibration and validation

Terminology

  • Calibration: The process of fitting the model to evidence on modeled outcomes by adjusting input parameters

  • Validation: Assessing model quality by comparing model predictions to evidence on modeled outcomes

  • Calibration vs. Validation:
    “Calibrate until the model validates…”

Evidence on modeled outcomes (calibration / validation targets)

  • Known ranges of model outputs (e.g., probabilities ∈ [0,1])
  • Empirical data (e.g., survey data on disease prevalence)
  • Estimates from published studies (e.g., mean ± CI)
  • Results from other modeling studies
  • Expert opinion

Calibration

Why calibration?

  • Mathematical models of disease often involve a subset of parameters whose values are unknown/unobservable
    • Common reasons include data, feasibility, ethical, and biological considerations
    • Examples: diseases with a latent, unobservable stage (e.g., TB, COVID-19)
  • Values for these parameters can be estimated by matching model outputs to observed outcomes
    • Models can produce “data” that can be compared to observed epidemiological or clinical study data

Calibration – definition

  • Adjust input values such that model results match observed outcomes.

Key components of a calibration

  1. Calibration targets – empirical data to be matched by the model.
  2. Goodness‑of‑fit (GoF) – quantitative measure of how well model-predicted outcomes match observed targets.
  3. Search strategy / algorithm – fitting method used to search the parameter space to find the best-fitting parameters.

Goodness‑of‑fit measures

  • Qualitative: no explicit criteria; fit is judged by visual inspection (“eyeballing” the plot).
  • Quantitative: objective function used to evaluate fit to calibration data.
  • Approaches to scoring model fit:
    • Target windows
    • Distance-based functions (e.g., sum squared errors)
    • Likelihood-based functions

Qualitative (visual fit)

Target windows
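A minimal sketch of the target-window idea, assuming hypothetical model outputs and plausible ranges: a parameter set is accepted only if every model output falls inside its pre-specified (lower, upper) window.

```python
def within_target_windows(model_outputs, windows):
    """Return True if every model output falls inside its (lower, upper) target window."""
    return all(lo <= out <= hi for out, (lo, hi) in zip(model_outputs, windows))

# Hypothetical example: one prevalence target and one annual-incidence target
model_outputs = [0.11, 0.004]
windows = [(0.08, 0.14),      # plausible range for prevalence
           (0.002, 0.006)]    # plausible range for incidence

print(within_target_windows(model_outputs, windows))  # True -> accept this parameter set
```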

Distance‑based functions

Minimize sum of squared errors (SSE)

SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2

Each error term is the distance between a model result and its corresponding target.

Where:
- n: number of calibration targets
- y_i: calibration target i
- \hat{y}_i: model outcome for target i

Potential issues
- How to weight different targets?
- Are some targets more important?
- Are some targets less uncertain?
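The weighting questions above are often handled by attaching a weight to each target, for example the inverse of its variance, so that more precisely measured targets count more. A minimal Python sketch, with hypothetical targets, model outputs, and standard deviations:

```python
import numpy as np

def weighted_sse(model_outputs, targets, weights=None):
    """Weighted sum of squared errors between model outputs and calibration targets.
    If weights is None, all targets count equally (plain SSE)."""
    model_outputs = np.asarray(model_outputs, dtype=float)
    targets = np.asarray(targets, dtype=float)
    weights = np.ones_like(targets) if weights is None else np.asarray(weights, dtype=float)
    return float(np.sum(weights * (model_outputs - targets) ** 2))

# Hypothetical example: three targets with different uncertainty
targets = np.array([0.12, 0.05, 0.30])           # observed values to be matched
model_outputs = np.array([0.10, 0.06, 0.28])     # model-predicted counterparts
weights = 1.0 / np.array([0.01, 0.02, 0.05])**2  # inverse-variance weights (made-up SDs)

print(weighted_sse(model_outputs, targets, weights))
```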

Likelihood‑based functions

  • Calibration target created based on formal probability model for the data.
    • e.g., prevalence survey data drawn from Binomial distribution
      X \sim \text{Binom}(n, p)
    • e.g., case counts drawn from Poisson distribution
  • If data not available, can proxy likelihood with summary statistics reported in literature (mean, CI)
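As a concrete illustration, the sketch below scores a model-predicted prevalence against hypothetical prevalence-survey data using a Binomial likelihood; SciPy is assumed to be available and all numbers are invented.

```python
from scipy.stats import binom

def binomial_log_likelihood(predicted_prevalence, n_sampled, n_positive):
    """Log-likelihood of observing n_positive cases among n_sampled people,
    if the true prevalence equals the model-predicted prevalence."""
    return binom.logpmf(n_positive, n_sampled, predicted_prevalence)

# Hypothetical survey: 120 positives among 1,000 people sampled
print(binomial_log_likelihood(0.10, 1000, 120))  # model predicts 10% prevalence
print(binomial_log_likelihood(0.12, 1000, 120))  # 12% matches the data better -> higher log-likelihood
```

Summing these log-likelihood terms across independent targets gives a single goodness-of-fit score that can be maximized during calibration.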

Probability Mass/Density Function

Likelihood Function

Example: formulating a likelihood

Distance-/Likelihood-Based

Search strategies / optimization algorithms

  1. “Hand calibration” – adjust parameter(s) yourself to improve the calibration fit
    • If there is 1 outcome and only 1–2 parameters, a sensitivity-analysis approach can be used: change the parameters systematically to find the best fit (see the grid-search sketch below)
  2. Grid search
  3. Random search
  4. Automated search/Optimization routines – automated approaches to searching for parameter values that optimize a calibration target
  5. Probabilistic approaches (Bayesian calibration) – aim to identify sample of good fitting parameter sets, rather than single best fitting set

Hand calibration

  • Difficult to search for the exact optimal parameter set (prone to numerical errors)
  • Difficult where multiple parameters influence multiple outcomes of interest
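The systematic one- or two-parameter search mentioned above amounts to a small grid search: evaluate the goodness-of-fit over a grid of candidate values and keep the best. A minimal sketch with a toy stand-in model and hypothetical targets:

```python
import numpy as np

def run_model(progression_rate):
    """Toy stand-in for a disease model: returns two outcomes
    (e.g., prevalence and mortality) as simple functions of one parameter."""
    return np.array([0.5 * progression_rate, 0.1 + 2.0 * progression_rate])

targets = np.array([0.06, 0.34])  # hypothetical calibration targets

def sse(param):
    return float(np.sum((run_model(param) - targets) ** 2))

# Grid search over one uncertain parameter
grid = np.linspace(0.01, 0.5, 200)
fits = [sse(p) for p in grid]
best = grid[int(np.argmin(fits))]
print(f"Best-fitting progression rate: {best:.3f} (SSE = {min(fits):.6f})")
```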

Automated search/optimization routines

  • Use fits of past input values to determine which input values to try next
  • Can get stuck at a local optimum, so the search typically needs to be run multiple times from different starting points
  • Directed methods
    • Nelder-Mead (simplex method)
    • Gradient–descent and others
  • Meta-heuristic algorithms
    • Genetic algorithms
    • Simulated annealing

Nelder–Mead (simplex) – intuition

Three hikers search for the highest point using only elevation: reflect, expand, contract, shrink, stop when converged.

Nelder-Mead: Finding the Highest Point Without a Map

Imagine three hikers searching for the highest point on a hill (optimal solution) using only elevation (function values). They follow these steps:

  1. Start with a Triangle (Simplex): Each hiker stands at a different location and checks their elevation. The lowest hiker (worst point) needs to move.

  2. Reflect the Worst Hiker: Move them across the midpoint of the other two.

  3. Expand if Promising: If the new spot is higher, take a bigger step in that direction.

  4. Contract if Reflection Fails: If the move wasn’t helpful, take a smaller step closer to the midpoint.

  5. Shrink if Stuck: If there is no progress, all hikers move closer together.

  6. Converge & Stop: Once all hikers meet at the same spot, they’ve found the best location.
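In practice this search is rarely coded by hand; the sketch below uses SciPy's Nelder–Mead implementation to minimize a toy SSE objective over two uncertain parameters (the stand-in model, starting point, and targets are all hypothetical).

```python
import numpy as np
from scipy.optimize import minimize

targets = np.array([0.06, 0.34])  # hypothetical calibration targets

def run_model(params):
    """Toy stand-in for a disease model with two uncertain parameters."""
    force_of_infection, progression_rate = params
    return np.array([0.8 * force_of_infection, 0.1 + 0.6 * progression_rate])

def sse(params):
    return float(np.sum((run_model(params) - targets) ** 2))

result = minimize(sse, x0=[0.05, 0.2], method="Nelder-Mead")
print(result.x)    # best-fitting parameter values found
print(result.fun)  # SSE at that point
```

Because Nelder–Mead can stop at a local optimum, the search is usually repeated from several different starting points (x0).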

Probabilistic approaches (e.g. Bayesian calibration)

  • Specify function describing model fit criteria (target windows, SSE, likelihood)

  • Generate multiple parameter sets consistent with model fit criteria

    • Randomly sample from probability distributions for model parameters (one sample = parameter set)
    • Retain parameter sets that fall in target windows
    • OR Retain parameter sets probabilistically, with p(retain) proportional to fit statistic (e.g. likelihood) → Bayesian calibration
  • Use sample of fitted parameter sets to produce results

    • Mean estimate = average across parameter set sample
    • Measures of uncertainty calculated from distribution of results
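A minimal sketch of this idea (sample from the priors, weight by the likelihood, retain parameter sets in proportion to the weights, then summarize), using a toy one-parameter model and a made-up survey target. This is a simple sampling–importance–resampling illustration of the logic, not a full Bayesian calibration routine.

```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(1)

def run_model(prevalence_param):
    """Toy model: the predicted prevalence equals the sampled parameter."""
    return prevalence_param

# Hypothetical calibration target: 120 positives in a survey of 1,000 people
n_sampled, n_positive = 1000, 120

# 1. Sample parameter sets from the prior (here a single Uniform(0, 0.5) parameter)
prior_draws = rng.uniform(0.0, 0.5, size=20_000)

# 2. Weight each parameter set by its likelihood given the target data
weights = binom.pmf(n_positive, n_sampled, run_model(prior_draws))
weights /= weights.sum()

# 3. Retain parameter sets with probability proportional to the fit statistic
posterior_sample = rng.choice(prior_draws, size=2_000, replace=True, p=weights)

# 4. Use the fitted sample to report a mean estimate and an uncertainty interval
print(posterior_sample.mean())
print(np.percentile(posterior_sample, [2.5, 97.5]))
```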

Prior and posterior distributions for model parameters included in the calibration

Menzies NA, Soeteman DI, Pandya A, Kim JJ. Bayesian Methods for Calibrating Health Policy Models: A Tutorial. Pharmacoeconomics. 2017 Jun;35(6):613–24.

Bayesian calibration example

  • Isoniazid preventive therapy (IPT) can prevent TB among people receiving antiretroviral therapy (ART)
  • We investigated the cost-effectiveness of IPT by CD4 cell counts
  • Microsimulation model parameterized from a real-world IPT expansion program (MDH) in Tanzania

Simulation model

  • Individual-based TB/HIV coinfection model
  • The model simulates:
    • TB infection, progression and recovery
    • TB and HIV treatment the MDH cohort received
    • IPT provision and IPT-induced INH resistance
    • Health outcomes (TB incidence, mortality, life expectancy, and DALYs)
    • Economic outcomes (costs of care provision)

Parameter estimation and calibration

  • Some parameters directly estimated from MDH (CD4 trajectory, loss to follow-up rate, etc.)
  • Other parameters, such as the TB force of infection and TB progression rates, are uncertain or unobservable

This is exactly why we need calibration: these parameters were estimated using a Bayesian calibration approach – a method that allows us to match the model to real-world outcomes and quantify uncertainty in those estimates.

Bayesian calibration process

  • Example calibrated parameters: TB force of infection, prevalence of LTBI and undiagnosed active TB at baseline…
  • Prior distributions: from published literature
  • Targets: TB incidence and mortality, all-cause mortality from MDH (over time)
  • Algorithm to sample from posterior distribution: Incremental Mixture Importance Sampling (IMIS)
  • Posterior sample captures parameter uncertainty and was used for probabilistic sensitivity analysis to estimate uncertainty intervals in outcomes
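As a generic illustration of that last step (not the actual IMIS-calibrated microsimulation), the sketch below propagates a hypothetical posterior sample of parameter sets through a toy model and summarizes the resulting outcome distribution, which is how the posterior sample feeds the probabilistic sensitivity analysis.

```python
import numpy as np

rng = np.random.default_rng(7)

def run_model(force_of_infection, progression_rate):
    """Toy stand-in for the microsimulation: returns a single outcome
    (e.g., TB incidence per 100,000 person-years)."""
    return 100_000 * force_of_infection * progression_rate

# Hypothetical posterior sample of calibrated parameter sets
force_of_infection = rng.normal(0.02, 0.003, size=2_000)
progression_rate = rng.normal(0.10, 0.02, size=2_000)

# Run the model once per posterior parameter set (probabilistic sensitivity analysis)
outcomes = run_model(force_of_infection, progression_rate)

print(outcomes.mean())                       # mean estimate
print(np.percentile(outcomes, [2.5, 97.5]))  # 95% uncertainty interval
```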

Bayesian calibration does not stop at point estimates of parameter values

Calibration results – TB cases

Calibration results – All-cause deaths

Calibration results – TB deaths

Validation

Once a model is developed, how do we assess if the results are credible?

Transparency:
Clearly describing the model structure, equations, parameter values, and assumptions to enable interested parties to understand the model

Validation:
Comparing model results with events observed in reality

Eddy DM, Hollingworth W, Caro JJ, Tsevat J, McDonald KM, Wong JB. Model Transparency and Validation: A Report of the ISPOR-SMDM Modeling Good Research Practices Task Force–7. Med Decis Making. 2012 Sep;32(5):733–43.

Validation

Face Validity
The model reflects current scientific understanding, as judged by experts.

Internal Validation (verification)
The model is implemented correctly and behaves as expected (e.g., code verification).

External Validation
Model outputs are compared with empirical observations not used in model development.

Predictive Validation
Can the model reproduce data that wasn’t available during development?

Cross-model Validation
Do different models give similar results for the same question?

Eddy DM, Hollingworth W, Caro JJ, Tsevat J, McDonald KM, Wong JB. Model Transparency and Validation: A Report of the ISPOR-SMDM Modeling Good Research Practices Task Force–7. Med Decis Making. 2012 Sep;32(5):733–43.

Internal validation

External validation

Predictive validation

Modeled projections of HIV prevalence in South Africa for 2012
→ 10 models fit to earlier data, then compared with the 2012 prevalence survey

Eaton, Bacaër et al, Lancet Global Health, 2015

Cross-model validation/model corroboration

Eaton, Bacaër et al, Lancet Global Health, 2015

Take-Home Messages

  • Calibration aligns model predictions with real-world data by adjusting uncertain parameters
  • Key elements: well-defined calibration targets, goodness-of-fit measures, and a strategy to search the parameter space
  • Bayesian calibration provides a distribution of plausible parameter sets, helping to quantify uncertainty
  • Multiple forms of validation (internal, external, predictive, cross-model) build trust in the model’s outputs

Further Reading