Title: Minimum Sample Size Calculation for Prediction Modelling Using
Surrogate Modelling

Authors: Felix Zimmer, Gordon Forbes, Diana Shamsutdinova, Daniel Stahl, Ewan
Carr

Abstract:
The development of clinical prediction models is a complex process that
requires careful planning. Adequate sample sizes are critical when collecting
the initial data, as the model’s performance in the development sample should
be comparable to its performance in a larger sample. Beyond the clinical
sciences, prediction models are also valuable in psychometrics, for instance,
in predicting implicit motives or postpartum depression. Guidance for minimum
sample sizes has been developed using analytical methods for continuous, binary,
or time-to-event outcomes. However, these approaches cannot be easily adapted
to more complex scenarios. Several studies have used simulation to estimate
minimum sample sizes, for example, using penalised logistic regression or
clustered data. However, no generalised, simulation-based framework or
software package currently exists. Here, we introduce an efficient and
flexible framework to derive minimum sample sizes for prediction modelling. We
adopt a surrogate modelling approach originally designed for optimising
statistical power. It uses Gaussian process regression to efficiently search
for the minimum sample size, greatly reducing the number of simulations
required. It is generalisable to any data type (e.g., clustered or
longitudinal), modelling strategy (e.g., gradient boosted trees) or
performance measure (e.g., AUC). Our workflow starts with specification of (i)
a data-generating function, (ii) a modelling function, (iii) the expected
performance of the model, and (iv) the range of sample sizes to evaluate. The
framework then uses iterative simulation to identify the minimum sample size
that replicates the large-sample model performance to within a specified
precision threshold (e.g., within 0.05 of the expected AUC). We are currently
developing a user-friendly R package for our approach. Our goal is to reduce
research waste and improve patient outcomes by providing an efficient and
flexible framework. Our approach will benefit researchers in clinical sciences
and other fields that utilise prediction models.
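
The four inputs named in the abstract, (i) a data-generating function, (ii) a modelling function, (iii) the expected performance, and (iv) a range of sample sizes, can be illustrated with a minimal Python sketch. This is not the authors' R package or their algorithm: the function names, the two-predictor logistic data model, the coefficient values, and the default settings are all assumptions made for illustration, and the Gaussian process surrogate search is replaced here by a plain scan over candidate sizes, which needs far more simulations than the approach the abstract describes.

```python
import math
import random


def sigmoid(z):
    # Clamp the linear predictor to avoid overflow in math.exp.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def generate_data(n, rng):
    """(i) Hypothetical data-generating function: two predictors, binary outcome."""
    rows = []
    for _ in range(n):
        x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
        p = sigmoid(0.8 * x1 + 0.5 * x2)  # assumed true coefficients
        rows.append((x1, x2, 1 if rng.random() < p else 0))
    return rows


def fit_model(rows, steps=100, lr=0.5):
    """(ii) Hypothetical modelling function: logistic regression by gradient descent."""
    b0 = b1 = b2 = 0.0
    n = len(rows)
    for _ in range(steps):
        g0 = g1 = g2 = 0.0
        for x1, x2, y in rows:
            err = sigmoid(b0 + b1 * x1 + b2 * x2) - y
            g0 += err
            g1 += err * x1
            g2 += err * x2
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
        b2 -= lr * g2 / n
    return b0, b1, b2


def auc(scores, labels):
    """Rank-based (Mann-Whitney) AUC; ties are not averaged in this sketch."""
    order = sorted(range(len(scores)), key=scores.__getitem__)
    rank_sum = sum(rank for rank, i in enumerate(order, start=1) if labels[i] == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def minimum_sample_size(candidate_sizes, expected_auc, tolerance=0.05,
                        reps=20, n_validation=2000, seed=1):
    """Return the smallest candidate size whose mean validation AUC is no more
    than `tolerance` below `expected_auc`, or None if no candidate qualifies.

    (iii) expected_auc and (iv) candidate_sizes are supplied by the user.
    Assumption: a plain scan stands in for the surrogate-model search.
    """
    rng = random.Random(seed)
    validation = generate_data(n_validation, rng)  # large held-out sample
    labels = [y for _, _, y in validation]
    for n in sorted(candidate_sizes):
        aucs = []
        for _ in range(reps):
            b0, b1, b2 = fit_model(generate_data(n, rng))
            scores = [b0 + b1 * x1 + b2 * x2 for x1, x2, _ in validation]
            aucs.append(auc(scores, labels))
        if sum(aucs) / reps >= expected_auc - tolerance:
            return n
    return None
```

A call such as `minimum_sample_size(range(50, 501, 50), expected_auc=0.72)` would scan sizes from 50 to 500 in steps of 50; the value 0.72 is a placeholder and not a result from the paper. Each candidate size costs `reps` full simulate-and-fit cycles, which is the expense that a surrogate model fitted to a handful of evaluated sizes is meant to reduce.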