Title: Minimum Sample Size Calculation for Prediction Modelling Using Surrogate Modelling
Authors: Felix Zimmer, Gordon Forbes, Diana Shamsutdinova, Daniel Stahl, Ewan Carr
Abstract: The development of clinical prediction models is a complex process that requires careful planning. Adequate sample sizes are critical when collecting the initial data, as the model’s performance in the development sample should be comparable to its performance in a larger sample. Beyond the clinical sciences, prediction models are also valuable in psychometrics, for instance, in predicting implicit motives or postpartum depression. Guidance for minimum sample sizes has been developed using analytical methods for linear, logistic, or time-to-event outcomes. However, these approaches cannot be easily adapted to more complex scenarios. Several studies have used simulation to estimate minimum sample sizes, for example, using penalised logistic regression or clustered data. However, no generalised, simulation-based framework or software package currently exists. Here, we introduce an efficient and flexible framework to derive minimum sample sizes for prediction modelling. We adopt a surrogate modelling approach originally designed for optimising statistical power. This approach uses Gaussian process regression to efficiently search for the minimum sample size, greatly reducing the number of simulations required. It is generalisable to any data type (e.g., clustered or longitudinal), modelling strategy (e.g., gradient boosted trees) or performance measure (e.g., AUC). Our workflow starts with specification of (i) a data-generating function, (ii) a modelling function, (iii) the expected performance of the model, and (iv) the range of sample sizes to evaluate. The framework then uses iterative simulation to identify the minimum sample size that replicates the large-sample model performance to within a specified precision threshold (e.g., within 0.05 of the expected AUC).
We are currently developing a user-friendly R package for our approach. Our goal is to reduce research waste and improve patient outcomes by providing an efficient and flexible framework. Our approach will benefit researchers in clinical sciences and other fields that utilise prediction models.
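The four workflow inputs described in the abstract can be illustrated with a minimal Python sketch that uses only the standard library. Everything here is an assumption made for illustration and not the authors' implementation or the API of their R package: the function names (`generate_data`, `fit_model`, `mean_auc`, `minimum_sample_size`), the toy data-generating process with five Gaussian predictors, the difference-of-class-means model, the tolerance of 0.02, and the use of a large-sample fit as a stand-in for the expected AUC. Most importantly, the sketch replaces the Gaussian process surrogate with a plain bisection search over the sample-size range, which assumes that mean performance increases with sample size and typically needs more simulations than a surrogate model would.

```python
import math
import random

P = 5       # number of predictors (assumed for illustration)
BETA = 0.5  # common true coefficient (assumed for illustration)

def generate_data(n, rng):
    """(i) Data-generating function: Gaussian predictors, binary outcome."""
    X, y = [], []
    for _ in range(n):
        row = [rng.gauss(0.0, 1.0) for _ in range(P)]
        prob = 1.0 / (1.0 + math.exp(-BETA * sum(row)))
        X.append(row)
        y.append(1 if rng.random() < prob else 0)
    return X, y

def fit_model(X, y):
    """(ii) Modelling function: toy linear score from the difference of class means."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    if not pos or not neg:  # degenerate sample with a single class
        return [0.0] * P
    return [sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
            for j in range(P)]

def auc(scores, labels):
    """AUC via the rank-sum (Mann-Whitney) statistic, with midranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        midrank = (i + j) / 2 + 1  # average 1-based rank of the tied block
        for k in range(i, j + 1):
            ranks[order[k]] = midrank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, label in zip(ranks, labels) if label == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def mean_auc(n, test, n_reps, rng):
    """Average test-set AUC over n_reps simulated development samples of size n."""
    X_test, y_test = test
    total = 0.0
    for _ in range(n_reps):
        w = fit_model(*generate_data(n, rng))
        scores = [sum(wj * xj for wj, xj in zip(w, row)) for row in X_test]
        total += auc(scores, y_test)
    return total / n_reps

def minimum_sample_size(expected, n_lo, n_hi, test, rng, tol=0.02, n_reps=20, step=10):
    """Bisection over [n_lo, n_hi]; a simple stand-in for the surrogate search."""
    target = expected - tol
    if mean_auc(n_hi, test, n_reps, rng) < target:
        return None  # the target is not reached anywhere in the range
    if mean_auc(n_lo, test, n_reps, rng) >= target:
        return n_lo
    while n_hi - n_lo > step:
        mid = (n_lo + n_hi) // 2
        if mean_auc(mid, test, n_reps, rng) >= target:
            n_hi = mid
        else:
            n_lo = mid
    return n_hi

rng = random.Random(1)
test_set = generate_data(1000, rng)              # fixed evaluation sample
expected_auc = mean_auc(5000, test_set, 3, rng)  # (iii) stand-in for the expected AUC
n_min = minimum_sample_size(expected_auc, 20, 1000, test_set, rng)  # (iv) range 20..1000
print(f"expected AUC: {expected_auc:.3f}, estimated minimum n: {n_min}")
```

Because each evaluation is a noisy simulation average, the bisection result is itself an estimate and can vary with the seed and with `n_reps`. Replacing the bisection with a model of performance as a function of sample size, as the surrogate approach in the abstract does, is what would let the search reuse information across all evaluated sample sizes.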