Validation

The process of testing to see whether an algorithm works well.

Validation refers to the process of testing whether an algorithm ‘works’ or, put simply, whether it can do the job it was trained to do, and do it well.

This typically involves giving the algorithm new data it hasn’t seen before and testing to see whether it can still correctly perform the task it was designed for. Can it classify the images (i.e., tell the difference between a healthy and unhealthy scan) or predict what’s going to happen to the patient (i.e., identify that a patient is at high risk of developing diabetes) based on electronic health record (EHR) data? In this way, the process of validation is very similar to the process of sitting an exam at the end of a course at school or university: an exam is a way of validating whether you have genuinely learned and understood a particular topic. 

Here’s the typical process when testing a healthcare AI algorithm:

  1. The algorithm is trained on data from Hospital A
  2. It achieves high levels of accuracy at prediction
  3. It is given data from Hospital B to see if it can still make accurate predictions

To ensure generalisability, the ‘unseen’ dataset used for validation should come from (a) a different location (e.g., a different hospital) and/or (b) a different time period. The exact metric (or grading system) used to ‘test’ the algorithm will depend on the context, i.e., what the algorithm is designed to do and what type of algorithm it is. For more, see C-Index.
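
The sketch below illustrates this train-then-externally-validate workflow in Python. It is a minimal, hypothetical example: the synthetic arrays stand in for real Hospital A and Hospital B cohorts, and the logistic regression model and ROC AUC metric are illustrative choices rather than a prescription.

```python
# Minimal sketch of the train-on-Hospital-A, validate-on-Hospital-B workflow.
# All data here are synthetic stand-ins for real cohorts.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Hospital A: development cohort (features + binary outcome, e.g., high-risk vs not).
X_a = rng.normal(size=(1000, 10))
y_a = (X_a[:, 0] + 0.5 * X_a[:, 1] + rng.normal(size=1000) > 0).astype(int)

# Hospital B: external cohort with a slightly shifted feature distribution.
X_b = rng.normal(loc=0.3, size=(400, 10))
y_b = (X_b[:, 0] + 0.5 * X_b[:, 1] + rng.normal(size=400) > 0).astype(int)

# 1. Train on Hospital A and record the apparent (internal) performance.
model = LogisticRegression(max_iter=1000).fit(X_a, y_a)
auc_internal = roc_auc_score(y_a, model.predict_proba(X_a)[:, 1])

# 2. External validation: score unseen patients from Hospital B.
auc_external = roc_auc_score(y_b, model.predict_proba(X_b)[:, 1])

print(f"Hospital A (internal) AUC: {auc_internal:.3f}")
print(f"Hospital B (external) AUC: {auc_external:.3f}")  # a large drop flags poor generalisability
```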

After an algorithm has been trained, it’s important to test the model to ensure its ‘learned knowledge’ is accurate, complete, and consistently capable of providing a suitable output – be that ‘advice’ (in the form of clinical decision support), classification or prediction. This is a process known as validation. 

This is crucial because it can highlight issues such as bias, overfitting and the resulting poor generalisability, and problems with the overall accuracy or performance of the model such as a tendency to generate excessive false positives. If these problems are not caught early in the development process of an algorithm, once it’s deployed, the algorithm could cause harm to patients through misdiagnosis or missed diagnosis, or undermine scientific endeavours by producing results overly reliant on spurious correlation. 

The validation process

Exactly how a model is validated will vary depending on the type of model, what it’s designed to do, the wider context in which it was designed, and the clinical outcome it’s targeting. However, there are some generally applicable guidelines, as follows: 


  1. Validation should involve multiple stages: 
    a. Internal validation: validation of performance against a held-out subset of the training dataset. 
    b. Temporal validation: validation against data from the same clinical centre (e.g., the same hospital) as the training dataset, but from a different time period. 
    c. External validation: validation of performance against a dataset of patients from an entirely different clinical setting (e.g., a completely different hospital). 
  2. Validation should include testing for robustness, i.e., how well the model can cope with different patient case mixes, variation in missingness and uncertainty, and variation in instrumentation and care protocols between different settings. 
  3. Validation should include benchmarking: comparing the model’s performance to another model, a similar intervention, or human performance. 
  4. Validation should involve a range of statistical performance tests and metrics to identify, in detail, the strengths and weaknesses of the model. Such metrics might include the ROC AUC, mean absolute error, mean squared error, Rand error, warping error, positive predictive value, the concordance index (C-index), and more. Metrics used for testing recommender systems may also be used. The exact selection of metrics will depend on the type of model (e.g., supervised or unsupervised) and the type of clinical task (e.g., classification or prediction); a minimal sketch of such metric reporting follows this list. 
  5. Validation should include the transparent and open publication of results in accordance with relevant reporting guidelines, e.g., MINIMAR (MINimum Information for Medical AI Reporting), and, wherever feasible, all statistical metrics should be reported with confidence intervals or measures of statistical significance. 
  6. Validation studies should be conducted by multi-disciplinary teams involving statisticians, clinical informaticians, data scientists, and clinicians (at a minimum). 
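
As a companion to items 4 and 5 above, here is a minimal sketch of how several common metrics might be reported together with bootstrap 95% confidence intervals. The labels and scores are synthetic placeholders, and the 0.5 decision threshold and the particular metric selection are assumptions made purely for illustration.

```python
# Minimal sketch: report sensitivity, specificity, PPV and ROC AUC with
# bootstrap 95% confidence intervals. All data are synthetic placeholders.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=500)                                    # ground-truth labels
y_prob = np.clip(0.35 * y_true + rng.normal(0.4, 0.2, size=500), 0, 1)   # model risk scores

def compute_metrics(y, p, threshold=0.5):
    """Binary-classification metrics at a fixed decision threshold."""
    pred = (p >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y, pred, labels=[0, 1]).ravel()
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp) if (tp + fp) else float("nan"),
        "roc_auc": roc_auc_score(y, p),
    }

point_estimates = compute_metrics(y_true, y_prob)

# Non-parametric bootstrap over patients to attach a 95% CI to each metric.
boot = {name: [] for name in point_estimates}
for _ in range(1000):
    idx = rng.integers(0, len(y_true), size=len(y_true))
    if len(np.unique(y_true[idx])) < 2:          # AUC needs both classes present
        continue
    for name, value in compute_metrics(y_true[idx], y_prob[idx]).items():
        boot[name].append(value)

for name, estimate in point_estimates.items():
    low, high = np.percentile(boot[name], [2.5, 97.5])
    print(f"{name}: {estimate:.3f} (95% CI {low:.3f} to {high:.3f})")
```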


In other words, rigorous technical validation requires testing for accuracy, reliability, and robustness. This ensures that the model’s ‘knowledge’ is accurate and that the model can ‘perform’ well under a variety of different conditions. However, technical accuracy is not the same as clinical efficacy. 

Clinical efficacy can only be determined by testing models in real-life clinical settings, i.e., clinical evaluation. This might involve conducting a randomised controlled trial, a cluster trial, an observational study, or a time series analysis. Generating evidence of efficacy using one of these methods, depending on the level of ‘risk’ associated with the model, is a legal requirement if the model is intended to be used in a medical device (i.e., deployed in a clinical setting and not used just for research purposes). 

An Owkin example

Owkin published a study in Nature Communications in November 2023 validating MSIntuit CRC, our AI-based pre-screening tool for microsatellite instability (MSI) detection from colorectal cancer (CRC) histology slides. After training on samples from The Cancer Genome Atlas (TCGA), a blind validation was performed on an independent dataset of 600 consecutive CRC patients. Inter-scanner reliability was studied by digitising each slide using two different scanners. MSIntuit CRC achieved a sensitivity of 0.96–0.98, a specificity of 0.46–0.47, and excellent inter-scanner agreement (Cohen’s κ: 0.82). By reaching high sensitivity comparable to gold-standard methods while ruling out almost half of the non-MSI population, we showed that MSIntuit CRC can effectively serve as a pre-screening tool to alleviate the MSI testing burden in clinical practice.
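
For illustration only (this is not Owkin’s actual analysis code), inter-scanner agreement of the kind reported above can be quantified with Cohen’s kappa on the paired screen-in/screen-out calls produced for the same slides digitised on two scanners; the calls below are made-up placeholders.

```python
# Illustrative computation of Cohen's kappa for inter-scanner agreement.
# The per-slide calls below are synthetic, not MSIntuit CRC outputs.
from sklearn.metrics import cohen_kappa_score

# 1 = flagged as potentially MSI (screened in), 0 = ruled out; one entry per slide.
calls_scanner_1 = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
calls_scanner_2 = [1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1]

kappa = cohen_kappa_score(calls_scanner_1, calls_scanner_2)
print(f"Cohen's kappa (inter-scanner agreement): {kappa:.2f}")
```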

Further reading
  • Bindels, R., Winkens, R. A. G., Pop, P., van Wersch, J. W. J., Talmon, J., & Hasman, A. (2001). Validation of a knowledge based reminder system for diagnostic test ordering in general practice. International Journal of Medical Informatics, 64(2–3), 341–354. https://doi.org/10.1016/S1386-5056(01)00207-6
  • Bulgarelli, L., Deliberato, R. O., & Johnson, A. E. W. (2020). Prediction on critically ill patients: The role of “big data”. Journal of Critical Care, 60, 64–68. https://doi.org/10.1016/j.jcrc.2020.07.017
  • Char, D. S., Abràmoff, M. D., & Feudtner, C. (2020). Identifying Ethical Considerations for Machine Learning Healthcare Applications. The American Journal of Bioethics, 20(11), 7–17. https://doi.org/10.1080/15265161.2020.1819469
  • England, J. R., & Cheng, P. M. (2019). Artificial Intelligence for Medical Image Analysis: A Guide for Authors and Reviewers. American Journal of Roentgenology, 212(3), 513–519. https://doi.org/10.2214/AJR.18.20490
  • Hernandez-Boussard, T., Bozkurt, S., Ioannidis, J. P. A., & Shah, N. H. (2020). MINIMAR (MINimum Information for Medical AI Reporting): Developing reporting standards for artificial intelligence in health care. Journal of the American Medical Informatics Association, 27(12), 2011–2015. https://doi.org/10.1093/jamia/ocaa088
  • Hilal, I., Afifi, N., Ouzzif, M., & Belhaddaoui, H. (2015). Considering dependability requirements in the context of Decision Support Systems. 2015 IEEE/ACS 12th International Conference of Computer Systems and Applications (AICCSA), 1–8. https://doi.org/10.1109/AICCSA.2015.7507207
  • Lisboa, P. J. G. (2002). A review of evidence of health benefit from artificial neural networks in medical intervention. Neural Networks, 15(1), 11–39. https://doi.org/10.1016/S0893-6080(01)00111-3
  • Miller, P. L. (1986). The evaluation of artificial intelligence systems in medicine. Computer Methods and Programs in Biomedicine, 22(1), 3–11. https://doi.org/10.1016/0169-2607(86)90087-8
  • Neves, M. R., & Marsh, D. W. R. (2019). Modelling the Impact of AI for Clinical Decision Support. In D. Riaño, S. Wilk, & A. ten Teije (Eds.), Artificial Intelligence in Medicine (Vol. 11526, pp. 292–297). Springer International Publishing. https://doi.org/10.1007/978-3-030-21642-9_37
  • Ngiam, K. Y., & Khor, I. W. (2019). Big data and machine learning algorithms for health-care delivery. The Lancet Oncology, 20(5), e262–e273. https://doi.org/10.1016/S1470-2045(19)30149-4
  • Obermeyer, Z., & Emanuel, E. J. (2016). Predicting the Future—Big Data, Machine Learning, and Clinical Medicine. New England Journal of Medicine, 375(13), 1216–1219. https://doi.org/10.1056/NEJMp1606181
  • Olakotan, O. O., & Yusof, M. Mohd. (2020). Evaluating the alert appropriateness of clinical decision support systems in supporting clinical workflow. Journal of Biomedical Informatics, 106, 103453. https://doi.org/10.1016/j.jbi.2020.103453
  • Parikh, R. B., Obermeyer, Z., & Navathe, A. S. (2019). Regulation of predictive analytics in medicine. Science, 363(6429), 810–812. https://doi.org/10.1126/science.aaw0029
  • Park, S. H., & Han, K. (2018). Methodologic Guide for Evaluating Clinical Performance and Effect of Artificial Intelligence Technology for Medical Diagnosis and Prediction. Radiology, 286(3), 800–809. https://doi.org/10.1148/radiol.2017171920
  • Reisman, Y. (1996). Computer-based clinical decision aids. A review of methods and assessment of systems. Medical Informatics, 21(3), 179–197. https://doi.org/10.3109/14639239609025356