Generalizability

A measure of how well an algorithm works in a new setting.

An algorithm is described as ‘generalisable’ if it works equally well in multiple locations, for example multiple hospitals, and not just in the location in which it was trained. This is crucial in healthcare research: if results cannot be applied (or ‘generalised’) beyond the specific context in which the study took place, it will be difficult to use them in practice on a broader population in real-world settings.

Generalisability is, therefore, a measure of how well an algorithm works in a new setting. For example, an algorithm trained at a hospital in London that can recognise breast cancer both in scans taken at that hospital and in scans taken at a hospital in Paris would demonstrate good generalisability. However, if the algorithm could recognise breast cancer only in scans taken at its training hospital in London, and not in scans taken at a hospital in Paris, it would be considered to have poor generalisability.

Researchers can improve the generalizability of their studies by:

  • randomly selecting individuals for the study (using ‘random sampling’) with the goal of reducing the likelihood of bias 
  • increasing the number of people participating in the study, to increase confidence that the findings of the study are significant for a large group 
  • selecting a more diverse representative population or ‘sample’, so results are more readily applicable to many different types of people
  • training models across data from multiple hospitals, so results are more likely to work in multiple different situations

Using a combination of the techniques above, from the outset of a project, leads to more generalisable models that are also less likely to be biased.
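
As a rough illustration of random sampling across several hospitals, the sketch below draws the same number of patients at random from each site so that a large hospital does not dominate the sample. The record layout, the hospital names, and the `stratified_sample` helper are hypothetical and chosen only for this example; equal-sized groups are one possible design, not the only one.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, per_group, seed=0):
    """Draw up to `per_group` records at random from each group defined by `key`."""
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    sample = []
    for name in sorted(groups):
        members = groups[name]
        # A small site contributes everything it has rather than failing.
        k = min(per_group, len(members))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical patient records from three hospitals of very different sizes.
records = (
    [{"hospital": "A", "patient_id": i} for i in range(1000)]
    + [{"hospital": "B", "patient_id": i} for i in range(200)]
    + [{"hospital": "C", "patient_id": i} for i in range(50)]
)

sample = stratified_sample(records, key="hospital", per_group=50)
```

Without the per-hospital grouping, a simple random draw of 150 records would be expected to come mostly from Hospital A, which holds 80% of the records here.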

Generalisability refers to the ability of an algorithm to cope with new data. For example, if an algorithm designed to classify breast cancer tumours (i.e., differentiate between different types of tumours) can do this well in Hospital A, where it was trained, and performs equally well in Hospital B, then the algorithm can be considered generalisable. However, if the algorithm is unable to perform in Hospital B, or performs poorly (for example, often misclassifying tumours), then the algorithm would not be considered generalisable. Whilst there is some debate about whether generalisability is always necessary, it is usually considered an essential ‘success’ criterion.

There are different levels of generalizability: 

  • Internal
    Only generalisable within the context in which it was trained (i.e., the algorithm performs equally well on different segments of the training set within one hospital)
  • Temporal
    Generalisable on new data created in the same setting in which it was trained (i.e., the algorithm performs equally well on new patients being treated at the same hospital in which it was trained)
  • External
    Generalisable across settings (i.e., the algorithm performs equally well in Hospital A and Hospital B). 
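
One way to picture the three levels is as three different held-out datasets. The sketch below is a minimal illustration under stated assumptions: the `site` and `scan_date` fields, the cutoff date, and the `split_by_validation_level` helper are all hypothetical. Internal generalisability would be assessed within the development set (for example by cross-validation), temporal generalisability on later data from the same site, and external generalisability on data from any other site.

```python
from datetime import date


def split_by_validation_level(records, training_site, cutoff):
    """Partition records into development, temporal and external sets."""
    development, temporal, external = [], [], []
    for record in records:
        if record["site"] != training_site:
            external.append(record)      # different setting
        elif record["scan_date"] < cutoff:
            development.append(record)   # training site, before the cutoff
        else:
            temporal.append(record)      # training site, newer patients
    return development, temporal, external


# Hypothetical records; only the fields needed for the split are shown.
records = [
    {"site": "Hospital A", "scan_date": date(2021, 3, 1)},
    {"site": "Hospital A", "scan_date": date(2021, 9, 15)},
    {"site": "Hospital A", "scan_date": date(2023, 2, 10)},
    {"site": "Hospital B", "scan_date": date(2022, 6, 5)},
]

development, temporal, external = split_by_validation_level(
    records, training_site="Hospital A", cutoff=date(2022, 1, 1)
)
print(len(development), len(temporal), len(external))  # prints: 2 1 1
```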

In most instances, the aim would be to achieve the highest possible level of generalisability (i.e., external generalisability). However, in some circumstances this may not be possible, for example if the algorithm in question is used for the diagnosis, prognosis, or treatment of a rare condition that is only treated in one hospital.

If an algorithm that is supposed to be externally generalisable fails to generalise, this is likely the result of overfitting, underfitting, bias, or dataset drift:

  • Overfitting
    This is when an algorithm becomes so closely ‘fit’ to its training data that it also begins to learn noise and irrelevant details that may not be present in new data. The most ‘famous’ example of this is an algorithm that was trained to distinguish between huskies and wolves. The algorithm performed very well on the training dataset, but missed very obvious examples of wolves when it was tested. It was discovered that all the photos of wolves in the training dataset featured snow in the background. The algorithm had ‘overfit’ to the training dataset and expected all wolves to be pictured with snow. When it was presented with images of wolves in landscapes devoid of snow, it failed. Similar examples have occurred in healthcare settings, when an algorithm has learned to classify a type of tumour based on what scanner was used to produce the image. 
  • Underfitting
    This is the opposite of overfitting, and it is when an algorithm does not learn the features of its training data well enough and cannot perform well on either the training dataset or new data it is presented with. 
  • Bias
    If, for example, an algorithm is trained to predict the risk of myocardial infarction (heart attack) on a dataset that is made up of 80% male patients and 20% female patients, and then deployed in a hospital that sees 50% male patients and 50% female patients, it will likely more accurately predict the risk of myocardial infarction for the male patients than it will for the female patients. This is because the algorithm has been trained on a non-representative (or biased) dataset, and the symptoms and risk factors of myocardial infarction in women are known to be different to those in men. 
  • Dataset drift
    There are multiple different types of dataset drift, but in essence this is where the make-up (i.e., the demographics) of a population shifts over time. For example, the population served by a local hospital may age, with very few younger patients moving into the area to balance out the shift, or a local area may become more diverse over time. It is unlikely that these ‘new patients’ were included in the training dataset, even if it was representative of the population at the time, and so the algorithm is now unable to generalise to the new population.

Of these different causes of poor generalisability, overfitting is the most common.
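
Some of these problems can be spotted with simple checks on accuracy figures. The sketch below is illustrative only: the function names and the thresholds (`min_accuracy`, `max_gap`) are assumptions made for this example, not established standards. A large gap between training and test accuracy hints at overfitting, low accuracy on the training data itself hints at underfitting, and uneven accuracy across patient groups hints at bias.

```python
def diagnose_fit(train_accuracy, test_accuracy, min_accuracy=0.8, max_gap=0.1):
    """Give a rough label based on training and test accuracy (thresholds are assumptions)."""
    if train_accuracy < min_accuracy:
        # Poor even on the data it was trained on.
        return "possible underfitting"
    if train_accuracy - test_accuracy > max_gap:
        # Good on training data, much worse on new data.
        return "possible overfitting"
    return "no obvious fit problem"


def accuracy_by_group(y_true, y_pred, groups):
    """Compute accuracy separately for each group label (e.g. sex or site)."""
    correct, total = {}, {}
    for truth, prediction, group in zip(y_true, y_pred, groups):
        total[group] = total.get(group, 0) + 1
        correct[group] = correct.get(group, 0) + (truth == prediction)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical labels and predictions for four patients.
per_group = accuracy_by_group(
    y_true=[1, 0, 1, 1],
    y_pred=[1, 0, 0, 0],
    groups=["male", "male", "female", "female"],
)
```

In this toy example the predictions are all correct for the male patients and all wrong for the female patients, which is the kind of gap that an overall accuracy figure would hide.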

Methods for mitigating the risks of overfitting
  • Increasing the size and diversity of the training dataset
    Either through aggregating more datasets or via federated learning
  • Data augmentation
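    Expanding the training dataset with modified copies of existing examples (for example, rotated, flipped, or slightly noisy versions of an image), so that the algorithm sees more variation during training.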
  • Feature selection
    Identifying the most important features within the training dataset and eliminating those that are irrelevant or redundant (such as removing the snow in the wolves vs. huskies example). 
  • Regularisation
    Sometimes overfitting occurs because an algorithm becomes too complex. Regularisation refers to techniques that constrain the complexity of a model during training (for example, by penalising large weights), which makes it harder for the model to fit noise in the data. It is particularly useful when feature selection is not possible (for example, when it is not known in advance which features are the most relevant).
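
As a minimal sketch of regularisation, assuming the simplest possible model (one feature, no intercept, y ≈ w · x) and an L2 (‘ridge’) penalty, the weight that minimises the squared error plus `penalty * w**2` has a closed form. The `ridge_weight` helper and the toy data are hypothetical and only illustrate how a larger penalty shrinks the learned weight.

```python
def ridge_weight(xs, ys, penalty):
    """Weight w minimising sum((y - w*x)**2) + penalty * w**2."""
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    # Setting the derivative to zero gives w = sum_xy / (sum_xx + penalty).
    return sum_xy / (sum_xx + penalty)


# Toy data lying exactly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

unregularised = ridge_weight(xs, ys, penalty=0.0)   # 28 / 14 = 2.0
regularised = ridge_weight(xs, ys, penalty=14.0)    # 28 / 28 = 1.0
```

The penalty trades some fit to the training data for a simpler model. In practice its strength is usually chosen by checking performance on held-out data.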

These methods will improve the generalizability of a model upfront. However, it is always possible (due to issues such as dataset drift) that generalizability will degrade over time. This is why it is important to continuously monitor the performance of an algorithm after it has been deployed so that generalizability errors can be noted and dealt with.
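
A simple form of post-deployment monitoring is to track accuracy over consecutive batches of predictions and flag any batch that falls below an agreed level. The sketch below assumes that the true outcome eventually becomes known for each prediction; the `monitor_accuracy` helper, the window size, and the threshold are hypothetical choices for illustration.

```python
def monitor_accuracy(outcomes, window=100, threshold=0.85):
    """Return (start_index, accuracy) for each full window below the threshold.

    `outcomes` is a chronological list of 1 (prediction was correct)
    and 0 (prediction was wrong). A trailing partial window is ignored.
    """
    alerts = []
    for start in range(0, len(outcomes) - window + 1, window):
        chunk = outcomes[start:start + window]
        accuracy = sum(chunk) / window
        if accuracy < threshold:
            alerts.append((start, accuracy))
    return alerts


# Hypothetical history: a perfect first batch, then a batch with 30 errors.
history = [1] * 100 + [1] * 70 + [0] * 30
alerts = monitor_accuracy(history)
print(alerts)  # prints: [(100, 0.7)]
```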

An Owkin example

In a paper published in July 2023 in npj Precision Oncology, Owkin scientists used deep learning to predict patient outcomes and mutations from digitized pathology slides in gastrointestinal stromal tumor (GIST). All models were trained using cross-validation with the whole cohort and evaluated within these subgroups using data from Hospital 1, then tested independently using data from Hospital 2. In doing so, we ensured the results would generalize better to other settings and could, with future validation, potentially be integrated into clinical practice.

Further reading
  • Challen, Robert et al. 2019. ‘Artificial Intelligence, Bias and Clinical Safety’. BMJ Quality & Safety 28(3): 231–37.
  • Farah, Line et al. 2023. ‘Assessment of Performance, Interpretability, and Explainability in Artificial Intelligence–Based Health Technologies: What Healthcare Stakeholders Need to Know’. Mayo Clinic Proceedings: Digital Health 1(2): 120–38.
  • Futoma, Joseph et al. 2020. ‘The Myth of Generalizability in Clinical Research and Machine Learning in Health Care’. The Lancet Digital Health 2(9): e489–92.
  • Kocak, Burak, Ece Ates Kus, and Ozgur Kilickesmez. 2021. ‘How to Read and Review Papers on Machine Learning and Artificial Intelligence in Radiology: A Survival Guide to Key Methodological Concepts’. European Radiology 31(4): 1819–30.
  • Wan, Bohua, Brian Caffo, and S. Swaroop Vedula. 2022. ‘A Unified Framework on Generalizability of Clinical Prediction Models’. Frontiers in Artificial Intelligence 5: 872720.