Clinical healthcare datasets are an expensive prerequisite for conducting medical research with machine learning. Because they often cover rare pathologies, these datasets tend to be small and hard to acquire. Discover how transfer learning, a machine learning technique, can work alongside Owkin technologies to help deploy AI effectively on these datasets.
- What is Transfer Learning?
- Benefits of Transfer Learning
- Why is it important in healthcare?
- Transfer Learning in healthcare examples
What is Transfer Learning?
Transfer learning is a machine learning (ML) technique in which one reuses a model developed for one task as the starting point for a model trained on a second task. Said differently, we transfer knowledge acquired on one dataset to another, hence the name “transfer learning”. The idea mirrors how humans transfer knowledge across tasks: if we already know how to ride a motorbike, we will find it easier to learn how to drive a car.
The training of classical ML models requires high-quality data in high volumes. Without sufficient data, the algorithm will either not learn correctly or not learn at all. The conventional approach is to train a separate machine learning model for each dataset. When the model has no prior knowledge before training, we call this “training from scratch”. In this setting, an image model has to learn to discriminate visual patterns without any prior knowledge.
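The difference between training from scratch and starting from a pre-trained model can be sketched with a deliberately tiny example. The code below is an illustration under stated assumptions, not a realistic medical model: it fits a one-dimensional linear model by gradient descent, and the `source` and `target` datasets, the learning rate, and the tolerance are all hypothetical values chosen for the demonstration.

```python
def mse(data, w, b):
    """Mean squared error of the model y = w*x + b on (x, y) pairs."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

def train(data, w=0.0, b=0.0, lr=0.1, tol=1e-4, max_steps=10_000):
    """Gradient descent from the given starting point; returns (w, b, steps_used)."""
    steps = 0
    while mse(data, w, b) > tol and steps < max_steps:
        grad_w = 2 * sum((w * x + b - y) * x for x, y in data) / len(data)
        grad_b = 2 * sum((w * x + b - y) for x, y in data) / len(data)
        w, b = w - lr * grad_w, b - lr * grad_b
        steps += 1
    return w, b, steps

# Hypothetical toy data: a larger "source" task and a small, related "target" task.
source = [(x, 2 * x + 1.0) for x in (-2, -1, 0, 1, 2)]
target = [(x, 2 * x + 1.5) for x in (-1, 0, 1)]

# Pre-train on the source task, then compare the two starting points on the target task.
w_src, b_src, _ = train(source)
_, _, steps_scratch = train(target)                      # training from scratch
_, _, steps_transfer = train(target, w=w_src, b=b_src)   # starting from pre-trained weights

print("steps from scratch:", steps_scratch)
print("steps with transfer:", steps_transfer)
```

In this toy setting the pre-trained starting point is already close to the target solution, so fewer gradient steps are needed. How much is gained in practice depends on how related the two tasks are.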
Let’s examine a model designed to detect tumors from microscope whole slide images.
This model needs to recognize high-level visual patterns, such as tumors, which are themselves built from very low-level visual patterns (basic shapes, like the circles of cell nuclei). With transfer learning, the model has already learned to detect these low-level patterns on another, readily available dataset, and it reuses that knowledge on the dataset of interest. It then combines these low-level patterns to detect high-level patterns.
For instance, as we see in the picture above, in order to detect tumors, we used a network that was pre-trained on a dog dataset made up of thousands of images. Thanks to this, the network is able to detect low-level patterns, such as circles, which appear as the black dots in the Dalmatian’s coat. In the case of our microscope images, this same circle detector detects cell nuclei. The network then uses these low-level patterns, the cell nuclei, to detect high-level patterns, like the tumor itself.
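The idea that one low-level detector can serve two very different image domains can be illustrated with a hand-written filter. This is a minimal sketch: the 3x3 kernel is a hypothetical stand-in for a filter that a network might learn during pre-training, and the two 5x5 patches are made-up pixel grids (0 is dark, 1 is bright), not real images.

```python
# Fixed center-surround kernel: responds strongly to a dark pixel on a brighter background.
# Hypothetical stand-in for a low-level "spot detector" learned during pre-training.
SPOT_KERNEL = [
    [1, 1, 1],
    [1, -8, 1],
    [1, 1, 1],
]

def filter_response(image, kernel=SPOT_KERNEL):
    """Slide the kernel over the image (no padding) and return the response map."""
    k = len(kernel)
    rows, cols = len(image) - k + 1, len(image[0]) - k + 1
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j] for i in range(k) for j in range(k))
            for c in range(cols)
        ]
        for r in range(rows)
    ]

def strongest_spot(image):
    """Return (row, col, response) of the strongest response, in image coordinates."""
    response = filter_response(image)
    value, row, col = max(
        (value, r + 1, c + 1)  # +1 converts map indices to the window's center pixel
        for r, line in enumerate(response)
        for c, value in enumerate(line)
    )
    return row, col, value

# Made-up patch of a Dalmatian's coat: bright fur with one black dot in the middle.
dalmatian_patch = [[1.0] * 5 for _ in range(5)]
dalmatian_patch[2][2] = 0.0

# Made-up patch of tissue: pale background with one dark cell nucleus.
tissue_patch = [[0.8] * 5 for _ in range(5)]
tissue_patch[1][3] = 0.2

print(strongest_spot(dalmatian_patch)[:2])  # location of the coat spot
print(strongest_spot(tissue_patch)[:2])     # location of the nucleus
```

The same unchanged kernel locates the dark spot in both patches, which is the intuition behind reusing low-level features across domains.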
Benefits of Transfer Learning
If applied correctly, Transfer Learning has several benefits, including quicker training times, improved model performance, and a reduced need for data on the task of interest. Andrew Ng, a Deep Learning pioneer, described Transfer Learning as “the next driver of ML commercial success after supervised learning.”
Given the enormous amount of data required to train Deep Learning models, Transfer Learning has become a popular approach in ML, particularly in computer vision and natural language processing. Some datasets are large enough to train Deep Learning algorithms from scratch; for instance, the ImageNet dataset, one of the most famous computer vision datasets, contains 14 million images. In healthcare, however, datasets are often much smaller, so Transfer Learning is often needed to train Deep Learning methods.
Why is it important in healthcare?
In healthcare, the size of the available datasets is a fundamental issue. Many medical datasets contain only a few hundred samples. Such small datasets are the norm because:
- Medical datasets often correspond to rare pathologies. Hence, the number of images of a rare disease is limited by the number of patients with this disease;
- Medical datasets are often expensive to acquire. Therefore, most datasets are collected for a particular study or clinical trial. Furthermore, the acquisition and cleaning of this data is costly and uses a significant portion of a study’s funds;
- Medical datasets are often difficult to acquire when they are split across different hospitals. It is easier to collect datasets in a mono-centric fashion (i.e., from one hospital), but this limits their size. Collecting datasets in a multi-centric manner (i.e., from many hospitals) increases the dataset’s size. Unfortunately, this approach is more expensive and raises issues of data privacy and data bias.
Aside from the limited size of healthcare datasets available, there is an additional challenge to overcome in the context of ML – “The Curse of Dimensionality.”
In simple terms, “the curse of dimensionality” means that the larger each sample is, the more data you need. Take images: the more pixels an image has, the more detail it can store. A thumbnail, for example, holds less detail than the full-resolution version of the same image. While the extra detail seems like an advantage, it makes the task harder for a machine learning model, because the model has to learn how to analyze that detail, and learning from it requires more data.
Let’s return to the ImageNet dataset. Although it contains 14 million images, each image is at most a few megabytes in size. Individual healthcare samples are significantly larger, often by several orders of magnitude. This is the case, for example, for the publicly available Camelyon dataset, which consists of only 400 histology images, each of them about one gigabyte in size (see figure 1).
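A rough way to see the problem is to count how many numbers describe a single image. The sketch below is only an illustration: the image sizes and the dataset size of 1,000 images are hypothetical, and the samples-per-dimension ratio is an informal indicator rather than a formal statement of the curse of dimensionality.

```python
def input_dimensions(width, height, channels=3):
    """Number of raw values describing one image (one value per pixel per channel)."""
    return width * height * channels

def samples_per_dimension(n_samples, width, height, channels=3):
    """Informal indicator: how many training samples exist per input dimension."""
    return n_samples / input_dimensions(width, height, channels)

thumbnail = input_dimensions(64, 64)       # hypothetical 64 x 64 RGB thumbnail
full_size = input_dimensions(1024, 1024)   # hypothetical 1024 x 1024 RGB image

print(thumbnail)                # 12288
print(full_size)                # 3145728
print(full_size // thumbnail)   # 256

# With the same hypothetical dataset of 1,000 images, the larger images leave
# far fewer samples per input dimension for the model to learn from.
print(samples_per_dimension(1000, 64, 64))
print(samples_per_dimension(1000, 1024, 1024))
```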
A visual illustration of the specificities of computer vision datasets in healthcare
Transfer Learning in healthcare examples
We can use Transfer Learning in healthcare to apply ML to smaller datasets and overcome “the curse of dimensionality.” Researchers and data scientists can apply this technique in different ways, depending on the context, the datasets involved, or the use case. In this article, we describe two examples of using Transfer Learning in healthcare: first, to train machine learning models on smaller healthcare datasets, and second, to test the generalization capability or validate the performance of a model.
Example 1 – Train machine learning models on smaller healthcare datasets
The Camelyon dataset is a library of 400 histology slides (tissue samples). This dataset was made available through the Camelyon16 grand challenge on detecting tumors in pathology slides. However, with only 400 samples, each about 1 GB in size, it was impossible to apply a traditional ML algorithm to this dataset due to “the curse of dimensionality”. Therefore, we had to find an alternative approach.
A Camelyon Histology Image
Owkin applied Transfer Learning to its innovative CHOWDER algorithm. This algorithm can detect tumors from histology images using an approach called weakly supervised learning, in which only image-level labels are available during training. We did not train the CHOWDER algorithm entirely on the Camelyon dataset. Instead, we pre-trained the first part of the algorithm, which represents 99% of the learning burden, on the ImageNet dataset described above and in figure 2. Then, the remaining part of the algorithm was trained on the Camelyon dataset.
Using this technique, the final algorithm recognized tumors in the Camelyon dataset with excellent precision, thanks to its pre-training on images of cats, dogs and planes (see figure 3)!
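The general pattern used in Example 1, a frozen pre-trained first part followed by a small part trained on the target data, can be sketched in a few lines. This is not the CHOWDER implementation: the `FROZEN_WEIGHTS`, the 4-pixel "images", and the labels below are hypothetical toy values, and the trained part is a plain logistic regression.

```python
import math

# Stand-in for the first part of a network, with weights assumed to come from
# pre-training on a large source dataset (hypothetical values). Never updated.
FROZEN_WEIGHTS = [
    [0.5, 0.5, 0.5, 0.5],    # responds to overall brightness
    [1.0, -1.0, 1.0, -1.0],  # responds to an alternating pattern
]

def extract_features(x):
    """Frozen first part: one linear layer followed by ReLU."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in FROZEN_WEIGHTS]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_head(samples, labels, epochs=500, lr=0.5):
    """Train only the small head (logistic regression) on the frozen features."""
    features = [extract_features(x) for x in samples]  # computed once, then reused
    weights, bias = [0.0] * len(features[0]), 0.0
    for _ in range(epochs):
        for f, y in zip(features, labels):
            err = sigmoid(sum(w * fi for w, fi in zip(weights, f)) + bias) - y
            weights = [w - lr * err * fi for w, fi in zip(weights, f)]
            bias -= lr * err
    return weights, bias

def predict(x, weights, bias):
    f = extract_features(x)
    return 1 if sigmoid(sum(w * fi for w, fi in zip(weights, f)) + bias) >= 0.5 else 0

# Small hypothetical target dataset: six 4-pixel "images" with binary labels.
samples = [
    [1.0, 0.0, 1.0, 0.0], [0.9, 0.1, 0.8, 0.0], [0.0, 1.0, 0.0, 1.0],
    [0.5, 0.5, 0.5, 0.5], [0.2, 0.9, 0.1, 0.8], [1.0, 0.1, 0.9, 0.2],
]
labels = [1, 1, 0, 0, 0, 1]

head_weights, head_bias = train_head(samples, labels)
print([predict(x, head_weights, head_bias) for x in samples])
```

Only the two head weights and the bias are learned from the small dataset; the frozen part contributes its features unchanged. That is why this setup needs far less target data than training everything from scratch.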
Example 2 – Test the performance of a model
In healthcare, it is essential to test the quality of a model by validating its performance on a dataset separate from the training dataset. This validation step tests the model’s “generalization capacity.” For example, in the paper “Deep learning-based classification of mesothelioma improves prediction of patient outcome” (Courtiol, Maussion, et al., 2019), the CHOWDER algorithm was trained on one dataset, Mesobank (a French database of mesothelioma histology samples), and validated on The Cancer Genome Atlas (TCGA) dataset (a publicly available US database of cancer samples). TCGA is independent of Mesobank: the two datasets were collected in contexts different enough (different countries) that the performance measured on TCGA reflects the model’s generalization capacity. While this is not strictly Transfer Learning in the traditional sense, this kind of technique falls within the general scope of methods that use external datasets to enrich a model (see figure 4).
Example of the use of multiple datasets to improve performances
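The validation protocol from Example 2, fitting on one cohort and measuring performance on an independent one, can be sketched as follows. The cohorts here are hypothetical toy data with a single numeric feature, not Mesobank or TCGA, and the threshold classifier is a stand-in for a real model.

```python
def fit_threshold(cohort):
    """Fit a toy classifier: a threshold halfway between the two class means."""
    negatives = [x for x, y in cohort if y == 0]
    positives = [x for x, y in cohort if y == 1]
    return (sum(negatives) / len(negatives) + sum(positives) / len(positives)) / 2

def accuracy(threshold, cohort):
    """Fraction of samples whose predicted class (x >= threshold) matches the label."""
    correct = sum((x >= threshold) == bool(y) for x, y in cohort)
    return correct / len(cohort)

# Hypothetical cohorts as (feature, label) pairs. The external cohort is collected
# elsewhere, so its feature values are slightly shifted.
training_cohort = [(0.2, 0), (0.3, 0), (0.7, 1), (0.9, 1)]
external_cohort = [(0.35, 0), (0.5, 0), (0.6, 0), (0.8, 1), (1.0, 1)]

threshold = fit_threshold(training_cohort)          # fitted on the training cohort only
internal_acc = accuracy(threshold, training_cohort)
external_acc = accuracy(threshold, external_cohort)  # never seen during fitting

print(internal_acc, external_acc)
```

The score on the external cohort is the more honest estimate of how the model would behave on data from a new site, because none of that data influenced the fitted threshold.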
Transfer Learning is a fundamental part of the development of ML in healthcare. Its use makes it possible for researchers to tackle “the curse of dimensionality”. It is Owkin’s mission to leverage medical data as a whole. Through our unique platform, we aim to fuel collaborative research around the world while preserving patient privacy and data security. Our proprietary federated learning infrastructure enables researchers to train ML models on distributed data at scale, which means we can conduct research across multiple medical institutions without centralizing the data. Find out more about Federated Learning on our website. With this unique model sharing technology, Owkin is taking Transfer Learning in healthcare to the next level.