In this post, we describe Virtual Staining, a novel machine learning technique developed at Owkin, that allows researchers to generate a virtual immunostained histopathology slide from an H&E stained specimen slide.
The advent of generative models
We train models to help us understand the world around us: most models try to simplify complex data so that we can make sense of it more easily. They take high dimensional inputs, such as images, tables, text, and compute an output in a low dimensional space such as a binary category, a value between 0 and 100, or an area on an image.
However, with recent advances in machine learning and access to more and more computational resources, we are now able to train models that can provide stable, high dimensional outputs. We have taken the step from descriptive models to generative models that can reliably generate sensible music, text, or images. The applications are endless and are already impacting the real world.
At Owkin, we are exploring the opportunities of this technology in the field of medical research. We have developed Artificial Intelligence algorithms to analyze and interpret multimodal medical data: pathology images, radiology images, genetic data, lab analysis, and clinical outcomes.
Biopsies are the reference procedure to diagnose cancer. Once extracted, the specimens are cut into thin slices of tissue, which are usually stained with hematoxylin and eosin (H&E) to highlight cells in the characteristic purple-bluish color. These are the standard digital pathology slides, studied with a microscope by anatomopathologists to diagnose and research cancer. To reveal a relevant biomarker, pathologists stain the slide with a specific immunohistochemical (IHC) stain. This IHC stain binds only to the specific marker and visually highlights it.
This blog post details how we developed a technology, called Virtual Staining, which allows us to compute an IHC slide from a standard H&E slide in silico, using a generative model.
Histology and Project Background
In histology, the H&E stain is the gold standard for medical diagnosis. The hematoxylin stains cell nuclei in blue, while eosin stains the extracellular matrix and cytoplasm in dark pink. Pathologists often combine this stain with IHC stainings to selectively identify relevant antigens in the cells. This enables pathologists to visualize and understand the distribution of specific biomarkers in the tissues to inform diagnosis and tumor grading.
One common diagnostic biomarker is CD3, a co-receptor for T-cell activation. The CD3 immunostain can detect normal and neoplastic T-cells as well as immune infiltration of the tumor microenvironment. This stain has important prognostic value, as T-cell activation is linked to overall immune response. However, immunostainings are not routinely performed in research cohorts and are difficult to do in retrospective studies.
To overcome this workflow friction, we decided to explore the possibility of using a generative model to compute the supplemental CD3 staining from the standard H&E staining. The idea was to leverage existing work on generative models, that notably allow researchers to perform image to image translation, and to apply it to the complex digitized pathology data structures. The resulting data enrichment benefits were promising enough for us to dedicate significant time and resources to this research project.
The first challenge was to acquire relevant data to train a Virtual Staining model. We needed a dataset of paired images: for the algorithm to work correctly, each source (H&E) image must be perfectly aligned with its target (CD3) image.
IHC staining is usually done on a consecutive section of the same tissue block as the H&E section. However, this procedure would not work in our case, as we needed to work on the exact same tissue section. If we took consecutive sections of the tissue, we would have a difference of 3 or 4 µm, which means the exact cells in the whole-slide image (WSI) would change and we would not be able to train our Virtual Staining model.
To overcome this issue, we developed a special protocol in collaboration with a pathologist in a major French hospital:
- Cut a section of tissue, stain it in H&E and scan it. This is the normal protocol to digitize WSI, and it ensures that the H&E WSI has the highest quality
- Remove the upper glass slide, and wash out the H&E stain
- Stain the tissue again, this time with an IHC marker (in our case, CD3 stain), and scan it
This protocol requires good manipulation skills to avoid damaging the tissue, but after careful execution, we obtain two digitized WSIs of the same tissue section: one stained in H&E and one stained in CD3.
We then need to align the two WSIs so that they match cell to cell. This is a difficult task: imagine aligning two satellite images of a city taken at different times, in different weather conditions. We first align a lower-resolution version of the slides using a rigid transformation (translation and rotation). We then apply the same transformation to the full-resolution WSI and create a new, aligned WSI. This process is called slide registration. After it is completed, we have two WSIs: the H&E WSI and the registered CD3 WSI.
We can see in the image below that the two WSIs are now aligned down to the cell level. There are still some very small variations visible. One additional refinement step is to align again at the local level, to obtain a perfect match between patches of each slide.
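The rigid registration step can be illustrated with a minimal sketch. The post does not say how the transformation was estimated, so this example assumes that a handful of matched landmark points (hypothetical inputs) are available on the low-resolution versions of both slides. It fits a least-squares rotation and translation, then rescales the translation so that the same transform can be applied at full resolution. All function names are illustrative.

```python
import math

def estimate_rigid_transform(src_points, dst_points):
    """Least-squares 2D rotation angle and translation mapping src onto dst.

    Assumes the two lists contain matched (x, y) landmarks in the same order.
    """
    n = len(src_points)
    src_cx = sum(x for x, _ in src_points) / n
    src_cy = sum(y for _, y in src_points) / n
    dst_cx = sum(x for x, _ in dst_points) / n
    dst_cy = sum(y for _, y in dst_points) / n

    dot = cross = 0.0
    for (sx, sy), (dx, dy) in zip(src_points, dst_points):
        ax, ay = sx - src_cx, sy - src_cy      # centered source point
        bx, by = dx - dst_cx, dy - dst_cy      # centered target point
        dot += ax * bx + ay * by
        cross += ax * by - ay * bx

    theta = math.atan2(cross, dot)             # optimal rotation angle
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    tx = dst_cx - (cos_t * src_cx - sin_t * src_cy)
    ty = dst_cy - (sin_t * src_cx + cos_t * src_cy)
    return theta, (tx, ty)

def scale_to_full_resolution(theta, translation, downsample):
    """Rotation is scale-invariant; only the translation is multiplied."""
    tx, ty = translation
    return theta, (tx * downsample, ty * downsample)

def apply_rigid_transform(point, theta, translation):
    """Rotate a point by theta around the origin, then translate it."""
    x, y = point
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return (cos_t * x - sin_t * y + translation[0],
            sin_t * x + cos_t * y + translation[1])
```

Because a rotation does not depend on scale, a transform fitted on a thumbnail that is `downsample` times smaller carries over to the full-resolution slide by multiplying only the translation by `downsample`.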
Pix2pix Framework Generator and Discriminator
Our Virtual Staining model had to learn to translate images from one domain (H&E) to another domain (CD3 IHC). This type of task is called supervised image to image translation and is an active area of research. The state-of-the-art model for this is called pix2pix and was introduced in “Image-to-Image Translation with Conditional Adversarial Networks”¹. Typical applications include colorizing black & white photos or turning sketches into pictures.
Pix2pix is based on the generative adversarial network (GAN) framework. This means that there are two networks competing against each other during training. The discriminator is trained to discriminate between “real” and “fake” target images — in our case, predict if the CD3 staining is “real” or “generated”. The generator is trained to generate CD3 images that look real and can trick the discriminator. As training goes on, the generator gets better and better at tricking the discriminator, and the discriminator gets better and better at identifying fake samples, forcing the generator to improve. At the end of the training, if everything goes well, the fake and the real samples are indistinguishable: the model has learned to generate a convincing CD3 staining.
In practice, the adversarial game between generator and discriminator makes the training of GANs unstable, and requires tweaking and parameter exploration. Moreover, the model needs enough training samples to successfully generate images. We found that using the LSGAN² variant, which uses a least squares loss function for the discriminator, improved results and stability.
We also use a simple L1 loss between the fake generated image and the real image, as an additional supervised signal. This loss alone is not enough to train a good generator because it smooths out the output and produces blurry, unrealistic images.
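To make the two training signals concrete, here is a minimal sketch of a least squares adversarial loss combined with an L1 term, written on plain Python lists rather than tensors. The target labels (1 for real, 0 for fake), the 0.5 factors, and the `l1_weight` value of 100 (the default reported in the pix2pix paper) are assumptions; the post does not state the settings that were actually used.

```python
def mean(values):
    return sum(values) / len(values)

def lsgan_discriminator_loss(scores_real, scores_fake):
    """Push discriminator scores toward 1 on real images and 0 on fake ones."""
    real_term = mean([(s - 1.0) ** 2 for s in scores_real])
    fake_term = mean([s ** 2 for s in scores_fake])
    return 0.5 * (real_term + fake_term)

def lsgan_generator_loss(scores_fake):
    """The generator wants the discriminator to score its images as real (1)."""
    return 0.5 * mean([(s - 1.0) ** 2 for s in scores_fake])

def l1_loss(generated, target):
    """Mean absolute pixel difference between generated and real images."""
    return mean([abs(g - t) for g, t in zip(generated, target)])

def generator_objective(scores_fake, generated, target, l1_weight=100.0):
    """Adversarial term plus weighted L1 term (l1_weight is an assumed value)."""
    adversarial = lsgan_generator_loss(scores_fake)
    return adversarial + l1_weight * l1_loss(generated, target)
```

The L1 term anchors the output to the real staining pixel by pixel, while the adversarial term penalizes outputs that the discriminator can tell apart from real CD3 images.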
Applying Pix2pix to Histopathology Images
A digitized WSI is a very large image: the files are often more than 100,000 pixels wide and can weigh up to multiple gigabytes. Existing image generation algorithms work on images up to 1,000 pixels wide, so we needed to adapt them. We divided the WSI into a grid of tiles — small patches of the image that are 512 pixels wide — and trained the model on them. In a typical slide, there can be around 25,000 such tiles. We also used a simple matter detection algorithm to avoid extracting empty tiles.
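The tiling step can be sketched as follows. The slide dimensions in the usage note are hypothetical, and the matter detection shown here is a simple brightness threshold on grayscale values, which is an assumption: the post only says that a simple matter detection algorithm was used.

```python
def tile_grid(width, height, tile_size=512):
    """Top-left corners of every full tile in a regular grid over the slide."""
    return [
        (x, y)
        for y in range(0, height - tile_size + 1, tile_size)
        for x in range(0, width - tile_size + 1, tile_size)
    ]

def is_tissue(tile_pixels, background_threshold=220, min_fraction=0.1):
    """Keep a tile if enough of its pixels are darker than the slide background.

    tile_pixels is a 2D list of grayscale values in [0, 255]; both thresholds
    are illustrative values, not the ones used in the project.
    """
    total = 0
    dark = 0
    for row in tile_pixels:
        for value in row:
            total += 1
            if value < background_threshold:
                dark += 1
    return total > 0 and dark / total >= min_fraction
```

For a hypothetical slide of 100,000 by 66,000 pixels, `tile_grid(100_000, 66_000)` yields 195 columns by 128 rows, or 24,960 tiles, which is the order of magnitude quoted above before empty tiles are discarded.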
The model is then trained on these aligned pairs of tiles. To increase the variety of images seen during training, we add data augmentation: random flips (applied to the two images) and color augmentation. We also randomly sample the location of the input tiles to avoid having a fixed grid of tiles.
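The key constraint in this augmentation is that every geometric operation must be applied identically to the H&E tile and to the CD3 tile, otherwise the pair is no longer aligned. The sketch below illustrates this with images stored as nested lists; the function names are hypothetical, and color augmentation is left out because the post does not describe how it was applied.

```python
import random

def random_paired_crop(he_image, cd3_image, tile_size, rng):
    """Crop the same randomly located window from both aligned images."""
    height, width = len(he_image), len(he_image[0])
    top = rng.randint(0, height - tile_size)
    left = rng.randint(0, width - tile_size)

    def crop(image):
        return [row[left:left + tile_size] for row in image[top:top + tile_size]]

    return crop(he_image), crop(cd3_image)

def random_paired_flip(he_tile, cd3_tile, rng):
    """Apply the same random horizontal and vertical flips to both tiles."""
    if rng.random() < 0.5:  # horizontal flip
        he_tile = [row[::-1] for row in he_tile]
        cd3_tile = [row[::-1] for row in cd3_tile]
    if rng.random() < 0.5:  # vertical flip
        he_tile = he_tile[::-1]
        cd3_tile = cd3_tile[::-1]
    return he_tile, cd3_tile

def sample_training_pair(he_image, cd3_image, tile_size, rng):
    """Random location first, then random flips, always on both images."""
    he_tile, cd3_tile = random_paired_crop(he_image, cd3_image, tile_size, rng)
    return random_paired_flip(he_tile, cd3_tile, rng)
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which helps when debugging misaligned pairs.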
Results and Further Experiments
Visually, the results are very promising. It is difficult for a non-expert to identify which of the CD3 images is real, and which is generated by the model.
We display below generated CD3 images at two resolutions, to illustrate how the pattern of CD3 cells overlaps between the real CD3 staining and the virtual one.
At maximum resolution (microns per pixel = 0.25), which corresponds to the resolution on which the model was trained, we can see that most of the CD3 cells are detected as such by the generator. The model is not perfect though and there are some minor inconsistencies between the real CD3 staining and the predicted one.
When we zoom out by 10x, at a resolution of 2.5 microns per pixel, we can see that the overall pattern of CD3 cells matches closely between the real and generated CD3 stainings.
We now need to create appropriate metrics to quantitatively evaluate the performance of our Virtual Staining model. To measure how well the generator works at the local level, our metric needs to identify how many CD3 cells were detected. For this, we want to compute the precision and recall of the CD3 cell detection. The first step is to run a positive cell detection algorithm on both true and generated CD3 images. We then count the true CD3 cells that match generated CD3 cells, which gives us the number of true positives (TP), false positives (FP), and false negatives (FN). From these counts, we can compute the precision, TP / (TP + FP), and the recall, TP / (TP + FN). On the CD3 test images, we obtain a precision of 72% and a recall of 63% for CD3 cell detection. This is a very demanding metric because the Virtual Staining model needs to find every CD3 cell at exactly the right location, and also because the metric depends on the cell detection algorithm working correctly.
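The cell-level metric can be sketched as follows. The post does not specify how real and generated cells are matched, so this example assumes a greedy one-to-one matching of detected cell centers within a maximum distance; the function names and the `max_distance` parameter are hypothetical.

```python
import math

def match_detections(real_cells, generated_cells, max_distance):
    """Greedily pair each generated cell with the nearest unmatched real cell.

    Cells are (x, y) centers. Returns the counts (TP, FP, FN).
    """
    unmatched_real = list(real_cells)
    true_positives = 0
    for gx, gy in generated_cells:
        best_index = None
        best_distance = max_distance
        for index, (rx, ry) in enumerate(unmatched_real):
            distance = math.hypot(gx - rx, gy - ry)
            if distance <= best_distance:
                best_index, best_distance = index, distance
        if best_index is not None:
            unmatched_real.pop(best_index)  # each real cell is matched once
            true_positives += 1
    false_positives = len(generated_cells) - true_positives
    false_negatives = len(unmatched_real)
    return true_positives, false_positives, false_negatives

def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

A greedy matching is simple but order-dependent; an optimal assignment would be a stricter alternative where cells are densely packed.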
In practical applications, we don’t need to have a staining precision at the cell level, and pathologists mainly want to identify regions of the whole-slide image containing T-cells. To measure how well the Virtual Staining works at a global level, we compute the correlation of CD3 intensity between the real and predicted WSIs. For each tile in the real and predicted WSIs we compute the average CD3 intensity (brown intensity in the image). We then compute the correlation between the list of real CD3 intensities and the list of predicted CD3 intensities. On the CD3 test images, we obtain a Pearson correlation coefficient of 93.9% and a Spearman correlation coefficient of 93.6%. This indicates that the Virtual Staining model can be confidently used to quantify local variations in CD3 expression across the slide.
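For reference, both correlation coefficients can be computed from the two lists of per-tile intensities with a few lines of standard-library Python. This is a minimal sketch with illustrative function names, not the evaluation code used in the project; Spearman's coefficient is computed as Pearson's coefficient on the ranks, with tied values sharing their average rank.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)

def ranks(values):
    """1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(ranks(xs), ranks(ys))
```

Here `xs` would hold the average CD3 intensity of each tile in the real WSI and `ys` the intensity of the corresponding tile in the predicted WSI. Pearson measures linear agreement, while Spearman only checks that tiles are ordered the same way, so reporting both guards against a monotonic but non-linear relationship.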
This project on CD3 Virtual Staining was a successful proof of concept. The high correlation between real and predicted localization of CD3 indicates that we could use Virtual Staining as a way to unveil rich information contained in H&E slides. This new cellular information can be used as data enrichment to improve our models in many research projects, as well as to develop new interpretability methodologies.
The numerous potential applications of this Virtual Staining technique led us to launch the development of a technology platform specifically focused on Virtual Staining and embedded in our software, Owkin Studio. In other words, the goal is not only to generate a single CD3 marker but to create an environment to quickly build, test, validate, store, and use all virtual stains that are relevant to medical research.
We are actively developing the Virtual Staining platform, now available through our software Owkin Studio. Visit Owkin Virtual Staining to learn more about our automated immunostains.
1. Isola, Phillip, et al. “Image-to-image translation with conditional adversarial networks.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.
2. Mao, Xudong, et al. “Least squares generative adversarial networks.” Proceedings of the IEEE International Conference on Computer Vision. 2017.