Integrating multimodal data to meet clinical challenges
Why use a multimodal approach?
One of the remaining challenges in applying machine learning (ML) methods to any healthcare problem is finding a comprehensive, patient-specific approach. Detecting a lesion or grading a specific cell is relatively easy for a well-designed machine learning model. Predicting a patient’s response to treatment or their prognosis, however, is a much bigger challenge: it requires a thorough understanding of the patient as a whole. Analyzing a combination of data in various forms (integrating multimodal data) can significantly help solve these challenging clinical questions. To understand the benefits of thoroughly analyzing multiple modalities, read our recent blog on the subject.
Multimodalities in the clinical world
Of course, outside of ML, using multiple modalities is fundamental. For example, in French cancer care, the significant decisions about a patient’s treatment are always made during a multidisciplinary consultation meeting (Réunion de Concertation Pluridisciplinaire, RCP). This meeting gathers physicians from several specialties: radiologists, pathologists, and oncologists. This way, physicians gain the most comprehensive understanding of the patient, their background, and the disease’s evolution.
Each modality teaches us very different yet crucial pieces of information:
- Radiology data is excellent for judging the macroscopic scale of a tumor and seeing its evolution over time.
- Histology slides help clinicians understand the disease’s structure at the cellular scale. Additionally, the lymphocyte infiltration observed on a histology slide is an excellent indicator of how the organism is resisting the disease.
- Genomic data gives us patient information that is not accessible through imaging modalities.
- Clinical data (medical history, age, sex, medical treatments, etc.) helps clinicians fully understand each patient’s specifics and the biological differences (heterogeneity) involved in their disease’s evolution.
Multimodal data integration in ML
In ML, however, studies that involve multiple modalities are scarce, almost nonexistent.
But why?
The first reason is that multimodal datasets are hard to find and rarely accessible without deep connections to a hospital or research center. Secondly, the standard papers and challenges (Camelyon16, DeepLesion) focus on a single modality. As a result, models are currently developed and optimized for one modality at a time.
This single-modality focus restricts today’s ML models to specific tasks, such as detecting certain types of cells or detecting and classifying lesions. At Owkin, we can access multimodal datasets across several projects. This access allows us to adopt a more comprehensive multimodal data approach and tackle more challenging, global tasks such as predicting a patient’s response to treatment.
There are many machine learning techniques to learn from multiple modalities; two of them are presented below.
Method 1. Stacking modalities for multimodal data integration
How does it work?
In the standard scenario, we have multiple modalities (for example, images and genomic data) and a label that we want to predict. This label might be a binary target, survival times, or any structured outcome. In neural network terms, this corresponds to a two meta-layer architecture that encompasses multiple modalities. The first meta-layer is composed of separate networks, one per modality. The second meta-layer is a larger network that combines their outputs (the individual modality representations) and predicts the label of interest. Concretely, such an architecture could include the following components (a minimal code sketch follows the list):
- A convolutional network to analyze histology images;
- A dense network to analyze the tabular data;
- A radiomic extractor to extract information from radiology images;
- All these networks will project to a pivot layer; and
- A dense network to connect the pivot layer to the final output layer attached to the target.
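A minimal PyTorch sketch of this idea is shown below. It assumes one histology image (or tile), a clinical feature vector, and pre-extracted radiomics features per patient; the layer sizes, class name, and binary target are illustrative choices, not Owkin’s actual architecture.

```python
# Minimal sketch of a "stacked modalities" network (illustrative sizes only).
import torch
import torch.nn as nn

class StackedMultimodalNet(nn.Module):
    def __init__(self, n_clinical: int, n_radiomics: int, pivot_dim: int = 64):
        super().__init__()
        # Convolutional encoder for histology images.
        self.histology_encoder = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, pivot_dim),
        )
        # Dense encoder for tabular clinical data.
        self.clinical_encoder = nn.Sequential(
            nn.Linear(n_clinical, 32), nn.ReLU(), nn.Linear(32, pivot_dim),
        )
        # Radiomics features are assumed pre-extracted; a linear layer projects them.
        self.radiomics_encoder = nn.Linear(n_radiomics, pivot_dim)
        # Dense head connecting the pivot layer to the target (binary label here).
        self.head = nn.Sequential(
            nn.Linear(3 * pivot_dim, 64), nn.ReLU(), nn.Linear(64, 1),
        )

    def forward(self, histology, clinical, radiomics):
        # Each modality is projected to the shared pivot layer, then concatenated.
        pivot = torch.cat([
            self.histology_encoder(histology),
            self.clinical_encoder(clinical),
            self.radiomics_encoder(radiomics),
        ], dim=1)
        return self.head(pivot)
```

Because all encoders and the prediction head live in a single module, one loss on the target trains the whole stack end to end.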
Why would I choose this strategy?
This strategy is particularly powerful because it leverages the interactions between the different modalities as much as possible. By training the whole architecture end-to-end, we take full advantage of all the information contained in each sample.
Method 2. Representation learning: an indirect way to encompass multiple modalities within an ML model
How does it work?
Firstly, each modality is assessed separately (steps A, B, and C in the image below). Only during the last step (D in the image below) does the final predictive model incorporate the individual results from each modality. For example, one would run a model on the histology slides, another on genomics, and another on radiology. The results only merge through a final predictive model at the end.
In this two-step process, every modality is tackled first and incorporated afterwards, making this an indirect way to handle multimodal integration compared to stacking modalities as described above, where the model includes all modalities from the start. The final model uses the representation and result obtained from each of the individual modality models.
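A minimal sketch of this two-step process is given below, assuming the unimodal encoders have already been trained and frozen (steps A to C); the encoder names and the choice of a logistic regression as the final predictive model (step D) are illustrative assumptions.

```python
# Minimal sketch of the two-step "representation learning" approach.
# Unimodal encoders are assumed to be trained separately beforehand;
# only the final predictor sees all modalities together.
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def extract_embeddings(encoder, inputs):
    """Run a frozen unimodal encoder and return its representations as numpy."""
    encoder.eval()
    return encoder(inputs).cpu().numpy()

def train_final_model(unimodal_embeddings, labels):
    """Final step: concatenate per-modality representations and fit the predictor."""
    features = np.concatenate(unimodal_embeddings, axis=1)
    final_model = LogisticRegression(max_iter=1000)
    final_model.fit(features, labels)
    return final_model

# Usage (hypothetical encoders and inputs):
# embeddings = [
#     extract_embeddings(histology_encoder, tile_features),
#     extract_embeddings(genomics_encoder, rna_counts),
# ]
# final_model = train_final_model(embeddings, labels)
```

In practice, the per-modality embeddings must be aligned on the same patients before concatenation, and patients missing a modality can be handled at this final stage (for example, by imputation) without retraining the unimodal models.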
Why would I choose this strategy?
This strategy is particularly interesting when you cannot access all modalities for every patient, a very common situation in current datasets. Indeed, you can leverage all the available data in your unimodal models, which improves the global performance. Furthermore, this method makes it easy to evaluate the contribution of each modality and the improvement brought by multimodality.
How to use multimodality for interpretability: HE2RNA
Although the solutions described above are the most obvious ones, we believe that researchers can use a combination of multiple modalities in other ways, which are often much more interesting.
Owkin’s recent paper published in Nature Communications is another example of how to integrate multiple modalities. It proposes a novel approach to predicting the RNA-Seq expression of tumors by combining various data modalities, and shows that our deep learning model, HE2RNA, can be trained to systematically predict RNA-Seq profiles from whole-slide images alone, without the need for expert annotation. Furthermore, the HE2RNA model is interpretable by design: unlike traditional black-box models, researchers can see exactly why the model has drawn its conclusions. This interpretability opens up a host of new opportunities. In particular, it provides a virtual spatialization of gene expression, allowing you to see exactly where the gene expression occurs on the histology image. This way, you could, for example, understand which genes selected as relevant by another ML method (as presented below) are highly expressed in certain parts of the images.
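The exact HE2RNA architecture is described in the paper; the sketch below only illustrates, under simplifying assumptions, the general principle that makes virtual spatialization possible: a shared regressor predicts gene expression for every tile of the slide, and the tile-level predictions are aggregated to match the slide-level RNA-Seq target. The class name, dimensions, and mean aggregation are illustrative, not the published model.

```python
# Illustrative sketch of tile-level prediction with aggregation (not the published
# HE2RNA architecture): predicting gene expression per tile is what allows the
# predictions to be spatialized back onto the whole-slide image.
import torch
import torch.nn as nn

class TileToExpression(nn.Module):
    def __init__(self, tile_feature_dim: int, n_genes: int):
        super().__init__()
        # Shared regressor applied to every tile's pre-extracted feature vector.
        self.tile_regressor = nn.Sequential(
            nn.Linear(tile_feature_dim, 256), nn.ReLU(), nn.Linear(256, n_genes),
        )

    def forward(self, tile_features):
        # tile_features: (n_tiles, tile_feature_dim) for one whole-slide image.
        per_tile = self.tile_regressor(tile_features)   # per-tile predictions
        slide_level = per_tile.mean(dim=0)              # aggregate to the slide level
        return slide_level, per_tile                    # per_tile enables spatialization
```

Because the supervision is at the slide level while predictions exist per tile, the per-tile outputs can be mapped back onto the whole-slide image to visualize where a given gene’s predicted expression is highest.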
Therefore, in the HE2RNA case, multiple modalities are integrated to make the model’s predictions interpretable.
Multimodal data integration conclusion
At Owkin, our Lab team has unique expertise at the intersection of data science, statistics, and biology. The Owkin Lab is divided into several teams, each dedicated to a single data modality, including clinical data, radiology, pathology, and genomics. With 14 PhDs, experts in machine learning, engineering, biology, biostatistics, and translational research, and in-house medical doctors, the lab is uniquely positioned at the forefront of multimodal ML research in healthcare.