Multimodal data
The word modality simply means ‘the way, or mode, in which something happens, is experienced, or is captured.’ Therefore, multimodal means ‘experiencing or capturing something in multiple different ways.’
Humans experience the world in a way that is inherently multimodal. This is because we use all five senses (sight, sound, smell, touch, and taste) to gather information about a situation and interpret it. During a conversation with a friend, you will look at their appearance and body language, listen to their words and the tone of their voice, notice whether they are wearing perfume, and perhaps hug them or make some other form of physical contact.
All that information is taken in as ‘data’ by the brain and used to help you reach a conclusion about how that person is feeling and decide how you should react. If we relied on only one mode of data (i.e., only one sense), we would likely reach less accurate conclusions. For example, everybody has experienced a situation where a friend or colleague has misinterpreted something you wrote in a text or email because they could not hear the tone of voice that would normally provide a crucial clue about how your words should be interpreted.
The same logic applies in the context of AI: multimodal data means data captured in multiple different formats and joined together to reach a conclusion. The difference is that while the human brain might be taking in data in the form of sights, smells, and sounds, AI algorithms are more likely to be taking in data in the form of text, images, audio files, or videos.
For instance, every time you see a picture with a caption on social media such as Instagram, you are looking at multimodal data: image + text.
In the context of healthcare, multimodal data might include:
- data collected from a person’s electronic health record (EHR)
- an X-ray image
- the radiologist’s written description of the X-ray
- blood test results
Human clinicians will usually use all this data to reach a diagnosis. Relying on only one data input (for example, only the blood test results) might result in an inaccurate diagnosis. So AI algorithms designed for healthcare should also be multimodal, combining data from multiple sources in multiple formats to produce more accurate results.
Multimodal data, then, is simply data captured in multiple different formats, e.g., image, text, EHR, video, audio, genetic, or self-reported (such as questionnaires). The UK Biobank, for example, contains patient-level genomic data, imaging data, and data from EHRs and questionnaires.
Multimodal machine learning is a sub-type of machine learning intended to develop AI models that combine the insights from multiple types of data (i.e., multimodal data). It’s thought that multimodal models will produce predictions and outcomes that are more accurate and more robust. This is because no single type of healthcare data can provide all the information needed to make an accurate diagnosis, prediction, or treatment recommendation.
An MRI scan, for example, might indicate the presence of a tumour, but it will not reveal what kind of tumour it is (a biopsy and subsequent pathology data would be required for this), which medications the patient is already taking, or how old the patient is or what their ethnicity is. Without this additional information, it would not be possible to determine the exact type of cancer, and therefore not possible to predict a prognosis or identify the most appropriate form of treatment.
Data fusion
At the core of any multimodal machine learning model development project is a process of data fusion. Data fusion is the mechanism by which the different types of data (the different modalities) are linked (or fused) using machine learning or deep learning. There is no single agreed-upon technique for completing the process of data fusion, and the options available range from the relatively simple (such as concatenation or weighted sum) to the more complex (such as attention-based recurrent neural networks or graph neural networks).
The choice of fusion strategy is of paramount importance and often dictates whether or not the multimodal project will succeed. The choice must be made carefully and will likely depend on the exact modalities being dealt with, the intended application of the model, and where in the model development pipeline the fusion process is taking place: early, intermediate, or late.
- Early fusion, also known as feature-level or data-level fusion, joins two or more input modalities into a single feature vector, which is then used as the input to a single machine learning model. The techniques used here are relatively simple, such as concatenation or pooling.
- Intermediate fusion, also known as joint fusion, extracts feature representations from the different modalities in a stepwise process: the representations learned in different layers of a neural network are joined together to form a single input to the final model.
- Late fusion, also known as decision-level fusion, involves training a separate model for each modality and then combining the outputs of those models into a single prediction. The combination step can be as simple as averaging or majority voting, or can use learned ensemble methods (the sketch below contrasts the three strategies on toy data).
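As a rough illustration of where the joining happens in each strategy, here is a minimal PyTorch sketch on toy data. The two modalities (an imaging feature vector and a tabular EHR vector), the layer sizes, and the equal late-fusion weights are assumptions made purely for this example, and none of the models are trained.

```python
# A minimal sketch contrasting the three fusion strategies on toy,
# randomly generated data. Modality names, feature sizes, architectures,
# and the 50/50 late-fusion weights are illustrative assumptions only.
import torch
import torch.nn as nn

# Toy batch: 8 patients, an "imaging" modality with 32 features and a
# tabular "EHR" modality with 10 features.
imaging = torch.randn(8, 32)
ehr = torch.randn(8, 10)

# --- Early (feature-level) fusion: concatenate the raw feature vectors
#     and feed the single combined vector to one model. ---
early_input = torch.cat([imaging, ehr], dim=1)  # shape (8, 42)
early_model = nn.Sequential(nn.Linear(42, 16), nn.ReLU(), nn.Linear(16, 1))
early_pred = torch.sigmoid(early_model(early_input))

# --- Intermediate (joint) fusion: learn a representation per modality
#     inside the network, then join those learned representations before
#     the final prediction layer. ---
class JointFusionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.img_encoder = nn.Sequential(nn.Linear(32, 16), nn.ReLU())
        self.ehr_encoder = nn.Sequential(nn.Linear(10, 16), nn.ReLU())
        self.head = nn.Linear(16 + 16, 1)  # acts on the joined representations

    def forward(self, img, tab):
        joint = torch.cat([self.img_encoder(img), self.ehr_encoder(tab)], dim=1)
        return torch.sigmoid(self.head(joint))

intermediate_pred = JointFusionNet()(imaging, ehr)

# --- Late (decision-level) fusion: one model per modality, then combine
#     their output predictions (here, a simple weighted average). ---
img_model = nn.Linear(32, 1)
ehr_model = nn.Linear(10, 1)
late_pred = 0.5 * torch.sigmoid(img_model(imaging)) + 0.5 * torch.sigmoid(ehr_model(ehr))
```

In a real project the imaging encoder would typically be a convolutional network or transformer and each model would be trained on labelled data; the sketch only shows where in the pipeline the modalities are joined.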
The results of multimodal learning can be spectacular. However, it also poses several challenges.
The challenges of multimodal learning
- Multimodal models are more complex, more computationally expensive, and less interpretable than their unimodal counterparts.
- It can be difficult to reduce the complexity of each modality enough to harmonise them without losing the relevant detail contained within each.
- The more information that is collected and collated about an individual, the easier it is to re-identify them, so multimodal models present a particularly acute privacy challenge.

Fortunately, none of these challenges is insurmountable; federated learning, for example, can help with the privacy challenge. What matters is being aware of them so that the associated risks can be proactively mitigated.
An Owkin example
In August 2020, Owkin published a paper in Nature Communications describing HE2RNA, a multimodal model capable of predicting RNA-Seq expression profiles (i.e., molecular features) from histology slides without the slides needing to be annotated. The paper describes how this can be used to help identify specific tumour types.
Further reading
- Acosta, Julián N., Guido J. Falcone, Pranav Rajpurkar, and Eric J. Topol. 2022. ‘Multimodal Biomedical AI’. Nature Medicine 28(9): 1773–84.
- Amal, Saeed et al. 2022. ‘Use of Multi-Modal Data and Machine Learning to Improve Cardiovascular Disease Care’. Frontiers in Cardiovascular Medicine 9: 840262.
- Huang, Shih-Cheng et al. 2020. ‘Fusion of Medical Imaging and Electronic Health Records Using Deep Learning: A Systematic Review and Implementation Guidelines’. npj Digital Medicine 3(1): 136.
- Kline, Adrienne et al. 2022. ‘Multimodal Machine Learning in Precision Health: A Scoping Review’. npj Digital Medicine 5(1): 171.
- Lin, Yi-Ming et al. 2023. ‘Federated Learning on Multimodal Data: A Comprehensive Survey’. Machine Intelligence Research 20(4): 539–53.