July 18, 2024
10 mins

Blog 2: The importance of data in building an AI diagnostic

Artificial intelligence relies heavily on algorithms – sets of computer instructions for accomplishing a task. But the power actually comes from the data. Here's why:

The type of AI we use for analyzing pathology images is called machine learning. Just as pathologists study and learn from reviewing many tissue samples, machine learning finds patterns in large, detailed images and learns to associate the patterns with different labels. A label can be used to group patients sharing a common characteristic, like a particular biomarker, or it could group a set of pixels assigned to the same type of object, like tumor tissue.

A machine learning model groups patterns associated with different histological features.

To learn these associations, a model must be trained on data in the form of whole slide images (WSIs). WSIs are also needed for validating the model. Subsequent blogs will cover those topics. In this one, we'll focus on where the data comes from and what's needed to prepare it for machine learning.

Sourcing training data

The greatest challenges in developing a machine learning solution are typically not in building a model but in working with the data. The data must be gathered, stored, cleaned, and annotated before it can be used to train a model. Through each of these steps, data quantity, quality, diversity, and security must be considered.

Let’s draw on Owkin’s MSIntuit CRC diagnostic, a prescreening model for microsatellite instability (MSI), as an example. The details for other diagnostics will vary somewhat, but the general steps and considerations are the same.

Data is commonly sourced from university and hospital partnerships or, sometimes, biobanks. To train MSIntuit CRC, 859 WSIs from 434 patients in the Colon Adenocarcinoma project of The Cancer Genome Atlas (TCGA) were used. These slides were collected from 24 different medical centers in the US. Roughly half were formalin-fixed paraffin-embedded (FFPE), and the other half were snap-frozen. Additionally, 600 slides from Medipath pathology laboratories in France were used to validate the model.

Generally, the more training data the better. But it’s not quite that simple because the quality and diversity also matter.

Slide preparation and scanning

Once slides are digitized, they must be prepared for use by a machine learning model. At 20x magnification, a WSI can measure up to 100,000 x 100,000 pixels. Just as a pathologist focuses on a single field of view at a time, the model processes a single tile at a time. (We will explore tiling in detail in blog 3.)
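
To make this concrete, here is a minimal sketch of tiling a WSI with the open-source OpenSlide library. The file name, tile size, and pyramid level are illustrative assumptions, not the parameters used for MSIntuit CRC.

```python
import openslide

slide = openslide.OpenSlide("slide.svs")  # hypothetical file path
tile_size = 224  # a common tile side length, in pixels
level = 0        # full-resolution level of the image pyramid
width, height = slide.level_dimensions[level]

tiles = []
for y in range(0, height - tile_size + 1, tile_size):
    for x in range(0, width - tile_size + 1, tile_size):
        # read_region takes (x, y) in level-0 coordinates and returns an RGBA image
        tile = slide.read_region((x, y), level, (tile_size, tile_size)).convert("RGB")
        tiles.append(((x, y), tile))

# In practice, tiles are usually streamed rather than held in memory:
# a full-resolution WSI can contain tens of thousands of tiles.
```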

The tiles are first preprocessed to discard regions that are out of focus or contain artifacts such as dust, tissue folds, and pen markings. Any of these artifacts can degrade model performance: they can obscure the relevant morphology or even introduce bias if their prevalence is confounded with the variable of interest.
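
A common first step is to keep only tiles that contain enough stained tissue. The sketch below thresholds pixel saturation with Otsu's method; the minimum tissue fraction is an assumption for illustration, not Owkin's actual preprocessing parameter.

```python
import numpy as np
from skimage.color import rgb2hsv
from skimage.filters import threshold_otsu

def keep_tile(tile_rgb: np.ndarray, min_tissue_fraction: float = 0.2) -> bool:
    """Return True if enough of the tile looks like stained tissue."""
    # Background glass is pale (low saturation); stained tissue is saturated
    saturation = rgb2hsv(tile_rgb)[..., 1]
    tissue_mask = saturation > threshold_otsu(saturation)
    return tissue_mask.mean() >= min_tissue_fraction
```

Dedicated detectors for specific artifacts (blur, folds, pen) would typically follow a simple tissue check like this one.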

The details of preparing a dataset of whole slide images may not be fully known upfront and the number of training images needed to create a reliable model is typically difficult to predict before a model is trained. Experimentation is required to learn about the data and its challenges. New variations in tissue appearance may be found that require an additional step of preprocessing.

Doctor Frederik Deman, Pathologist at ZAS, Belgium, said:

"When we deliver our results, we receive valuable feedback in return. It's not just a one-way street of giving them data, but rather an exchange of information. For instance, Owkin has helped us identify very small artifacts in our slides. Thanks to this information, we can now improve the quality of our slides."

After discarding artifacts and tiles without tissue, the remaining tissue tiles can be used for model training and validation. The MSIntuit CRC model uses a random subset of up to 8,000 tiles from each WSI.
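
In code, that subsampling step might look like the sketch below; the tile list is hypothetical, and the 8,000 cap matches the figure described above.

```python
import random

MAX_TILES = 8000  # cap on tiles per slide, as described above

def sample_tiles(tissue_tiles: list, seed: int = 0) -> list:
    """Randomly keep at most MAX_TILES tiles from one slide."""
    if len(tissue_tiles) <= MAX_TILES:
        return list(tissue_tiles)
    return random.Random(seed).sample(tissue_tiles, MAX_TILES)
```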

Tissue preparation and scanner variations also produce differences in the color of the digitized slide. These color variations need to be considered prior to model training so that the model will be robust to them. In some cases, gathering WSIs from a variety of labs and scanners is sufficient; in other cases, the variations may be simulated during training, or the stain color may be normalized computationally to standardize the appearance of hematoxylin and eosin.
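
As one example of computational normalization, the sketch below applies Reinhard-style color transfer: it matches each image's per-channel mean and standard deviation in LAB color space to those of a reference image. This is a common technique in the field, not necessarily the method used for MSIntuit CRC.

```python
import numpy as np
from skimage.color import rgb2lab, lab2rgb

def reinhard_normalize(image_rgb: np.ndarray, reference_rgb: np.ndarray) -> np.ndarray:
    """Match the image's per-channel LAB statistics to a reference image."""
    img_lab = rgb2lab(image_rgb)
    ref_lab = rgb2lab(reference_rgb)
    for c in range(3):
        mu_img, sd_img = img_lab[..., c].mean(), img_lab[..., c].std()
        mu_ref, sd_ref = ref_lab[..., c].mean(), ref_lab[..., c].std()
        # Shift and rescale this channel to the reference statistics
        img_lab[..., c] = (img_lab[..., c] - mu_img) / (sd_img + 1e-8) * sd_ref + mu_ref
    return (lab2rgb(img_lab) * 255).clip(0, 255).astype(np.uint8)
```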

Slide labels for model training

Labels are needed to use WSIs for model training. The labels enable the model to associate the patterns it finds in the images with different classes. Slide labels could be assigned by a pathologist (for example, tumor grade), come from clinical data (such as whether a patient relapsed in the subsequent five years), or be generated from a different type of analysis. The labels for training and validating MSIntuit CRC were obtained from MSI-PCR, one of the methods for determining a tumor's MSI status.
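
In practice, labeling often amounts to joining a table of slides to a table of assay results. The sketch below is hypothetical; the file names and column names are assumptions, not Owkin's actual data schema.

```python
import pandas as pd

slides = pd.read_csv("slides.csv")        # columns: slide_id, patient_id, ...
pcr = pd.read_csv("msi_pcr_results.csv")  # columns: patient_id, msi_pcr_status

labeled = slides.merge(pcr, on="patient_id")
# Binary training target: 1 for MSI-high, 0 for microsatellite stable
labeled["label"] = (labeled["msi_pcr_status"] == "MSI-H").astype(int)
```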

The image labels should be as accurate as possible to avoid confusing the model. When WSIs are annotated manually, all annotators should follow a well-defined procedure to produce consistent labels. When molecular analysis is used to obtain the labels, intratumoral heterogeneity adds a further challenge: because the PCR assay is run on different tissue than the WSI, the resulting label may not match the tissue the model actually sees.

A histology slide contaminated with pen markings.

Dataset diversity to mitigate bias

The availability of quality images and labels is a fundamental prerequisite for model training. The diversity of this dataset is equally important. Machine learning models learn patterns from the training images; when presented with previously unseen patterns, they will likely fail to predict the correct output. Therefore, for the model to be robust, it must be trained on images that represent the full diversity of tissue it will encounter in use.

Learn more about Generalizability, a way to determine how well an algorithm works in a new setting.

Diversity includes both clinical and technical variations. Technical variations encompass the staining and scanner differences mentioned above, as well as any other tissue preparation and digitization factors that may contribute to image appearance. Considerations for clinical diversity may include gender, age, race, histologic subtype, grade, or other variables. Tissue appearance can be dependent on some or all of these factors, so the model should be exposed to this diversity during training and validated on these groups. Otherwise, it may be biased and not perform reliably on some subgroups of patients.
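
One practical way to probe this during development is to hold out entire medical centers when splitting the data, so that validation measures how well the model generalizes to sites it never saw in training. A minimal sketch with scikit-learn, using toy data:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Toy table: each slide is tagged with the center that contributed it
df = pd.DataFrame({
    "slide_id": range(8),
    "medical_center": ["A", "A", "B", "B", "C", "C", "D", "D"],
    "label": [0, 1, 0, 1, 0, 1, 0, 1],
})

# Hold out whole centers so validation measures cross-site generalization
splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, val_idx = next(splitter.split(df, groups=df["medical_center"]))
train_df, val_df = df.iloc[train_idx], df.iloc[val_idx]
```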

Patient privacy and data security

Patient privacy is critical when working with WSIs and associated data. Prior to training models, all data is anonymized by removing any identifying information. However, clinical information like the patient’s diagnosis, race, gender, and age may be retained to better validate the model across different subgroups. In storing and processing this data, Owkin also follows applicable laws and regulations, particularly GDPR. Owkin diagnostic processing is done with cloud providers in the European Union.

Data security is paramount when handling patient data. Owkin performs internal and external audits regularly to assess security risks and stay current with security and privacy threats. We are also ISO27001:2013 and ISO13485:2016 certified and committed to protecting patient data while ensuring the high quality and safety of our products.

Summary

While dataset quantity and quality are critical considerations, not all challenges are known upfront. Model development is iterative. Early iterations use smaller and less diverse datasets to prove the concept. Through multiple steps of data preparation, model training, and validation, changes are made to the data and model processing until a sufficiently accurate model is achieved.

We’ll start looking at these subsequent steps in the next blog on how models are built.

Authors
Heather Couture
Hortense Lucas Deslandes

"We currently have two ongoing projects with Owkin. One involves sharing data and digitized slides, while the other aims to study our workflow. AI can be expensive and uncertain, but we are willing to take the risk because we believe it will bring significant benefits to the field of pathology."
Doctor Frederik Deman
Pathologist at ZAS, Belgium