Predicting gene expression using our novel genomic analysis tool: HE2RNA

Duration: 16 mins

Tags: ML


Date: August 3rd, 2020


  1. Owkin has published in Nature Communications its exciting new genomic analysis tool (HE2RNA) to predict gene expression from histology whole slide images.

  2. This genomic analysis tool has the potential to greatly facilitate patient diagnosis and improve the prediction of response to treatment and survival outcomes. 

  3. HE2RNA is interpretable by nature and can help pathologists:

    1. Predict genes involved in cancer development; and

    2. Predict tumor status and response to therapies

HE2RNA Spotlights – See what key industry leaders are saying about HE2RNA

Eric Topol, Stanford AIMI Symposium, AI In Medicine and Imaging, Aug. 5, 2020 (video)

The HE2RNA study uses deep learning for predicting RNA-seq in tumors. It basically gives the probability of what’s the transcriptome signature. It’s really extraordinary: these are things that human eyes cannot see, and, in fact, oftentimes, pathologists can’t even agree on the slightest interpretation; so this is a big boost.

The Cancer Letter, Oct. 30, 2020

Owkin’s model is the latest product in a diagnostics race at academic institutions and health IT companies to create deep learning algorithms that can be used to improve cancer screening, early detection, and clinical decision support.

Genome Web, Aug. 21, 2020

Although numerous academic groups and companies are exploring the implementation of machine learning or AI to improve and better standardize histopathology, Owkin has taken this a step further than most, showing in a recent study that its approach can not only improve morphologic analysis or visual biomarker detection, but can predict or recapitulate a cancer’s gene expression levels without the need to actually sequence DNA.

From Data to Nature

Since October 2018, our team at Owkin has worked on a novel machine learning (‘ML’) model called HE2RNA that predicts gene expression from histology slides. (An ML model is a file that has been trained to recognize certain types of patterns.) Nature Communications has published the results of this year-long project.

In this article, we introduce you to this model and explain some of the potential clinical applications. Additionally, we’ve made an adapted version of our HE2RNA model available on our research-friendly AI platform, for you to try out with your own datasets. (Alternatively, you can visualize and explore the results from this paper via our HE2RNA demo.) 90% of the data used to train and test this model was from The Cancer Genome Atlas (‘TCGA’), a publicly available dataset, with the remaining data sourced from our partner hospitals. We are actively exploring new datasets to expand the scope and improve the performance of this model. 

An introduction to our Genomic Analysis Tool (HE2RNA)

Gene expression is the process of converting the instructions in our DNA into a functional gene product, such as a protein or messenger RNA (‘mRNA’). Cells become cancer cells largely because of gene mutations (permanent alterations in the DNA sequence that change the function and expression of a given gene) or because of changes in the cell environment that directly affect the expression of certain genes. Studying changes in gene expression can help characterize a specific tumor. Additionally, comparing the gene expression of cancer patients with that of healthy individuals can help to decipher the biological changes responsible for the development of a given cancer.

Genomic analysis is the study of DNA sequences for mutations or mRNA sequences for gene expression. Whole transcriptome sequencing techniques (RNA-Seq) and dedicated bioinformatics tools can identify gene expression in cancer. However, these costly and time-consuming analyses require specialized researchers (known as bioinformaticians) to interpret the results. Therefore, not all medical centers routinely use them.

An introduction to Histology

In oncology, biopsies (samples of the tumor extracted from a given patient) and resections (extractions of larger portions of a tumor) are performed routinely in hospitals and research centers to confirm the diagnosis of a patient and define the exact cancer type. Following extraction, the samples are cut into thin pieces, stained to mark different parts of the cells (the nucleus in blue and the rest of the cellular space in pink) and analyzed under a microscope by a pathologist. This analysis is called histology, and the digitized images of the samples are called whole slide images (‘WSIs’), also known as histology slides.

In recent years, deep learning has had a tremendous impact on various fields of science, with notable improvements in speech recognition and image recognition. More recently, ML models have been applied to histology WSIs to improve the performance of pathologists in determining the diagnosis and grade of cancer patients. While it is becoming clear that the application of such models to tissue-based pathology can be very useful, few attempts have been made to connect molecular signatures, such as gene expression, directly to the patterns visible within histology slides.

An ML model that can use these ubiquitously available histology slides to determine gene expression without the need for expensive sequencing techniques has the potential to be an incredibly useful clinical tool.

Our Genomic Analysis Model (HE2RNA)

In our recent paper, published in Nature Communications, we describe how we developed a model capable of predicting gene expression from WSIs. This research has the potential to greatly facilitate patient diagnosis and improve the prediction of response to treatment and survival outcomes.

Want to build your own model? Check out our interactive guide below.

Step 1 to building HE2RNA

To create your own algorithm and build your model you will need to obtain two kinds of data: 

Firstly – A dataset of several biopsy slides from cancer patients – To capture the heterogeneity between different cancer types; and

Secondly – Genomic data from the same patient cohort (group of patients) – To capture the gene expression profiles for each patient. 

Step 2 to building HE2RNA

Okay – Now that you have these two major ingredients, let’s make our recipe. 

As a first step, you will have to split each dataset (the biopsies and the genomic data) into two groups: a training set and a test set. 

  • The training set – This will be used by your model to learn which WSI is associated with the right gene expression profile. 

  • The test set – This will be used to test the model’s performance at predicting the level of gene expression for each WSI. 

High performance means that the model is able to associate the right gene expression profile with the right WSI more efficiently than a pathologist could by using a microscope.
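The split described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the paper: the patient IDs, file names, the 20% test fraction, and the choice to split by patient (so that a patient's slide and expression profile always stay together) are all assumptions made for the example.

```python
import random

def split_patients(patient_ids, test_fraction=0.2, seed=0):
    """Split patient IDs into a training set and a test set.

    Splitting by patient keeps each patient's slide and gene expression
    profile on the same side of the split (an assumption of this sketch).
    """
    ids = sorted(set(patient_ids))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]

# Hypothetical cohort: one slide and one expression profile per patient.
slides = {f"patient_{i}": f"slide_{i}.svs" for i in range(10)}
expression = {f"patient_{i}": [0.1 * i, 0.2 * i] for i in range(10)}

train_ids, test_ids = split_patients(slides.keys())
train = [(slides[p], expression[p]) for p in train_ids]
test = [(slides[p], expression[p]) for p in test_ids]
```

The model would only ever see the pairs in `train` while learning; the pairs in `test` are held back to measure how well it predicts expression for slides it has never seen.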

Of course, as a biologist in training, you might be asking the following questions:

  • What exactly did my model learn?

  • How can I understand and verify that it is learning the correct information?

These are essential questions that data scientists and researchers have to take into account and answer. The key to answering these questions is being able to interpret why our model made the predictions it did. This is interpretability and is a fundamental topic in ML.

Our genomic analysis model is interpretable

To describe the interpretability feature of our genomic analysis tool, HE2RNA, we must rewind to step 2 above. There we explained that the model learns to associate the correct WSI with the correct gene expression profile.

How does the model learn to do this?

  1. For each WSI, the model assigns a score for each gene in the genomic profile; and then

  2. With this score, we can calculate the performance of the model. We do this by analyzing how many times it correctly or incorrectly predicts each gene in all of the images. 
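One way to turn these per-gene scores into a performance figure is sketched below. The text above does not name a specific metric, so this example assumes a Pearson correlation, computed gene by gene, between the predicted scores and the expression measured by sequencing. The function name and the toy numbers are hypothetical.

```python
import numpy as np

def per_gene_correlation(predicted, measured):
    """Pearson correlation between predicted and measured expression,
    computed separately for each gene (column) across slides (rows)."""
    predicted = np.asarray(predicted, dtype=float)
    measured = np.asarray(measured, dtype=float)
    p = predicted - predicted.mean(axis=0)
    m = measured - measured.mean(axis=0)
    denom = np.sqrt((p ** 2).sum(axis=0) * (m ** 2).sum(axis=0))
    # Genes with no variation get a correlation of 0 instead of a division error.
    return np.divide((p * m).sum(axis=0), denom,
                     out=np.zeros_like(denom), where=denom > 0)

# Toy data: 4 slides (rows) x 2 genes (columns).
measured = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 4.0], [4.0, 1.0]])
predicted = np.array([[1.1, 1.0], [2.1, 2.0], [2.9, 3.0], [4.2, 4.0]])
scores = per_gene_correlation(predicted, measured)
```

In this toy example the first gene is predicted well (correlation close to 1), while the second is predicted poorly (negative correlation), which is the kind of gene-by-gene picture a performance analysis needs.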

However, this gene score would be hard for clinicians to interpret for diagnosis purposes. (Do not forget this is our ultimate application). 

So, to decipher what the model has learned and to improve its performance, we developed a heatmap. This is a graphical representation of data that uses a color-coding system to represent different values. It is built as follows:

  1. Each image is virtually cut into small squares (aka tiles);

  2. The model assigns a gene expression score to each tile of each slide;

  3. Each score is mapped to a color; and then

  4. The colors form a heatmap for each gene on each image.
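The four steps above can be sketched as follows. This is an illustration under stated assumptions, not the HE2RNA implementation: the image is a small grayscale array, the mean tile intensity stands in for the model's per-tile gene score, and the blue-to-red color scale is an arbitrary choice. The names `tile_scores` and `scores_to_colors` are hypothetical.

```python
import numpy as np

def tile_scores(image, tile_size, score_fn):
    """Cut a 2-D image into non-overlapping square tiles and score each one."""
    rows = image.shape[0] // tile_size
    cols = image.shape[1] // tile_size
    scores = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            tile = image[r * tile_size:(r + 1) * tile_size,
                         c * tile_size:(c + 1) * tile_size]
            scores[r, c] = score_fn(tile)
    return scores

def scores_to_colors(scores):
    """Map scores linearly to RGB: the lowest score is blue, the highest is red."""
    span = scores.max() - scores.min()
    norm = (scores - scores.min()) / span if span > 0 else np.zeros_like(scores)
    rgb = np.zeros(scores.shape + (3,))
    rgb[..., 0] = norm        # red channel grows with the score
    rgb[..., 2] = 1.0 - norm  # blue channel shrinks with the score
    return rgb

# Toy "slide": an 8x8 grayscale image whose right half is brighter.
slide = np.zeros((8, 8))
slide[:, 4:] = 1.0

# Placeholder for the model: mean intensity stands in for a per-tile gene score.
heatmap_scores = tile_scores(slide, tile_size=4, score_fn=np.mean)
heatmap = scores_to_colors(heatmap_scores)
```

In a real setting, `score_fn` would be replaced by the trained model's prediction for one gene, and one such heatmap would be produced per gene and per slide.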

This is a hugely useful genomic analysis tool for the clinician

It allows them to identify exactly where on the slide the gene expression occurred.

Figure 1. An example of an HE2RNA heatmap

To test the accuracy of our interpretable heatmap, we selected CD3. This is a gene that encodes the CD3 protein expressed at the surface of certain immune cells. CD3 is one of the genes that our HE2RNA model can detect. Researchers routinely use immunohistochemistry to detect CD3: the tissue sample is stained with an antibody that specifically recognizes CD3. We compared our HE2RNA heatmap of CD3 (Figure 1) to a standard CD3 stain of the same tissue (Figure 2) from a liver cancer sample obtained from one of our partner hospitals.

We then compared the two images to calculate how many times HE2RNA was right or wrong. Not only was there a very low error rate but HE2RNA proved that it was able to predict the location of the expression of CD3 on a slide. It was able to do this even if it learned its global expression from a different part of the tumor. 
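To give a feel for what such a comparison can look like, here is a minimal sketch. It assumes that both images have already been reduced to one value per tile, that a tile counts as CD3-positive when its predicted score reaches a threshold of 0.5, and that agreement is simply the fraction of tiles where the heatmap and the stain match. These choices, the function name, and the toy values are assumptions for illustration; they are not the procedure from the paper.

```python
import numpy as np

def tile_agreement(predicted_scores, stain_positive, threshold=0.5):
    """Fraction of tiles where the thresholded prediction matches the stain."""
    predicted_positive = np.asarray(predicted_scores) >= threshold
    stain_positive = np.asarray(stain_positive, dtype=bool)
    return float((predicted_positive == stain_positive).mean())

# Toy 2x2 grid of tiles: predicted CD3 scores and stain-derived labels.
predicted = np.array([[0.9, 0.2], [0.7, 0.4]])
stained = np.array([[True, False], [False, False]])

agreement = tile_agreement(predicted, stained)  # 3 of the 4 tiles agree
```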

Figure 2. A CD3 stained slide

Now you might wonder: it is great to have all this work done, but what comes next?

This is actually when the most exciting part of the project starts. This model has a multitude of potential applications given the number and broad role of genomic mutations in disease. Here we discuss two possible applications that we have explored.

HE2RNA Application – Prediction of genes involved in cancer development

Cancer cells are abnormal cells arising from uncontrolled growth and the unstoppable survival of cells. In the introduction, we explained that normal cells become cancer cells largely because of mutations that change the expression levels of certain genes. Using our novel genomic analysis tool, HE2RNA, we have demonstrated that we can predict the expression of these genes in many cancer types. 

Tumors must be invisible to the immune system to survive. To achieve this, tumor cells express genes that block the infiltration of immune cells (T cells) into the tumor. Additionally, if these immune cells succeed in entering the tumor, cancer cells also express genes that block the activation and action of these T cells.

You might be wondering…

Why is it interesting to predict (i) if immune cells have infiltrated the tumor and (ii) if they are being inhibited by the tumor cells?

These are important clinical questions. Clinicians use the presence of inactive infiltrated immune cells as a biomarker to predict patient response to a novel therapeutic approach called immunotherapy. This treatment aims to ‘flag’ the cancer cells to the immune system so they can be neutralized. It works by reactivating infiltrated and inhibited immune cells in the tumor environment. Today, immune cell infiltration can be detected by analyzing tumor biopsies, but determining the activation status of these cells is tedious and requires the analysis of genomic data. HE2RNA can predict both immune cell infiltration and activation status. This will be a useful tool for clinicians deciding which patients will benefit from immunotherapy.

HE2RNA Application – Prediction of tumor status and response to therapies

Our genomic analysis tool can predict the clinical status of colorectal cancer patients. Microsatellite instability (MSI) is a molecular status currently used to predict survival and response to immunotherapy treatment in patients with colorectal cancer. It is a condition of high predisposition to mutations that results from impaired DNA repair.

When a cell divides, it must replicate its DNA. During this replication, some mistakes can appear. To limit such mistakes, proteins check the newly synthesized DNA and fix any errors. This mechanism is known as the DNA repair pathway. Mutations affecting the proteins that control this pathway are very common in colorectal cancer (and other cancer types).

Microsatellites are tandem repeats of short DNA sequences found throughout the genome. These regions are often miscopied, and the mistakes are usually fixed by proteins of the DNA repair pathway. However, mutations in the DNA repair pathway result in an accumulation of mistakes in these microsatellite sequences. This molecular feature is known as microsatellite instability (MSI). It is clinically relevant because such mutations make cancer cells more sensitive to immunotherapy.
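To make the idea of a tandem repeat concrete, the short sketch below scans a DNA string for a short unit repeated back to back. The function name, the example sequence, and the thresholds (units of 1 to 6 bases, at least 5 repeats) are illustrative assumptions; real MSI testing compares repeat lengths between tumor and normal tissue and is far more involved.

```python
import re

def find_microsatellites(sequence, min_unit=1, max_unit=6, min_repeats=5):
    """Return (start, unit, repeat_count) for each tandem repeat found."""
    # A short unit (captured lazily, so the shortest unit wins) followed
    # by at least (min_repeats - 1) further copies of the same unit.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_repeats - 1)
    )
    hits = []
    for match in pattern.finditer(sequence.upper()):
        unit = match.group(1)
        hits.append((match.start(), unit, len(match.group(0)) // len(unit)))
    return hits

# Hypothetical sequence containing a (CA)x6 repeat and an (A)x8 repeat.
dna = "GGT" + "CA" * 6 + "TTG" + "A" * 8 + "C"
repeats = find_microsatellites(dna)
```

When DNA repair is impaired, the number of copies in stretches like these tends to drift from one cell generation to the next, which is the instability that MSI testing looks for.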

Our genomic analysis tool HE2RNA can predict MSI status

To test the robustness of its performance, we trained HE2RNA to predict MSI gene expression in a dataset of different cancer types from one partner hospital. We then tested the model using a second dataset from another hospital with only histology data (no genomic information). Strikingly, we demonstrated that even in this particularly challenging setting, HE2RNA was still able to predict patient MSI status.

This is outstanding as it highlights that the knowledge acquired by HE2RNA from the training dataset can be transferred to another dataset to answer specific molecular questions and to help the clinician make treatment decisions.

Try out our genomic analysis tool (HE2RNA) yourself

Now that you have learned how HE2RNA works, you might want to experiment and explore our discoveries for yourself: go to our Studio Demo dedicated to the HE2RNA model to discover more about the cohort, the heatmaps, and the results.