Predicting gene expression with machine learning

Duration: 16 mins

Tags: Histogenomics


Date: December 10th, 2020



What is precision medicine?

Determining the clinical diagnosis and prognosis of a patient can be hugely complicated. This challenge is mainly due to the enormous number of different biological changes (the heterogeneity of changes) that accompany disease development in patients’ bodies. Researchers have studied this complexity heavily over the past decade, especially in oncology.

Recently, a better understanding and characterization of this heterogeneity has led clinicians and researchers to consider each tumor as unique and specific to its patient. This awareness has given rise to a field of research known as precision medicine. In precision medicine, a clinician chooses a patient’s treatment based on a better understanding of the specific characteristics of that patient’s tumor.

How cancer cells form

Cancer cells are abnormal cells that arise due to the accumulation of mistakes in a patient’s DNA. A good analogy to describe DNA is to view it as a recipe book for cells. 

  • The ingredients are the genes. Genes are DNA segments composed of 4 basic units (A, C, G and T) called nucleotides.

  • The cookies are the proteins. The genes (or ingredients) encode proteins, which are the functional units of cells. 

The extremely long DNA molecule is tightly compacted to fit into a cell. Importantly, as with any recipe that is copied by hand, mistakes can slip in each time a cell copies its DNA. These mistakes can disrupt the recipe and alter the cookies’ taste, that is, the functionality of our proteins.
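To make the recipe analogy concrete, here is a minimal Python sketch of how a single-letter change in a gene can change the resulting protein. The codon table is deliberately partial (the real one has 64 entries), the two sequences are made up for illustration, and `translate` is a hypothetical helper rather than a function from any library.

```python
# Partial codon table: DNA codon -> one-letter amino acid code.
# Only the codons used in this toy example are listed.
CODON_TABLE = {
    "ATG": "M",  # methionine (start)
    "GAG": "E",  # glutamate
    "GTG": "V",  # valine
    "AAA": "K",  # lysine
    "TAA": "*",  # stop
}

def translate(dna):
    """Read a coding sequence three letters at a time until a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

normal = "ATGGAGAAATAA"   # made-up "correct recipe"
mutated = "ATGGTGAAATAA"  # same recipe with one letter changed (A -> T)

print(translate(normal))   # MEK
print(translate(mutated))  # MVK
```

A single changed letter swaps one amino acid for another (E becomes V here), which is the kind of copy mistake that can alter how a protein works.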

Using Genomic Analysis to characterize mutations in cancer cells

DNA Sequencing to predict gene expression

One way to characterize the unique biological aberrations in cancerous cells is to study the deleterious modifications that occur in the DNA sequence (known as mutations). These mutations lead to the expression of error-containing genes, which, in turn, can give rise to non-functional proteins. The technique used to read the DNA and detect these mutations is called DNA sequencing. See our upcoming blog on ML models to predict cancer mutations.

RNA Sequencing to predict gene expression

Another way to decipher the biological changes responsible for developing a given cancer is to analyze the level of expression of all the genes (aka the transcriptome) in cancerous cells and to compare it both with other patients who have the same cancer and with healthy people. The technique used for this analysis is called RNA sequencing.

Together, we refer to these techniques as genomic analysis. They are costly and are therefore performed routinely only in some research hospitals rather than worldwide. Genomic analysis also requires specialized researchers (known as bioinformaticians) to interpret the results and identify the genes that are most highly expressed in cancer patients compared to healthy people.
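As a rough illustration of what "most expressed in cancer patients compared to healthy people" means in practice, the sketch below ranks genes by their log2 fold change between two groups. The gene names and expression values are invented, the pseudocount of 1 is an arbitrary assumption, and a real analysis would add proper normalization and statistical testing.

```python
import math

# Made-up normalized expression values (arbitrary units), three samples per group.
tumor = {
    "GENE_A": [80.0, 120.0, 100.0],
    "GENE_B": [10.0, 12.0, 8.0],
    "GENE_C": [5.0, 5.0, 5.0],
}
healthy = {
    "GENE_A": [20.0, 30.0, 25.0],
    "GENE_B": [11.0, 9.0, 10.0],
    "GENE_C": [20.0, 20.0, 20.0],
}

def log2_fold_change(case, control, pseudocount=1.0):
    """log2 ratio of group means; the pseudocount avoids division by zero."""
    mean_case = sum(case) / len(case)
    mean_control = sum(control) / len(control)
    return math.log2((mean_case + pseudocount) / (mean_control + pseudocount))

fold_changes = {gene: log2_fold_change(tumor[gene], healthy[gene]) for gene in tumor}

# Genes with the largest positive value are the most over-expressed in tumors.
ranked = sorted(fold_changes, key=fold_changes.get, reverse=True)
for gene in ranked:
    print(gene, round(fold_changes[gene], 2))
```

A positive value means higher expression in the tumor group, a negative value means lower expression, and a value near zero means no difference.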

Machine Learning & Genomic Analysis

Overcoming the difficulty of acquiring genomic data

Researchers sought several solutions to overcome the limitations in acquiring genomic data. An unexpected one came from mathematicians and data scientists. They reasoned that, since DNA uses an alphabet of 4 letters to make up genes (words), parallels must exist between language processing and genome interpretation. This prompted research into whether language analysis methods could be used to understand genomics. However, the language of DNA is truly complex and involves a large number of different variables.
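One simple way to see the language parallel is to cut a DNA sequence into short overlapping "words" (often called k-mers) and count them, much as one would count words in a text. The sketch below is only an illustration of that idea: the sequence is made up, `kmer_tokens` is a hypothetical helper, and the choice of k=3 is arbitrary.

```python
from collections import Counter

def kmer_tokens(sequence, k=3):
    """Split a DNA string into overlapping words of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

sequence = "ATGCGATGCA"  # made-up sequence
tokens = kmer_tokens(sequence, k=3)
counts = Counter(tokens)

print(tokens)
print(counts["ATG"])  # how often the "word" ATG appears
```

Counts like these can then be fed to the same kinds of models that are used on word counts in text.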

This is where machine learning (‘ML’) enters the equation. Mathematicians and computer scientists created ML to train machines to learn highly complex tasks from data, including estimating what is likely to happen in the future. These ML-based predictions have concrete uses in healthcare.

1. How can ML help predict gene expression? 

The DNA in each human cell contains all the information needed to express 20,000-25,000 genes. However, not every cell expresses every gene at the same time or at the same level. Moreover, when you perform RNA sequencing of cancer cells, you do not usually sequence just one cell but several hundred thousand cells in bulk. RNA sequencing therefore provides researchers with a large amount of complex information, more than a human brain could process in a lifetime. So, ML models use genomic data to estimate several aspects of gene expression and regulation. Sometimes basic statistical methods are enough, but in other cases the complexity and the hidden interactions between the expression of different genes require the more sophisticated algorithms of modern deep learning.

Initially, researchers developed AI models for healthcare to predict the association between gene expression and disease development. In oncology and in diseases with a hereditary component (such as Alzheimer’s and Parkinson’s), disease development depends on the deregulation of the expression levels of specific genes. However, we have not yet identified all the genes implicated in each disease. Discovering these genes is vital to help researchers find novel therapies for affected patients.

While data scientists were exploring ML’s use on genomic data, biologists came up with another way of facilitating genomic data acquisition, known as targeted sequencing.

Targeted Sequencing

This technique involves sequencing only a few genes known to be involved in a specific disease rather than the full DNA sequence. This method is cheaper and faster than sequencing the entire genome but also limits the quantity of information obtained. To fill this gap, researchers built ML models that start from this shortlist of reference genes and predict genome-wide expression. Genes are expressed in a coordinated manner, so the expression levels of different genes are highly correlated. This correlation means that one can infer the expression levels of the remaining genes from the profiles of a subset of genes (those measured by targeted sequencing). This has the potential to reduce the cost and complexity of gene expression profiling.
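The idea of inferring an unmeasured gene from a measured one can be sketched with the simplest possible model: a straight line fitted by least squares. This is a toy stand-in, not the deep learning models used in practice, which combine many reference genes at once. All values are invented and `fit_line` is a hypothetical helper.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Training samples where both genes were measured (made-up values).
reference_gene = [1.0, 2.0, 3.0, 4.0, 5.0]  # on the targeted panel
target_gene = [2.1, 3.9, 6.2, 7.8, 10.0]    # not on the targeted panel

slope, intercept = fit_line(reference_gene, target_gene)

# New sample: only the reference gene was sequenced.
new_reference_value = 3.5
predicted_target = slope * new_reference_value + intercept
print(round(predicted_target, 3))
```

The stronger the correlation between the two genes, the more reliable such a prediction becomes, which is why coordinated expression makes this approach workable.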

Such use of deep learning models applied to gene expression data is not limited to human cells. Indeed, similar tools exist to generate hypotheses about a variety of biological events in bacterial cells. For instance, by identifying the gene expression differences between bacterial strains, researchers can explain their different sensitivities to antibiotics. Researchers could also model the bacterial cellular response to low-oxygen environments to understand how bacteria adapt to diverse environments.

2. How can ML help predict gene expression regulation?

To add complexity, as we discussed earlier, gene expression varies between cells, and within a given cell it varies over the cell’s lifetime and with its environment. How? Regulatory regions, located around genes, control the timing and level of gene expression. Access to these regulatory regions within the DNA is managed, in part, by altering how tightly the DNA is compacted at the regulatory region’s location.

Are you getting confused?

Just imagine DNA as a thousand dictionaries compacted and folded together. To get access to some words (or genes) written inside this conglomerate, you will need to unfold it and read the letters (or nucleotides). This same mechanism exists to read genes and get access to their regulatory regions.

Regulatory regions are specific DNA sequences that do not code for proteins. They are recognized by proteins known as transcription factors. These factors are the regulators that help determine the timing and level of expression of a given gene. We call this process gene expression regulation. As you can imagine, it is a highly complex process that depends on many variables and is hard to predict.

ML models, however, can help predict the level of gene expression at a specific time by learning how to estimate the folded status of the DNA structure and which transcription factors are bound to the regulatory elements. 

Another genomic mechanism, which arose during the evolution of many organisms, allows cells to create several proteins from the same gene. In other words, a single DNA sequence can give rise to several different gene products. How? Some parts of the gene are occasionally skipped, or spliced out, of the sequence, forming several different isoforms (different forms) of the same sequence (as seen in figure 1). We call this mechanism gene splicing.

Figure 1 – Gene splicing
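To illustrate how skipping parts of a gene produces several isoforms, the sketch below enumerates the possible combinations for a made-up gene with four segments (exons). It assumes, as a simplification, that the first and last segments are always kept and that the order never changes; `exon_skipping_isoforms` is a hypothetical helper and real splicing is far more constrained.

```python
from itertools import combinations

def exon_skipping_isoforms(exons):
    """List isoforms that keep the first and last exon and any
    subset of the internal exons, preserving their order."""
    first, *internal, last = exons
    isoforms = []
    # Start with all internal exons kept, then skip more and more of them.
    for n_kept in range(len(internal), -1, -1):
        for kept in combinations(internal, n_kept):
            isoforms.append([first, *kept, last])
    return isoforms

exons = ["E1", "E2", "E3", "E4"]  # made-up gene with four exons
isoforms = exon_skipping_isoforms(exons)

for isoform in isoforms:
    print("-".join(isoform))
```

With only two internal segments that can be skipped, this toy gene already yields four isoforms, and the number doubles with each additional skippable segment.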

Researchers developed ML models to predict the level of expression of each isoform of a specific gene in a given cell at a particular time in its lifetime. It is crucial to understand isoform expression because each isoform can be involved in a different cellular function and have a different impact on disease development.

These ML models are useful for fundamental biology and for understanding how gene expression can vary. However, ML applied to genomic data also has more practical uses, such as predicting patient outcome and prognosis.

Use gene expression to predict clinical prognosis and outcome and aid decision making

A useful cancer prognosis estimates a cancer’s fate, the probabilities of its recurrence and progression, and the patient’s survival. ML models exist to help clinicians define patient prognosis by classifying patients into subgroups. Indeed, as we explained above, each patient’s cancer is unique and specific at the molecular and cellular level. Clinicians study these specificities to group patients into subgroups (or subtypes) in order to better understand what their diseases have in common and how to treat them. Accurate cancer prognosis greatly benefits the clinical management of cancer patients, as it helps predict the most beneficial type of therapy and its timeline for a given patient. Clinicians use ML models for such tasks in many cancers.

For example, models exist to identify genes that are highly expressed by patients with a specific blood cancer (known as multiple myeloma). Patients with higher expression of these genes were more likely to have a recurrence and a worse prognosis. The development of such models is only beginning, and many similarly exciting results that will help clinicians are still to come.
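A very simplified version of grouping patients by a gene signature is sketched below: each patient gets a score (the average expression of the signature genes) and is labeled "high" or "low" relative to the median score. The patients, genes and values are invented, and the median split is an assumption chosen for simplicity; published models are more elaborate and are validated against real outcomes.

```python
from statistics import mean, median

signature_genes = ["GENE_X", "GENE_Y"]  # hypothetical signature

# Made-up expression values per patient.
patients = {
    "P1": {"GENE_X": 8.0, "GENE_Y": 7.0},
    "P2": {"GENE_X": 2.0, "GENE_Y": 3.0},
    "P3": {"GENE_X": 6.0, "GENE_Y": 9.0},
    "P4": {"GENE_X": 1.0, "GENE_Y": 2.0},
}

# Signature score: average expression of the signature genes.
scores = {
    patient: mean(expression[gene] for gene in signature_genes)
    for patient, expression in patients.items()
}

# Split patients around the median score.
cutoff = median(scores.values())
groups = {
    patient: "high" if score > cutoff else "low"
    for patient, score in scores.items()
}
print(groups)
```

In a real study, the two groups would then be compared on recurrence and survival to check whether the signature actually carries prognostic information.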

1. Predict gene expression from biopsy slides

In oncology, biopsies (pieces of the tumor extracted from a given patient) are routinely performed in hospitals and research centers and are cheaper than genomic analyses. They confirm a patient’s diagnosis and define the exact cancer type. Researchers cut these pieces of the tumor into thin sections and stain them to mark the different parts of the cells (the nucleus in blue and the rest of the cellular space in pink). These sections (called whole-slide images once they are scanned) are then observed under a microscope and analyzed by a specialist known as a pathologist.

For decades, deep learning algorithms have been used to analyze images of the real world (such as pictures of cats and dogs). More recently, researchers have started using these tools to analyze biopsy images and help pathologists with their diagnosis and prognosis tasks. For example, AI models exist to predict whether skin lesions are malignant from skin biopsies or dermatology images.

Such models were already impressive and exciting, but researchers are now taking advantage of both worlds: combining genomic and imaging data to build models that can directly predict molecular information from biopsy images.

Let’s look at an example in breast cancer.

Over the last ten years, research has described four primary subtypes of breast cancer. These depend on the level of expression of several hormone receptors, the cellular structures that allow cells to detect hormone levels in their environment. As you might know, such hormones are essential because they control breast development during pregnancy. However, elevated expression (overexpression) of these receptors is associated with the uncontrolled proliferation of breast cells and leads to cancer development. It is vital to know which type of receptor a patient’s tumor overexpresses because specific therapies exist to target each receptor type. For example, clinicians use Tamoxifen, a drug that specifically blocks the Estrogen Receptor (ER), to treat patients whose tumors express this receptor. For patients whose tumors express HER2, clinicians use an alternative therapeutic approach, Herceptin.

Building on such revolutionary uses of AI in healthcare, we, and others, decided to go further and developed a model capable of predicting gene expression from whole-slide images in 28 cancer types. Read our blog on HE2RNA. This model could help visualize a given cancer’s molecular status and predict patients’ clinical outcomes without any genomic data. It is an incredibly powerful new tool because, as we explained above, biopsy images, unlike sequencing data, are routinely obtained in almost any hospital. Our model could transfer knowledge gained from the genomic data in one hospital to improve prognosis tasks in another hospital.

2. Predict gene expression from radiology images

One limitation of using biopsy slides for a personalized medicine approach is that acquiring tissue samples requires an invasive procedure. Other tools routinely used for patient diagnosis, prognosis and outcome are imaging techniques (such as Computed Tomography (‘CT’) scans, radiology and echography). These images are essential. They offer a unique opportunity to obtain a more comprehensive view of an organ affected by a disease, or of an entire tumor, without being invasive. Radiographic technologies provide valuable images that help clinicians understand a cancer’s heterogeneity and monitor its response to treatment. Such an anatomic overview of the entire tumor is not possible with genomic data and biopsies, because these analyses focus on the specific portion of the tumor extracted during the biopsy or resection procedure.

AI tools exist to extract radiomic features from radiology images to predict patient outcome and prognosis. These features reflect a wide variety of image parameters captured by the algorithm to distinguish between imaging phenotypes. A well-known application of such models is the grading of brain tumors from MRI images. Classifying tumors as low grade or high grade is a significant task for clinicians, as they use it to decide on their patients’ treatment. However, such grading can sometimes be challenging depending on the tumor’s location, its size, or the ratio of tumor to normal tissue.


As with biopsy images, clinicians decided to associate imaging features with genomic data, creating an innovative approach known as radiogenomics. This systematic association between imaging traits and gene expression offers useful insights in both directions: imaging features can predict gene expression, and, conversely, gene signatures can help explain visible radiomic characteristics. Many ML models exist for this purpose. Recently, scientific publications demonstrated that CT scan images can predict the level of immune cell infiltration within lung tumors. This discovery is a significant breakthrough because tumors with high immune infiltration are more likely to respond to immunotherapy, an innovative therapeutic approach extensively used in oncology.
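The simplest way to check whether an imaging feature and a gene's expression are associated is to compute their correlation across patients. The sketch below does this with the Pearson coefficient on made-up values. The feature name, the gene and the numbers are all hypothetical, and actual radiogenomic studies rely on many features, many genes and far larger cohorts.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (spread_x * spread_y)

# One value per patient (made-up data).
texture_feature = [0.2, 0.4, 0.5, 0.7, 0.9]   # hypothetical radiomic feature
immune_gene_expr = [1.0, 2.0, 2.4, 3.6, 4.5]  # hypothetical immune-related gene

r = pearson(texture_feature, immune_gene_expr)
print(round(r, 3))
```

A coefficient close to 1 or -1 suggests that the imaging feature carries information about the gene's expression, while a value near 0 suggests it does not.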


In this blog post, we have discussed the different uses of ML in predicting gene expression. At Owkin, we believe that a combination of AI and collaborative research is key to unlocking new discoveries and advancing medical research. Our lab team’s expertise covers all types of data modalities and ML, as well as biology and clinical practice.