May 8, 2023
Nature Biotechnology

Owkin's Jean-Philippe Vert shared his thoughts on how generative AI might disrupt drug discovery with Nature Biotechnology

Abstract

In the few short months since the release of ChatGPT [1,2], the potential for large language models (LLMs) and generative artificial intelligence (AI) to disrupt fields as diverse as art, marketing, journalism, copywriting, law and software engineering is already being realized. These technologies use deep learning models trained on enormous amounts of data to generate new texts or images. Although trained only to capture statistical regularities in the training data, their ability, once trained, to imitate human language convincingly; to generate realistic images, sounds or software; or to solve tasks that apparently involve higher cognitive functions such as reasoning has caught the world by surprise. As such, they are also poised to disrupt in many ways how scientists and engineers understand biology and discover and develop new treatments.

First, existing LLMs are already able to act as extraordinary productivity tools that allow data scientists and engineers, including those working on medical research and drug discovery, to do their jobs more efficiently. Solutions like GitHub Copilot and ChatGPT are being rapidly adopted by software engineering teams to write high-quality code more quickly [3], and data scientists are also increasingly generating plots and drafting reports and presentations with the help of AI-based assistants. Current LLMs can also help with more technical and complex tasks, such as the long-standing problem of data harmonization across multiple data centers, which still depends largely on manual data processing. In particular, one increasingly popular approach to harmonizing heterogeneous multicohort datasets is to treat harmonization as a style transfer problem and use generative AI to synthesize samples belonging to a missing modality or domain [4]. By automating and streamlining the technical procedures associated with integrating data from heterogeneous sources, LLMs and generative AI models will accelerate the growth of collaborative data networks, allowing AI models to be fueled by unprecedentedly large datasets. Furthermore, beyond harmonization, synthetic data generation can solve the problem of anonymizing sensitive data when it comes with differential privacy guarantees [5], thus providing a promising technical solution to grow collaborative data networks while preserving data privacy for each contributing partner [6].
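
To make the privacy point concrete, below is a minimal sketch of the Laplace mechanism, a textbook building block behind many differential privacy guarantees. It is purely illustrative: the toy cohort, the bound on ages and the choice of epsilon are assumptions, not the method of any system cited above.

```python
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Release a statistic with epsilon-differential privacy by adding
    Laplace noise scaled to the statistic's sensitivity."""
    scale = sensitivity / epsilon
    return true_value + np.random.laplace(loc=0.0, scale=scale)

# Example: privately release the mean age of a toy cohort.
ages = np.array([34, 51, 47, 62, 29])
true_mean = ages.mean()
# If ages are assumed bounded in [0, 100], the mean of n values has sensitivity 100 / n.
sensitivity = 100 / len(ages)
private_mean = laplace_mechanism(true_mean, sensitivity, epsilon=1.0)
print(f"true mean: {true_mean:.1f}, private release: {private_mean:.1f}")
```

A larger epsilon means less noise and weaker privacy; real deployments tune this trade-off carefully.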

Second, with their abilities to generate not only texts and images, but also novel small molecules [7], nucleic acid sequences [8,9] and proteins [10,11] with a desired structure or function, deep generative models are increasingly used in drug discovery to quickly explore a broad space of candidate therapeutics and optimize them in silico for a given target or function. For example, Shanehsazzadeh et al. [12] used a deep generative model to generate variants of trastuzumab, a monoclonal antibody targeting human epidermal growth factor receptor 2 (HER2) used to treat breast and stomach cancer, and experimentally validated three AI-generated variants with low sequence similarity to trastuzumab but better binding to HER2. Besides designing therapeutics, AI-based generative models of biological data have been used for other purposes, such as accurate long-read DNA sequencing [13], to reduce the cost and increase the accuracy of DNA sequencing, or translation between single-cell genomics modalities [14], to allow exploration of the multimodal diversity of omics within a tissue.
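
As a concrete illustration of the in silico triage that typically follows generative molecule design, the sketch below uses RDKit to check that generated SMILES strings are chemically valid and fall within a loose drug-like property window. The example molecules and the thresholds are assumptions for illustration, not taken from the work cited above.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

# Pretend these came out of a generative model; one is deliberately invalid.
generated_smiles = ["CCO", "c1ccccc1C(=O)O", "not_a_molecule"]

candidates = []
for smi in generated_smiles:
    mol = Chem.MolFromSmiles(smi)   # returns None for invalid SMILES
    if mol is None:
        continue
    mw = Descriptors.MolWt(mol)
    logp = Descriptors.MolLogP(mol)
    # Keep molecules inside a loose Lipinski-like window.
    if mw < 500 and logp < 5:
        candidates.append((smi, mw, logp))

for smi, mw, logp in candidates:
    print(f"{smi}: MW={mw:.1f}, logP={logp:.2f}")
```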

Third, LLMs and generative AI models can boost existing AI models and provide an exciting framework for the seamless integration of heterogeneous data and concepts. Indeed, a remarkable feature shared by most deep-learning-based generative models, including LLMs, is that under the hood they represent any type of data in a uniform way, namely as a list of numbers (a vector in mathematical terms), often called the embedding of the data [15]. For example, to answer a question, ChatGPT first converts it from text to a vector embedding, and then generates an answer as a function of that embedding.
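
The sketch below shows what "representing text as a vector" looks like in practice, using the open-source sentence-transformers library; the model name is just one convenient example, and any embedding model would do.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # small example model

texts = [
    "What drugs target HER2?",
    "Trastuzumab is a monoclonal antibody against HER2.",
    "The weather in Paris is mild today.",
]
vectors = model.encode(texts)  # one embedding vector per text

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# The biomedical sentence should score closer to the question than the
# unrelated one, because the embeddings capture meaning, not just words.
print(cosine(vectors[0], vectors[1]))   # expected: high similarity
print(cosine(vectors[0], vectors[2]))   # expected: low similarity
```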

Representations learned by modern generative AI systems such as transformers for text data [16] or graph neural networks for small molecules [17] are extraordinarily powerful at capturing the information needed to generate a meaningful text or a relevant molecule, but they can also serve other purposes. In particular, through their ability to represent complex data as vectors, LLMs and generative AI models can serve as powerful sources of prior knowledge about the data that can be used to improve the performance of other machine learning systems. This is already happening in the field of neuro-symbolic representation learning, where a representation of genes or diseases is learned with deep representation learning from a knowledge graph that encodes a multitude of data about biology, and the learned representation is then used by more standard AI models to predict properties of genes or to infer gene–disease associations [18]. I anticipate that more applications of these ideas will emerge to improve AI models for diagnosis, prognosis or treatment response prediction from patient data.
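
A minimal sketch of this two-stage pattern follows: embeddings learned elsewhere are reused as input features for a standard classifier. The random vectors below stand in for knowledge-graph gene embeddings and the labels are synthetic; the point is the plumbing, not the numbers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_genes, dim = 1000, 64
# Stage 1 (assumed done elsewhere): per-gene embeddings from a KG model.
gene_embeddings = rng.normal(size=(n_genes, dim))
# Synthetic stand-in for known gene-disease association labels.
labels = rng.integers(0, 2, size=n_genes)

# Stage 2: a standard model consumes the embeddings as features.
clf = LogisticRegression(max_iter=1000).fit(gene_embeddings, labels)

new_gene = rng.normal(size=(1, dim))   # embedding of an unseen gene
print(clf.predict_proba(new_gene))     # predicted association probability
```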

Indeed, most patient data are very high dimensional (think of the millions of descriptors one typically uses to represent a medical image, a molecular profile or an electronic health record), and training accurate AI models from these data necessarily involves constraining the space of models with prior knowledge about the nature of the data. With their ability to capture complex and context-dependent representations of biological concepts, LLMs provide a guide for training AI models and making them more accurate and robust. Exactly how to implement this idea and how effective it will be remain largely open research questions, but a simple approach like transferring the representation of genes or diseases learned by an LLM to omics-based machine learning models is a promising direction.
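
One simple form such a transfer could take, sketched below under strong simplifying assumptions: summarize each patient's expression profile as the expression-weighted average of per-gene embedding vectors, and train downstream models on that low-dimensional summary rather than on raw counts. The embeddings here are random placeholders for vectors an LLM might provide.

```python
import numpy as np

rng = np.random.default_rng(1)
n_patients, n_genes, dim = 50, 20000, 128
expression = rng.random(size=(n_patients, n_genes))   # toy omics matrix
gene_embeddings = rng.normal(size=(n_genes, dim))     # placeholder gene priors

# Expression-weighted pooling: each patient becomes a dim-sized vector.
weights = expression / expression.sum(axis=1, keepdims=True)
patient_vectors = weights @ gene_embeddings           # shape (n_patients, dim)
print(patient_vectors.shape)   # 20,000-dim profiles reduced to 128 dims
```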

Fourth, it is tantalizing to think that LLMs' potential stretches far beyond even the complex technical tasks described above. Will they soon act as powerful assistants to scientists, or even as scientists in their own right? With their ability to store knowledge extracted from large amounts of data, including the scientific literature and internal research documents, LLMs may be able to reason and generate scientific hypotheses and discoveries, just as scientists do. Once LLMs become more sophisticated, there is hope that we will be able to ask them pertinent research questions, such as, "What would be a good novel drug for this group of patients with unmet medical need?"

Existing LLMs are far from mature enough for such tasks, though. In spite of promising results on many benchmarks, Galactica, an LLM for science, survived only three days online [19], and while ChatGPT quickly became a popular tool on the web, it is, like all LLMs, notorious for its tendency to 'hallucinate', that is, to invent facts that are neither grounded in data nor derived by logical deduction [20]. This is a major issue in scientific research, and whether it can be fixed in the future is a matter of heated debate in the AI community [21]. To address this issue, many efforts are under way to develop so-called augmented language models (ALMs), which combine the flexibility and scale of LLMs with additional mechanisms to improve their reasoning and reliability [22]. One mechanism of particular interest for science is to equip an LLM with the ability to automatically query and retrieve relevant information from a database in real time, which helps it generate text grounded in the real information from the database.
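
The retrieval mechanism is easy to sketch: embed a document store, retrieve the passages closest to the question, and splice them into the prompt so the answer is grounded in retrieved text. In the toy version below, `embed` is a placeholder returning random vectors, so the retrieval itself is not meaningful; in practice one would call a real embedding model and send the final prompt to an LLM API.

```python
import numpy as np

rng = np.random.default_rng(2)

def embed(text: str) -> np.ndarray:
    # Placeholder: substitute any real text-embedding model here.
    return rng.normal(size=16)

documents = [
    "Trastuzumab targets HER2 and is used in breast cancer.",
    "HER2 amplification occurs in a subset of gastric cancers.",
    "Aspirin inhibits cyclooxygenase enzymes.",
]
doc_vectors = np.stack([embed(d) for d in documents])

question = "Which cancers are treated with HER2-targeted antibodies?"
q = embed(question)
# Cosine similarity between the question and each document.
scores = doc_vectors @ q / (np.linalg.norm(doc_vectors, axis=1) * np.linalg.norm(q))
top = [documents[i] for i in np.argsort(scores)[::-1][:2]]

prompt = "Answer using only these facts:\n" + "\n".join(top) + f"\n\nQuestion: {question}"
print(prompt)  # this grounded prompt would then be sent to an LLM
```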

Interestingly, the database used to guide ALMs can in principle contain a large diversity of data, which opens many exciting opportunities for biomedical applications. For example, we may want to augment an LLM with a knowledge graph encoding all the knowledge we have about genes, diseases, drugs and their interactions, in order to ground the text generated by the LLM in that knowledge. Another fascinating direction would be to augment an LLM with the ability to query multimodal patient data when it answers questions and generates hypotheses. This could allow it to generate hypotheses grounded not only in scientific knowledge but also in patient data, and could enable the automated discovery of subgroups of patients likely to respond to a novel putative treatment.
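
A knowledge-graph variant of the same idea, again only as a toy sketch: look up triples about the entities mentioned in a question and prepend them to the prompt. The tiny triple store and the string-based entity matching below are illustrative assumptions, not a real biomedical knowledge graph.

```python
# A minimal in-memory triple store of (subject, predicate, object) facts.
knowledge_graph = [
    ("ERBB2", "encodes", "HER2 receptor"),
    ("trastuzumab", "binds", "HER2 receptor"),
    ("trastuzumab", "indicated_for", "HER2-positive breast cancer"),
]

def facts_about(entity: str) -> list[str]:
    """Return human-readable facts whose subject or object matches the entity."""
    return [f"{s} {p.replace('_', ' ')} {o}"
            for s, p, o in knowledge_graph
            if entity.lower() in (s.lower(), o.lower())]

question = "What does trastuzumab bind, and when is it used?"
context = facts_about("trastuzumab")
prompt = "Known facts:\n" + "\n".join(context) + f"\n\nQuestion: {question}"
print(prompt)  # grounded prompt for the LLM
```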

ChatGPT represents a landmark moment in the use of AI to disrupt, and hopefully benefit, humanity. While technologists, ethicists and regulators frantically debate the lasting effects of technologies like LLMs, it is becoming clear that drug discovery and development will be transformed. By automating time-consuming tasks, generating novel molecules and hypotheses, boosting the performance of existing predictive models and acting as a supercharged research assistant, existing generative AI models have already proven their transformative potential. In the future, more advanced LLMs will likely go even further and fundamentally alter the way in which we use AI in drug discovery and medical research. As in other fields, however, LLMs raise numerous ethical, legal and safety questions [23]. Besides the misinformation risks already mentioned above when models hallucinate, deploying these solutions in the pharmaceutical and medical fields requires that we be careful about other risks, such as information hazards associated with leakage of private information, as well as discrimination if LLMs reinforce biases present in the data they are trained on. While there is currently no simple solution to mitigate these risks, we should at a minimum be fully transparent about how models are built and validated, and report this systematically using templates like model cards [24], to ensure that scientific research remains grounded on solid foundations and that medical progress benefits all.

Authors
Jean-Philippe Vert, PhD