Alban fact checks ChatGPT’s views on how generative AI will impact drug discovery

Duration: 20 mins

Tags: AI / ML


Date: March 10th, 2023



Much has been made of the potential for generative AI, including large language models like ChatGPT, to disrupt drug discovery. But does ChatGPT itself know the full extent of this? I recently asked it a simple question: “how can LLMs such as ChatGPT have the biggest impact on drug discovery and development and what are their potential limitations?” 

The results were broadly accurate – but they barely scratched the surface. Below, I fact check ChatGPT’s views and offer what I hope is a more detailed and nuanced take on the future of generative AI and drug discovery.

ChatGPT: Large language models such as ChatGPT can have the most impact on drug discovery and development in several areas, including target identification, lead optimization, and repurposing of existing drugs.

One of the main challenges in drug discovery is identifying new compounds that can target specific biological pathways or molecules. Traditionally, this has involved a process of trial and error, with researchers screening large libraries of compounds for activity against the target of interest. However, this process is time-consuming, expensive, and often yields low success rates.

Alban: I would go even further – the combination of data and AI has the potential to cut the average pre-clinical development time in half.

ChatGPT: In target identification, LLMs can analyze large amounts of genomic and proteomic data to identify novel targets for drug development. By identifying the specific genes or proteins that are involved in disease processes, LLMs can help researchers to develop more targeted and effective therapies.

Alban: The immediate and most obvious impact of LLMs is to greatly increase the speed at which the existing scientific literature can be analyzed, uncovering unexploited avenues through improved knowledge graphs.
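To make the literature-mining idea concrete, here is a minimal sketch of how relation triples (hypothetically extracted by an LLM from abstracts) could be assembled into a small knowledge graph and traversed to surface indirect, "unexploited" connections. The entities, relations, and triples below are entirely illustrative, not real findings, and a production system would use a proper graph database and a real extraction pipeline.

```python
from collections import defaultdict, deque

# Hypothetical triples an LLM might extract from paper abstracts.
# All entity and relation names here are illustrative placeholders.
triples = [
    ("GeneA", "upregulates", "ProteinB"),
    ("ProteinB", "activates", "PathwayC"),
    ("PathwayC", "drives", "DiseaseD"),
    ("DrugX", "inhibits", "ProteinB"),
]

def build_graph(triples):
    """Adjacency list over entities; each edge carries its relation label."""
    graph = defaultdict(list)
    for subj, rel, obj in triples:
        graph[subj].append((rel, obj))
    return graph

def find_path(graph, start, goal):
    """Breadth-first search for a chain of relations linking two entities.

    Returns the list of (subject, relation, object) hops, or None.
    """
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None

graph = build_graph(triples)
# Surfaces the indirect DrugX -> ProteinB -> PathwayC -> DiseaseD chain,
# the kind of hidden connection no single paper states outright.
print(find_path(graph, "DrugX", "DiseaseD"))
```

The value of the LLM in this picture is not the graph traversal itself, which is trivial, but the extraction step that turns free-text literature into structured triples at scale.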

However, the potential of large transformer models is even greater when applied to non-textual data, like the modalities we focus on at Owkin. Generative models can significantly benefit the analysis of rich translational data such as H&E and omics data, which carry a massive imaging component.

Source: Images of drug discovery and development created by Sarah Ulstrup with the assistance of generative AI (Midjourney and DALL·E 2)

As I was writing this post, Meta released Galactica, a large language model that can store, combine and reason with scientific knowledge. This model, trained on literature, could potentially find hidden connections between different pieces of research and bring these insights to the surface.

It could synthesize knowledge by generating secondary content automatically, such as literature reviews, encyclopedia articles, lecture notes and more.

ChatGPT: In lead optimization, LLMs can help researchers to identify the most promising compounds for further study. By analyzing large datasets of chemical structures and associated biological activity data and metabolic pathways, LLMs can identify potential modifications that could improve its efficacy or reduce its toxicity. This can help researchers to more efficiently optimize drug candidates and increase their chances of success in clinical trials.

Alban: This is where companies that are specifically focused on in-silico drug design using an existing target, such as Aqemia, Abcellera, Schrödinger, or Insilico, can leverage LLMs.

ChatGPT: In drug repositioning, LLMs can help researchers to identify existing drugs that may have the potential for new uses. By analyzing large amounts of clinical and pharmacological data, LLMs can identify drugs that have been approved for one indication but may be effective in treating other diseases.

Alban: A fascinating direction would be to give transformer models the ability to query multimodal patient data (like the data on which we at Owkin train our models) as they generate text.

This could reinforce existing AI models that generate hypotheses grounded in patient data, for example by making them more efficient at discovering the subgroups of patients most likely to show an optimal response to a given drug.
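The subgroup idea can be sketched in a few lines. The toy records below are entirely synthetic, and real subgroup discovery involves far richer multimodal features and proper statistics; this only shows the basic shape of the computation, stratifying response rates by a candidate biomarker.

```python
from statistics import mean

# Toy, fully synthetic patient records: a biomarker status
# and an observed binary response to a hypothetical drug.
patients = [
    {"biomarker": "mut", "response": 1},
    {"biomarker": "mut", "response": 1},
    {"biomarker": "mut", "response": 0},
    {"biomarker": "wt",  "response": 0},
    {"biomarker": "wt",  "response": 1},
    {"biomarker": "wt",  "response": 0},
]

def response_rate_by_subgroup(patients, key="biomarker"):
    """Group patients by a candidate feature and compute response rates."""
    groups = {}
    for p in patients:
        groups.setdefault(p[key], []).append(p["response"])
    return {g: mean(vals) for g, vals in groups.items()}

rates = response_rate_by_subgroup(patients)
best = max(rates, key=rates.get)  # subgroup with the highest response rate
```

In the vision described above, an LLM would not replace this computation but would propose which features and subgroups are worth testing, grounded in both literature and patient data.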


ChatGPT: One of the main limitations is the quality of the data that LLMs are trained on. LLMs rely on large datasets of chemical structures and associated biological activity data to learn to recognize patterns and make predictions. However, if the data is incomplete or inaccurate, this can lead to errors and incorrect predictions. In addition, the bias in the training data can also limit the accuracy and generalizability of the LLMs.

Alban: However, LLMs can help upgrade datasets by enabling better data harmonization across multiple data centers, which currently remains largely dependent on human data analysis.
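As a concrete illustration of harmonization, the sketch below applies a field mapping to rename heterogeneous column names from different centers into one canonical schema. The schema, field names, and mapping are hypothetical; the point is that proposing such a mapping is exactly the tedious human step an LLM could automate, while applying it remains simple deterministic code.

```python
# Hypothetical canonical schema for a multi-center dataset.
canonical = {"age_years", "sex", "tumor_stage"}

# The mapping a data analyst (or, speculatively, an LLM reading the
# centers' data dictionaries) might propose. Illustrative only.
proposed_mapping = {
    "Age": "age_years", "age": "age_years", "patient_age": "age_years",
    "Sex": "sex", "gender": "sex",
    "Stage": "tumor_stage", "tnm_stage": "tumor_stage",
}

def harmonize(record, mapping):
    """Rename one center's fields to the canonical schema, dropping unknowns."""
    return {mapping[k]: v for k, v in record.items() if k in mapping}

center_a = {"Age": 61, "gender": "F", "Stage": "II"}
center_b = {"patient_age": 54, "Sex": "M", "tnm_stage": "III"}
print(harmonize(center_a, proposed_mapping))
```

Real harmonization also has to reconcile units, coding systems, and missingness conventions, which is where context-aware models could add the most value over fixed lookup tables.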

Furthermore, with their ability to capture complex, context-dependent representations of biological concepts, LLMs provide an exciting source of knowledge to guide how we train AI models, making them more accurate, more robust, and grounded in the causal interactions captured by the LLM's representations.

It is worth noting that ChatGPT missed an opportunity here to state how LLMs could actually help with the challenge, rather than simply pointing out the limitations based on data quality.

ChatGPT: Another limitation is the need for expert domain knowledge to properly interpret the outputs of LLMs. LLMs can be complex and difficult to interpret, particularly when it comes to understanding the specific features or interactions that are driving their predictions.

Alban: As such, expert domain knowledge is necessary to properly interpret the outputs of LLMs and to ensure they are applied appropriately. LLMs are notorious for their tendency to hallucinate, that is, to invent facts that are neither grounded in data nor the result of any logical deduction, which is unacceptable in most scientific research.

To address this issue, many promising efforts are ongoing to develop so-called augmented language models (ALMs).
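The core pattern behind augmented language models is retrieve-then-ground: fetch supporting evidence first, and refuse to answer when none exists. The sketch below uses a toy keyword-overlap retriever over a two-sentence corpus as a stand-in for a real retrieval system; the corpus, threshold, and function names are all illustrative.

```python
# Toy corpus standing in for an indexed literature/evidence store.
# The sentences are illustrative, not real scientific claims.
corpus = [
    "Drug X inhibits protein B in preclinical models.",
    "Pathway C is implicated in disease D progression.",
]

def retrieve(query, corpus, min_overlap=2):
    """Return documents sharing at least min_overlap tokens with the query.

    A real ALM would use dense embeddings; keyword overlap keeps the
    grounding logic visible.
    """
    q = set(query.lower().split())
    return [doc for doc in corpus
            if len(q & set(doc.lower().split())) >= min_overlap]

def grounded_answer(query, corpus):
    """Answer only when retrieval surfaces supporting evidence."""
    evidence = retrieve(query, corpus)
    if not evidence:
        return "No supporting evidence found; declining to answer."
    return "Answer grounded in: " + " | ".join(evidence)

print(grounded_answer("Does drug X inhibit protein B", corpus))
print(grounded_answer("Tell me about unicorn therapy", corpus))
```

The refusal branch is the important part: it is the mechanism by which augmentation curbs the hallucination problem described above, trading coverage for verifiability.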

ChatGPT: Overall, LLMs have the potential to transform the drug discovery and development process by providing a more efficient and effective means of identifying potential drug candidates, optimizing their properties, and repurposing existing drugs. By leveraging the power of machine learning, researchers can more quickly and accurately develop novel therapies to treat a wide range of diseases.

Alban: As always when a potentially disruptive new technology comes to light, there is a period of hype followed by more muted commentary. It is often during that period that early adopters go to their garage, try to 'make it work,' and optimize. This is what we have already started doing at Owkin, recognizing both the great potential of LLMs and their limitations.