PyDESeq2: a new tool in the Python omics ecosystem
Bulk RNA sequencing (RNA-seq) is a technology widely used to study gene expression at the transcript level. One of the key steps in RNA-seq data analysis is differential expression analysis, which aims to identify genes that are differentially expressed between two or more conditions. When bulk RNA sequencing (RNA-seq) was introduced, most bioinformatic software was developed in the R programming language. Therefore, several of the popular software tools for bulk RNA-seq differential expression analysis are written in R, such as DESeq2, edgeR, and Limma. These tools have been extensively tested and validated, and are widely used in the scientific community.
As the field has grown, so have the programming languages that support this kind of analysis. One of these is Python, a programming language for scientific computing and data analysis with a large and active community of users and developers. Python offers several advantages, such as several libraries and tools for handling large datasets, statistical analysis, and visualization, making it a natural choice for bulk RNA-seq data analysis.
Today, there is no available python-native package for state-of-the-art differential expression analysis with generalised linear models on bulk RNA sequencing data. Researchers who prefer to use Python for their data analysis workflows can use workarounds, but with the two programming languages being inherently different there can be challenges in integrating R and Python workflows.
Introducing PyDESeq2: a Python implementation of the popular R package DESeq2
To support the growing community of Python users in the field of bioinformatics, a team of data scientists at Owkin, led by Boris Muzellec, developed PyDESeq2, a python implementation of the bulk RNA-seq DEA methodology introduced by Love et al. (2014) based on the R package DESeq2.
PyDESeq2 allows users to perform DESeq2 analysis in Python, rather than having to switch to R for this step. It provides a Python alternative to the DESeq2 functions, allowing users to perform normalization and differential expression analysis in Python.
PyDESeq2 is designed to be easy to use, with an API similar to DESeq2 in R, and with a focus on reproducibility and ease of integration into existing Python workflows. PyDESeq2 has several advantages over using DESeq2 directly in R, including the ability to easily integrate with other Python libraries and tools, and the ability to use Python's parallel processing capabilities for efficient analysis of large datasets.
PyDESeq2 is a useful tool for researchers who prefer to work in Python or who have existing Python workflows in which they would like to integrate RNA-seq analysis.
Implementation of PyDESeq2
PyDESeq2 is based on the DEA methodology developed by Love et al. (2014) to analyze gene expression data. This method involves modelling raw counts using a negative binomial distribution and estimating dispersion parameters for each gene using a negative binomial generalized linear model (GLM). These parameters are then adjusted towards a global trend curve to improve accuracy.
PyDESeq2 offers a range of features that correspond to the default settings of DESeq2. These features include DEA for single-factor and bi-level multi-factor designs using Wald tests, and LFC shrinkage using the apeGLM prior (Zhu et al., 2019). The software package contains two classes of objects: the DeseqDataSet class, which handles data modelling steps from normalization to LFC fitting, and the DeseqStats class, which performs statistical tests and, optionally, LFC shrinkage.
PyDESeq2 relies on the widely-used Python packages scipy (Virtanen et al., 2020) and statsmodels (Seabold and Perktold, 2010) to fit GLMs, making this software a powerful and flexible tool for researchers using python to study gene expression data.
PyDESeq2 performance is comparable to DESeq2
In our paper we put PyDESeq2 and DESeq2 to the test on eight datasets from the Cancer Genome Atlas (TCGA). Our goal was to ensure that PyDESeq2 performance was comparable to the one of DESeq2. We compared the results of these two tools when analyzing differential expression between tissue samples corresponding to advanced and non-advanced tumor grades, based on TCGA's clinical data. We focused on four key criteria: retrieved genes, enriched pathways obtained with the fgsea package (Sergushichev, 2016), model likelihood, and speed.
The results, shown in Fig. 1, speak for themselves. PyDESeq2 returned similar sets of significant genes and pathways to DESeq2, with similar likelihood for dispersion and log-fold change parameters on most genes. The speed was also comparable, with PyDESeq2 performing slightly better on larger datasets and slightly worse on smaller ones.
If you're interested in learning more about our experiments, you can find detailed information in the supplementary material of our paper. All the data we used is publicly available on the TCGA Research Network website so that you can replicate our results.
Conclusion and future perspectives
By open-sourcing PyDESeq2 we hope to add a tool to the Python omics ecosystem, and to provide a convenient and efficient solution for researchers looking to perform differential expression analysis on bulk RNA-sequencing data in Python.
After the initial release, we are working to add new features to PyDESeq2. Some of the upcoming features include VST normalization, and extended support for multi-factor designs.
PyDESeq2 is released as an open-source software under the MIT license. It can be directly installed from PyPI. The source code is available on GitHub, and the documentation can be found here.
The PyDESeq2 package is also integrated within the scverse ecosystem.