FedPyDESeq2: a federated framework for bulk RNA-seq differential expression analysis
Abstract
Despite opportunities for deeper clinical insights, large-scale transcriptome studies are often limited by data silos and the risk of privacy leakage. While meta-analysis can be used to aggregate local results, it comes at the cost of lower statistical power, especially in heterogeneous settings.
Federated learning (FL), a recent paradigm in distributed computing, fits models on siloed data while ensuring that private data does not leave its storage facilities. Here, we introduce FedPyDESeq2, a software package for differential expression analysis (DEA) on siloed bulk RNA-seq data. Building on FL tools, FedPyDESeq2 implements the DESeq2 pipeline for DEA on siloed datasets in a privacy-enhancing manner.
We benchmark FedPyDESeq2 on datasets from The Cancer Genome Atlas corresponding to 8 different indications, split by geographical origin. FedPyDESeq2 achieves near-identical results on siloed data compared with DESeq2 on pooled data, and significantly outperforms meta-analysis baselines on siloed data.
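To make the federated principle mentioned above concrete, the following minimal sketch shows how a pooled statistic can be recovered from per-silo summaries without moving raw counts. It is an illustration only and does not reproduce the FedPyDESeq2 protocol or API: the function names `local_summary` and `aggregate` and the toy count matrices are hypothetical, and a per-gene mean stands in for the far more involved DESeq2 steps.

```python
# Illustrative sketch only: NOT the FedPyDESeq2 protocol or API.
# All names (local_summary, aggregate) and the toy data are hypothetical.

def local_summary(counts):
    """Run inside a silo: reduce raw counts (samples x genes) to per-gene sums."""
    n_samples = len(counts)
    gene_sums = [sum(col) for col in zip(*counts)]
    return n_samples, gene_sums


def aggregate(summaries):
    """Run on the server: combine silo summaries into pooled per-gene means."""
    total_n = sum(n for n, _ in summaries)
    total_sums = [sum(vals) for vals in zip(*(s for _, s in summaries))]
    return [s / total_n for s in total_sums]


# Toy count matrices held by two silos (rows = samples, columns = genes).
silo_a = [[10, 0, 5], [12, 2, 7]]
silo_b = [[8, 1, 6], [9, 3, 4], [11, 0, 8]]

# Only the summaries leave each silo, never the raw count matrices.
federated_means = aggregate([local_summary(silo_a), local_summary(silo_b)])

# Reference computation on pooled data, for comparison only.
pooled = silo_a + silo_b
pooled_means = [sum(col) / len(pooled) for col in zip(*pooled)]
```

In this toy case the federated and pooled means coincide exactly, which mirrors in miniature the kind of agreement with pooled analysis that the benchmark reports. Shared aggregates can still carry information about individual samples, so a scheme of this kind is privacy-enhancing rather than a formal privacy guarantee.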