Bridging the gap between federated learning theory and practice with real-world healthcare datasets
Federated learning (FL) is an approach enabling several institutions holding sensitive data to collaboratively train machine learning (ML) models at scale without pooling data. It is particularly promising in the field of healthcare, as it protects patient confidentiality and proprietary data by design: only the models travel while the datasets remain locally stored in distinct sites. A key benefit of FL is the ability to securely train models on larger and more diverse datasets, which improves model performance and accuracy and helps minimise potential biases.
Federated networks of private and public stakeholders allow scientists to gather enough data to tackle open questions on poorly understood diseases such as Covid-19 or triple negative breast cancer for the first time. By producing models that are more representative of the heterogeneity inherent to us all, FL unlocks the potential to help match the right treatment to the right patient at the right time.
Although federated learning may well power the future of collaborative healthcare research, many challenges are encountered when putting theory into practice. Real-world FL projects require a consortium of partners willing to cooperate, along with considerable IT setup and maintenance costs. These factors, among others, have slowed federated learning research progress in the healthcare field compared to other applications of FL in mobile phones, wearable devices and finance.
Solving the challenge of applying federated learning in realistic healthcare settings
Before launching FL research projects on patient data, researchers need to refine their algorithms using public datasets on simulated partners to conduct initial experiments and find the most effective approaches. Researchers do this because many unforeseen issues can arise when training the model, and they want to avoid a model that is unsuited to the task, for instance because it overfits the data from one of the partners, or because it is so generic that it is not useful for any of the centers.
There is some research proposing representative datasets for cross-device federated learning, but few open, curated and ready-to-use healthcare decentralized datasets exist. As a consequence, researchers usually rely on heuristics to artificially generate heterogeneous data partitions from a single dataset and assign them to hypothetical data owners.
Example of an artificially split dataset.
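To make the idea of a heuristic split concrete, here is a minimal sketch of one common heuristic, Dirichlet label skew, written in plain Python. It is an illustration only: the function name dirichlet_partition, its parameters and the toy labels are assumptions made for this example and do not come from any specific project.

```python
import random
from collections import defaultdict


def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across simulated clients with Dirichlet label skew.

    Smaller alpha gives more skewed (more heterogeneous) clients.
    Hypothetical helper written for illustration only.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    clients = [[] for _ in range(num_clients)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Sample Dirichlet(alpha) proportions via normalized Gamma draws.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(draws)
        start = 0
        cumulative = 0.0
        for client_id, draw in enumerate(draws):
            cumulative += draw / total
            # The last client takes the remainder so every index is assigned once.
            if client_id == num_clients - 1:
                end = len(indices)
            else:
                end = round(cumulative * len(indices))
            clients[client_id].extend(indices[start:end])
            start = end
    return clients


# Toy example: 300 samples, 3 balanced classes, 4 simulated data owners.
labels = [i % 3 for i in range(300)]
parts = dirichlet_partition(labels, num_clients=4, alpha=0.5)
for client_id, part in enumerate(parts):
    counts = [sum(1 for i in part if labels[i] == c) for c in range(3)]
    print(f"client {client_id}: class counts {counts}")
```

A split like this only controls label proportions. It says nothing about differences in acquisition hardware or patient populations between sites.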
Because medical data is highly heterogeneous, synthetically splitting a single dataset into multiple artificial centers may not accurately replicate potential scenarios when moving from testing to deployment. Evaluating FL algorithms on datasets with natural data owner splits is a much safer approach to ensuring that new methodological FL research addresses real-world issues.
For example, in digital histopathology data used in cancer research, clinicians extract tissue samples (e.g. a resected tumor biopsy) from patients, which are then preserved, stained and digitized by a pathologist. There are many steps along this journey from sample to data that differ from hospital to hospital: patient demographics, staining techniques, storage methodologies of the physical slides, and finally the digitization process itself. This heterogeneity is nearly impossible to replicate with synthetic partitioning of datasets, calling for experiments across different research centers leveraging real patient data.
Source: The iWildCam 2020 Competition Dataset. Beery, S, Cole, E & Gjoka, A. ArXiv:2004.10340. 2020
This observation is valid across nearly every type of data in the healthcare field. Survival analysis data can vary widely based on geography and other demographic data. In radiology, when taking CT scans of an adrenal adenoma (a common benign tumor that forms in the adrenal gland) there can be slight changes in attenuation with different manufacturers’ equipment, which leads to statistically different imaging results.
In dermoscopy, as in histopathology, an image of a skin disease requires precise color rendering for the condition to be identifiable. Similar processes are also needed for retinal scans to diagnose eye diseases. The resulting images can differ greatly depending on the digitization hardware and software used, among other factors.
Source: Dermoscopic Image Analysis for ISIC Challenge 2018. Zou, J, Ma, X, Zhong, C, Zhang, Y. https://arxiv.org/pdf/1807.08948.pdf. 2018
It is incredibly difficult to access high volumes of healthcare data, but without access to large and diverse datasets, researchers cannot build accurate and robust statistical models - preventing federated learning from reaching its full potential. So how can researchers run simulations to test new federated learning approaches in a way that will replicate real-world healthcare settings?
Introducing FLamby: the world’s largest open-source federated learning-ready datasets tackling multiple healthcare data modalities
Owkin researchers collaborated with nine of the world’s leading experts in federated learning alongside their PhD students and PostDocs to create FLamby (Federated Learning AMple Benchmark of Your cross-silo strategies), a suite of FL ready datasets split in a natural fashion, resurfaced from the centers from which the data originated. It is the most extensive open and ready-to-use benchmark for cross-silo FL in healthcare ever created, designed to bridge the gap between federated learning theory and practice.
This open-source federated cross-silo healthcare dataset suite covers seven key areas of healthcare data critical to research today:
CT scans (two different types)
MRI data (T1-weighted imagery)
Overview of the datasets, tasks, metrics and baseline models in FLamby.
Dataset and baseline model
These datasets cover different tasks (classification / segmentation / survival) in multiple application domains with different data modalities and scale. For each dataset, the suite provides documentation, metadata and helper functions to:
Download the original pooled dataset
Apply preprocessing if required, making it suitable for ML training
Split each original pooled dataset between its natural data owners
Easily iterate over the dataset with natural splits, and modify these splits if needed
The data suite is modular, with a standardized simple application programming interface (API) for each component. Crucially, all datasets are partitioned using natural splits and each is accompanied by baseline training code, making it easy to deploy in a wide range of applications. As the API relies on the commonly used PyTorch machine learning framework, it’s simple to iterate over the dataset and modify these natural splits if needed.
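As a rough illustration of what iterating over natural splits can look like, the sketch below uses a plain-Python stand-in for a PyTorch-style map dataset (an object exposing __len__ and __getitem__). The class CenterSplitDataset, its center, train and pooled arguments and the toy records are hypothetical names chosen for this example. They are not taken from the FLamby code, so please refer to the repository for the actual interface.

```python
class CenterSplitDataset:
    """Hypothetical map-style dataset filtered by the center that produced each record."""

    def __init__(self, records, center=0, train=True, pooled=False):
        # records: list of dicts with keys "center", "split", "x" and "y".
        wanted_split = "train" if train else "test"
        self.samples = [
            (r["x"], r["y"])
            for r in records
            if r["split"] == wanted_split and (pooled or r["center"] == center)
        ]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        return self.samples[index]


# Toy records standing in for a downloaded and preprocessed pooled dataset.
records = [
    {
        "center": c,
        "split": "train" if i % 4 else "test",
        "x": [float(i), float(c)],
        "y": i % 2,
    }
    for c in range(3)
    for i in range(8)
]

# One training dataset per natural data owner, plus the pooled view.
per_center = [CenterSplitDataset(records, center=c, train=True) for c in range(3)]
pooled = CenterSplitDataset(records, train=True, pooled=True)
print([len(ds) for ds in per_center], len(pooled))
```

Because each object follows the map-style protocol, it could in principle be wrapped in a standard PyTorch DataLoader. Changing the split then amounts to changing the filter on the center key.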
Federated learning strategies and benchmark
To illustrate the potential power of these datasets, standard federated learning algorithms - called strategies in the FLamby suite - have been benchmarked across all datasets. In order to be agnostic to FL frameworks, these strategies are built in plain Python code. We also provide guidelines to help compare FL strategies in a fair and reproducible manner and include illustrative results.
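For readers unfamiliar with what such a strategy looks like, here is a minimal sketch of federated averaging, one of the standard FL algorithms, in plain Python. It is not the FLamby implementation: the functions local_sgd and fed_avg, the one-dimensional linear model and the simulated clients are all assumptions made to keep the example self-contained.

```python
import random


def local_sgd(weights, data, lr, steps):
    """Run a few passes of per-sample SGD for y ~ w * x + b on one client."""
    w, b = weights
    for _ in range(steps):
        for x, y in data:
            error = (w * x + b) - y
            w -= lr * error * x
            b -= lr * error
    return w, b


def fed_avg(client_data, rounds=50, lr=0.05, local_steps=2):
    """Federated averaging: local training, then a sample-size-weighted mean."""
    global_weights = (0.0, 0.0)
    total = sum(len(d) for d in client_data)
    for _ in range(rounds):
        # Each client starts from the current global model; only weights are shared.
        updates = [local_sgd(global_weights, d, lr, local_steps) for d in client_data]
        global_weights = tuple(
            sum(len(d) * u[k] for d, u in zip(client_data, updates)) / total
            for k in range(2)
        )
    return global_weights


rng = random.Random(0)
# Three simulated clients, each covering a different input range
# (a crude stand-in for heterogeneity between centers).
client_data = [
    [
        (x, 2.0 * x + 1.0 + rng.gauss(0, 0.01))
        for x in [rng.uniform(lo, lo + 1) for _ in range(20)]
    ]
    for lo in (-1.0, 0.0, 1.0)
]
w, b = fed_avg(client_data)
print(f"w={w:.2f}, b={b:.2f}")
```

The raw data points never leave their client list in this sketch. The server-side step only sees the locally trained weights, which mirrors the "only the models travel" principle described earlier.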
We have made open source code accessible for easy reproducibility and integration in different federated learning frameworks and existing libraries, such as Substra, FedML, and Fed-BioMed. In this way, we are connecting past and subsequent work in the field for better monitoring of the progress in cross-silo FL research. By providing the machine learning community with the tools they need, we welcome them to contribute to FLamby development by adding more datasets, benchmarking types and FL strategies in future.
Supporting the emergence of faster, more robust federated learning strategies
The FLamby repository is open to everyone, ready to be leveraged in a wide range of research applications. By sharing this standardized federated learning-ready data, we aim to build a collaborative community that accelerates the use of federated learning strategies in a wide range of healthcare applications. In launching a dataset of this unprecedented size and scale, we hope that FLamby will accelerate the emergence of new ideas and more robust strategies in the field of FL, much as ImageNet did in the past for computer vision.
Deep heterogeneity of FLamby datasets. Best seen in color. 1a: Color histograms per client. 1b, 1c and 1d: Voxel intensity distribution per client. 1e: Kaplan-Meier survival curves per client. 1f: UMAP of deep network features of the raw images, colored by client. 1g: Per-client histograms of several features. Differences between client distributions are sometimes obvious and sometimes subtle. Some clients are close in the feature space, some are not and different types of heterogeneity are observed with different data modalities.
This groundbreaking collaboration has been consolidated in a research paper accepted at NeurIPS 2022 (dataset & benchmark track): “FLamby: Datasets and Benchmarks for Cross-Silo Federated Learning in Realistic Healthcare Settings.”
FLamby was made possible thanks to the support of the following institutions:
We would like to thank all participants in the FLamby project for their contributions:
Jean Ogier du Terrail, Senior Machine Learning Scientist at Owkin
Samy-Safwan Ayed, Research Intern at INRIA
Edwige Cyffers, PhD Student at INRIA
Felix Grimberg, PhD Student at EPFL
Chaoyang He, Doctoral Researcher at University of Southern California, Co-Founder & CTO of FedML
Regis Loeb, Data Scientist at Owkin
Paul Mangold, PhD Student at INRIA
Tanguy Marchand, Senior Data Scientist at Owkin
Othmane Marfoq, PhD Student at INRIA and Accenture Labs
Erum Mushtaq, PhD Student and Graduate Research Assistant at Information Theory and Machine Learning Laboratory at University of Southern California
Boris Muzellec, Federated Learning Data Scientist at Owkin
Constantin Philippenko, PhD Student at Ecole Polytechnique
Santiago-Smith Silva-Rincon, Researcher at INRIA
Maria Telenczuk, Data Scientist at Owkin
Shadi Albarqouni, Professor of Computational Medical Imaging Research at the University of Bonn
Salman Avestimehr, Dean’s Professor and Director of USC-Amazon Center on Trustworthy AI at University of Southern California
Aurélien Bellet, Researcher at INRIA
Aymeric Dieuleveut, Assistant Professor at Ecole Polytechnique
Martin Jaggi, Tenure-Track Assistant Professor at EPFL
Sai Praneeth Karimireddy, SNSF postdoc at UC Berkeley
Marco Lorenzi, Research Scientist at INRIA
Giovanni Neglia, Researcher at INRIA
Marc Tommasi, Professor in Computer Science at Lille University
Mathieu Andreux, Federated Learning Group Lead at Owkin