Federated learning: Owkin pioneering real-world applications at NeurIPS 2022
2022 was a momentous year for machine learning science. We saw Meta AI's HTPS deep learning model solve several International Math Olympiad problems, while multiple breakthroughs clearly demonstrated that the future of healthcare research is powered by AI. Google's DeepNull modeled the effect of covariates on phenotypes, and Isomorphic Labs, DeepMind's sister company, announced its ambition to apply a radically AI-first approach to biology research. Owkin scientists continued to discover and develop better treatments for unmet medical needs, improving clinical trials and launching two AI-powered diagnostic tools.
To finish off 2022, the international machine learning (ML) community descended upon New Orleans for the NeurIPS conference. With over 2,500 papers accepted through the most competitive review process in ML, the conference served, as always, both as a launching pad for cutting-edge research and as a comprehensive overview of where the field as a whole is heading. Owkin's federated learning research team shares its reflections and highlights from the conference.
Owkin had two papers accepted at NeurIPS this year: FLamby: Datasets and Benchmarks for Cross-Silo Federated Learning in Realistic Healthcare Settings and SecureFedYJ: a safe feature Gaussianization protocol for Federated Learning. We also announced the open-sourcing of Substra (formerly Owkin Connect) - the world's most proven federated learning software designed for healthcare research - now hosted by the Linux Foundation's LF AI & Data Foundation.
We hope to unleash a new wave of collaborative research by releasing real-world-proven software, datasets, and normalization methodologies that help data scientists overcome the challenges that have historically slowed progress in federated learning research.
Federated learning is one of the hottest fields in machine learning today
Some key themes emerged from this year's conference. One is that federated learning is coming to the forefront: it was among the top five keywords in accepted paper titles, and the opening keynote of the conference highlighted it as one of the top three emerging fields in machine learning.
The top five federated learning papers of NeurIPS 2022
On the industry side, IBM presented an expo session on its federated learning library for enterprise environments, which enables secure FL through fully homomorphic encryption (via pyhelayers).
It is challenging to choose the most standout papers from over 2,500 possibilities, but here are our top five:
- FedPop: A Bayesian Approach for Personalised Federated Learning, Nikita Kotelevskii, Maxime Vono, Alain Durmus, Eric Moulines
- EF-BV: A Unified Theory of Error Feedback and Variance Reduction Mechanisms for Biased and Unbiased Compression in Distributed Optimization, Laurent Condat, Kai Yi, Peter Richtárik
- On Sample Optimality in Personalized Federated and Collaborative Learning, Mathieu Even, Laurent Massoulié, Kevin Scaman (Inria Paris, DI-ENS, MSR-Inria Joint Centre)
- TCT: Convexifying Federated Learning using Bootstrapped Neural Tangent Kernels, Yaodong Yu, Alexander Wei, Sai Praneeth Karimireddy, Yi Ma, Michael I. Jordan (University of California, Berkeley)
- The Fundamental Price of Secure Aggregation in Differentially Private Federated Learning, Peter Kairouz (Google Research)
Transformers everywhere: large language models, diffusion, causality and potential implications for drug discovery
Beyond federated learning, we saw other major themes emerging with potential for huge impact in machine learning for healthcare research.
Large language models built on the transformer architecture are rapidly improving and are being further refined by 'chain-of-thought' prompting, which enables models to decompose multi-step problems into intermediate steps. We have seen enormous growth in this field, powered by a blend of open-source and commercial offerings such as Hugging Face and Cohere. Even the broader public is getting interested through OpenAI, whose internet traffic has grown by over 700% this past year thanks to its ChatGPT research release.
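To illustrate, a chain-of-thought prompt simply asks the model to spell out its intermediate reasoning before answering. The toy example below is our own, not taken from any paper:

```
Q: A hospital has 3 wards with 12 beds each. 29 beds are occupied.
   How many beds are free? Let's think step by step.
A: 3 wards x 12 beds = 36 beds in total. 36 - 29 = 7, so 7 beds are free.
```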
Large language models can now integrate different kinds of data from natural language, computer vision, and any other kind of sequence, including genomics data. This is all thanks to the underlying transformer architecture, which offers more flexibility and expressive power than the previously state-of-the-art ConvNets. In the NeurIPS expo talk track, Lindsay Edwards from Relation Therapeutics explored how these cutting-edge machine learning approaches, in combination with single-cell technologies, can be used in drug target discovery research. In the future, this could mean that LLMs decode the cryptic language of DNA and uncover deep truths about how gene interactions are linked to disease.
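As a concrete, purely illustrative sketch of this idea, the PyTorch snippet below treats a DNA sequence as a sentence of k-mer tokens and encodes it with an off-the-shelf transformer encoder. The k-mer size, vocabulary, and model dimensions are our own assumptions, not taken from any NeurIPS paper:

```python
# Illustrative sketch: a DNA sequence as a "sentence" of k-mer tokens
# fed to a generic transformer encoder. All hyperparameters are
# arbitrary choices for demonstration.
import itertools
import torch
import torch.nn as nn

K = 6  # k-mer length (assumption)
VOCAB = {"".join(p): i for i, p in enumerate(itertools.product("ACGT", repeat=K))}

def tokenize(seq: str) -> torch.Tensor:
    """Split a DNA string into non-overlapping k-mers and map them to ids."""
    kmers = [seq[i:i + K] for i in range(0, len(seq) - K + 1, K)]
    return torch.tensor([VOCAB[k] for k in kmers]).unsqueeze(0)  # (1, num_kmers)

embed = nn.Embedding(len(VOCAB), 128)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=128, nhead=8, batch_first=True),
    num_layers=4,
)

tokens = tokenize("ACGTACGTTGCATGCAAACGTGCA")
contextual = encoder(embed(tokens))  # (1, num_kmers, 128) contextual embeddings
print(contextual.shape)
```

In practice, such an encoder would be pre-trained on large genomic corpora before being fine-tuned on downstream tasks, mirroring the NLP recipe.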
Read our top NeurIPS paper picks on large language models:
- Flamingo: a Visual Language Model for Few-Shot Learning
- Chain of Thought Prompting Elicits Reasoning in Large Language Models
- Memorization Without Overfitting: Analyzing the Training Dynamics of Large Language Models
In NVIDIA and OpenAI's workshop on the landscape of deep generative learning, researchers explored diffusion models (also known as score-based methods) as the state-of-the-art technique for generating high-quality and diverse images. Generative adversarial networks (GANs) no longer hold the state of the art in generative modeling; diffusion models are proving better suited to the task.
Although first applied primarily to images, the real-world applications discussed at the conference ranged from computer vision to signal processing and computational chemistry. The recent immense success of these approaches in text-to-image generation could well spill over into medicine.
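For intuition, here is a toy, self-contained sketch of the DDPM-style training objective at the heart of these models: noise the data along a fixed schedule and train a network to predict the injected noise. The schedule, network, and 2-D toy data are assumptions for illustration, not any paper's exact setup:

```python
# Toy sketch of DDPM-style training: corrupt data with scheduled
# Gaussian noise, train a network to predict that noise. At sampling
# time, the learned predictor is applied step by step to turn pure
# noise back into data.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # linear noise schedule (assumption)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # fraction of signal kept at step t

# Tiny noise-prediction network on 2-D points; input is (x_t, t/T).
model = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 2))

def forward_noise(x0, t):
    """q(x_t | x_0): corrupt clean samples with Gaussian noise at step t."""
    a = alphas_bar[t].unsqueeze(1)               # (batch, 1)
    eps = torch.randn_like(x0)
    return a.sqrt() * x0 + (1 - a).sqrt() * eps, eps

# One training step: the network learns to predict the injected noise.
x0 = torch.randn(64, 2)                          # stand-in for real data
t = torch.randint(0, T, (64,))
xt, eps = forward_noise(x0, t)
eps_hat = model(torch.cat([xt, t.unsqueeze(1) / T], dim=1))
loss = ((eps_hat - eps) ** 2).mean()             # simplified DDPM objective
loss.backward()
```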
Read our top NeurIPS paper picks on diffusion models:
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding
- Torsional Diffusion for Molecular Conformer Generation
Owkin federated learning research accepted at NeurIPS
FLamby: Datasets and benchmarks for cross-silo federated learning in realistic healthcare settings
Federated Learning (FL) is a novel approach enabling several clients holding sensitive data to collaboratively train machine learning models, without centralizing data. The cross-silo FL setting corresponds to the case of a few (2–50) reliable clients, each holding medium to large datasets, and is typically found in applications such as healthcare, finance, or industry. While previous works have proposed representative datasets for cross-device FL, few realistic healthcare cross-silo FL datasets exist, thereby slowing algorithmic research in this critical application.
In this work, we propose a novel cross-silo dataset suite focused on healthcare, FLamby (Federated Learning AMple Benchmark of Your cross-silo strategies), to bridge the gap between the theory and practice of cross-silo FL. FLamby encompasses 7 healthcare datasets with natural splits, covering multiple tasks, modalities, and data volumes, each accompanied by baseline training code. As an illustration, we additionally benchmark standard FL algorithms on all datasets. Our flexible and modular suite allows researchers to easily download datasets, reproduce results, and re-use the different components for their research. FLamby is available at https://github.com/owkin/flamby.
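To give a flavor of the kind of strategy FLamby benchmarks, here is a minimal FedAvg simulation in PyTorch. This is a generic sketch and not FLamby's actual API; each DataLoader stands in for one hospital silo:

```python
# Minimal FedAvg sketch: each round, every silo trains a copy of the
# global model on its own data, and the server averages the resulting
# weights, weighted by silo size. Generic code, not FLamby's API.
import copy
import torch
import torch.nn as nn

def fedavg(global_model, client_loaders, rounds=10, local_epochs=1, lr=0.01):
    for _ in range(rounds):
        states, sizes = [], []
        for loader in client_loaders:                 # one DataLoader per silo
            local = copy.deepcopy(global_model)       # start from global weights
            opt = torch.optim.SGD(local.parameters(), lr=lr)
            for _ in range(local_epochs):
                for x, y in loader:
                    opt.zero_grad()
                    nn.functional.cross_entropy(local(x), y).backward()
                    opt.step()
            states.append(local.state_dict())
            sizes.append(len(loader.dataset))
        total = sum(sizes)
        avg = {k: sum(s[k] * (n / total) for s, n in zip(states, sizes))
               for k in states[0]}                    # size-weighted average
        global_model.load_state_dict(avg)
    return global_model
```

With the real suite, the analogous experiment would plug FLamby's natural hospital splits in as client_loaders and compare FedAvg against other standard strategies, as in the paper's benchmark.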
Read more about FLamby on our blog
SecureFedYJ: a safe feature Gaussianization protocol for Federated Learning
The Yeo-Johnson (YJ) transformation is a standard parametrized per-feature unidimensional transformation often used to Gaussianize features in machine learning. In this paper, we investigate the problem of applying the YJ transformation in a cross-silo Federated Learning setting under privacy constraints. For the first time, we prove that the YJ negative log-likelihood is, in fact, convex, which allows us to optimize it with exponential search. We numerically show that the resulting algorithm is more stable than the state-of-the-art approach based on the Brent minimization method.
Building on this simple algorithm and Secure Multiparty Computation routines, we propose SecureFedYJ, a federated algorithm that performs a pooled-equivalent YJ transformation without leaking more information than the final fitted parameters do. Quantitative experiments on real data demonstrate that, in addition to being secure, our approach reliably normalizes features across silos as well as if data were pooled, making it a viable approach for safe federated feature Gaussianization.
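To make this concrete, the sketch below implements the building blocks in the clear, without the paper's secure multiparty computation layer: the standard Yeo-Johnson transform, its Gaussianizing negative log-likelihood, and a search exploiting the convexity result. We use a plain ternary search as a stand-in for the paper's exponential search, and the search interval is our own assumption:

```python
# Pooled (non-federated, non-secure) illustration of YJ fitting.
import numpy as np

def yeo_johnson(x, lam):
    """Standard per-feature Yeo-Johnson transform."""
    pos, out = x >= 0, np.empty_like(x, dtype=float)
    if abs(lam) > 1e-8:
        out[pos] = ((x[pos] + 1) ** lam - 1) / lam
    else:
        out[pos] = np.log1p(x[pos])
    if abs(lam - 2) > 1e-8:
        out[~pos] = -(((1 - x[~pos]) ** (2 - lam)) - 1) / (2 - lam)
    else:
        out[~pos] = -np.log1p(-x[~pos])
    return out

def yj_nll(x, lam):
    """Negative log-likelihood of a Gaussian fit after the YJ transform."""
    y = yeo_johnson(x, lam)
    return 0.5 * x.size * np.log(y.var()) - (lam - 1) * np.sign(x) @ np.log1p(np.abs(x))

def fit_lambda(x, lo=-3.0, hi=5.0, iters=60):
    """Minimize the NLL, convex per the paper, by ternary search."""
    for _ in range(iters):
        m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        lo, hi = (lo, m2) if yj_nll(x, m1) < yj_nll(x, m2) else (m1, hi)
    return 0.5 * (lo + hi)

x = np.random.exponential(size=1000) - 0.5   # skewed toy feature
lam = fit_lambda(x)
print(lam, yeo_johnson(x, lam).std())
```

SecureFedYJ runs essentially this kind of search jointly across silos, with secure aggregation ensuring that nothing beyond the final fitted parameters is revealed.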
Read more about SecureFedYJ on our blog
The world’s most proven federated learning software designed for healthcare research, now open sourced
During NeurIPS, we announced the open sourcing of our federated learning software Substra (previously known as Owkin Connect). This influential software is behind MELLODDY - the pharmaceutical industry's largest-ever collaborative project, which demonstrated for the first time that collaborative AI for drug discovery works at industrial scale. It also powers a wide range of collaborative research initiatives, such as the HealthChain consortium and the landmark Voice as a Biomarker of Health project, which aims to establish the human voice as a routine biomarker used to diagnose and treat diseases.
By open-sourcing Substra, with the LF AI & Data Foundation providing a neutral home based on open governance principles, we seek to enable data scientists to collaboratively train machine learning models, powering the next wave of federated learning research.
Ibrahim Haddad, PhD, General Manager of the LF AI & Data Foundation, part of the Linux Foundation, said:
"Innovation thrives in collaboration, not in isolation - and the open sourcing of Substra is a landmark moment in the use of collaborative AI in medical research. Researchers can now leverage privacy-preserving and secure federated learning software to drive cutting-edge collaborative medical research. Open source is undoubtedly the future of AI research."
At the conference, the data scientists we spoke with were pleased to see that federated learning is not only a hot research area but can also be applied in real industrial settings. Our team fielded many inquiries about the challenges of deploying such applications in the real world: how to explain the technology to non-scientific stakeholders, what security guarantees FL provides, and how to actually deploy software in a hospital.
Discover more about Substra
Looking ahead to the future of machine learning in 2023
The biggest lesson we learned at NeurIPS 2022 was how important community is among machine learning researchers and data scientists. Everyone we met was open and generous with their experience and talent. In our discussions, one thing was clear: a major challenge remains - how to securely train high-performing models on rich data without compromising privacy.
By open-sourcing Substra, Owkin is playing its part in helping ML experts collaboratively train models across siloed data. Coordinating groundbreaking collaborations like the FLamby datasets and benchmarks project and providing secure methodologies like SecureFedYJ to help researchers overcome the challenges of real-world data is just the beginning. We want to help machine learning researchers, developers, universities, hospitals, and pharmaceutical companies around the world benefit from our secure technologies and scientific discoveries.
The future of medical research is collaborative.
Collaborate with us