Federated learning: the most effective collaborative AI framework for healthcare

Duration: 5 mins

Tags: FL / ML

Authors:

Charlotte Fraboulet

The majority of healthcare data that exists today is greatly underutilized, remaining siloed across different hospitals, research centres and institutions. It can be challenging for data scientists to harness a sufficiently large volume of the right quality data to train machine learning models, largely due to privacy concerns. Increased collaboration between the institutions that own healthcare data is crucial to safely unlock the access necessary for AI initiatives to achieve higher-performing outcomes at scale.

The field of collaborative data science has grown significantly in the past few years, as evidenced by the emergence of numerous open source frameworks for collaborative machine learning, such as Substra, FATE, PySyft, TensorFlow Federated, and NVIDIA Clara. 

The techniques that enable collaborative machine learning have also diversified and now include distributed learning, federated learning, decentralized machine learning and swarm learning. Which approach is best for collaboration in a healthcare setting? At Owkin, we’ve spent the past four years enabling real world collaborative machine learning across large projects with world-class hospitals, research centres and pharmaceutical companies. Here’s why we bet big on federated learning as the future of healthcare collaboration.

An overview of collaborative machine learning techniques

It can be challenging to make a clear distinction between the different collaborative machine learning techniques in use today, as they share similar conceptual and architectural roots. Federated learning – a decentralized machine learning technique with a privacy component – is an extension of distributed learning, and swarm learning is a type of fully decentralized federated learning. The difference between these notions often lies in where the computation happens and how privacy is managed.

Distributed learning

Distributed learning is a machine learning setup in which data is physically stored in multiple distinct locations called nodes. In a typical distributed learning environment, nodes work in parallel to speed up model training. This approach is particularly useful for processing large volumes of data that cannot fit on a single machine. Distributed learning works with many machine learning models, but it is best suited to deep learning models, which are particularly data-hungry.
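
The parallel setup described above can be sketched in a few lines of NumPy. This is a minimal illustration simulated in a single process, not a production setup: three hypothetical nodes each compute a gradient on their own shard of the data, and a central node averages the gradients to update a shared linear model.

```python
import numpy as np

def local_gradient(w, X, y):
    """Gradient of the mean-squared error for a linear model on one node's shard."""
    return 2 * X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X, y = rng.normal(size=(600, 3)), rng.normal(size=600)

# Partition the dataset across three nodes (here simulated in one process).
shards = [(X[i::3], y[i::3]) for i in range(3)]

w = np.zeros(3)
for _ in range(100):
    # Each node computes a gradient on its own shard; in a real
    # deployment these computations would run in parallel.
    grads = [local_gradient(w, Xs, ys) for Xs, ys in shards]
    # A central node averages the gradients and updates the shared model.
    w -= 0.1 * np.mean(grads, axis=0)
```

Because the shards are equal-sized, the average of the shard gradients equals the gradient over the full dataset, so the nodes jointly perform ordinary gradient descent while each only ever reads its own partition.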

Decentralized machine learning

Decentralized machine learning is a less-strictly defined technique and refers to any distributed machine learning setup in which data is not centralized, and where there is no central node to aggregate computations; instead, direct peer-to-peer communication is used to aggregate results. Decentralized machine learning is very closely linked to the concept of edge computing, which is a distributed computing paradigm that brings computation and data storage closer to the sources of data.
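To make the "no central node" idea concrete, here is a toy gossip-averaging sketch: four hypothetical nodes on a ring, each starting from a different local model, repeatedly average parameters with their two neighbours only. No node ever aggregates for the whole network, yet all nodes converge to the global average.

```python
import numpy as np

# Ring topology: each node exchanges parameters only with its two neighbours.
n_nodes = 4
models = [np.full(2, float(i)) for i in range(n_nodes)]  # divergent local models

def gossip_round(models):
    """One round of peer-to-peer averaging; no central aggregator is involved."""
    new = []
    for i, w in enumerate(models):
        left, right = models[i - 1], models[(i + 1) % len(models)]
        new.append((w + left + right) / 3)
    return new

for _ in range(20):
    models = gossip_round(models)
# All nodes now hold (approximately) the average of the initial models,
# computed purely through neighbour-to-neighbour communication.
```

The neighbour averaging is symmetric, so the global mean of the parameters is preserved at every round while the differences between nodes shrink geometrically.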

Federated learning

Federated learning is an extension of distributed learning that enables collaboration in a privacy preserving manner, avoiding the potential concerns raised by data pooling. The term itself was coined by Google in 2016 for mobile applications, with a single central server both orchestrating the computations and aggregating the results, hereafter denoted as ‘vanilla’ federated learning. Since then, many other variants have been proposed by the research community to tackle the goal of collaborative machine learning across distributed clients under privacy constraints.

At Owkin, we define federated learning as any decentralized machine learning approach to train machine learning models with multiple data providers. Instead of gathering all data on a single server, the source data remains locked on local servers and only the predictive models and related quantities are exchanged between the servers. The goal of this approach is for each participant to benefit from a larger pool of data than their own, resulting in increased machine learning performance, while respecting data ownership and privacy. We stress that this broad definition can encompass both centralized and decentralized orchestration or aggregation.
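The flow described above can be sketched with a minimal FedAvg-style loop (a simulation in one process, with hypothetical client data, not Owkin's actual implementation): a server broadcasts the current model, each data provider trains it locally on data that never leaves its site, and only the updated weights travel back to be averaged.

```python
import numpy as np

def local_update(w, X, y, lr=0.05, epochs=5):
    """Client-side training: only the updated weights leave the client's server."""
    for _ in range(epochs):
        w = w - lr * 2 * X.T @ (X @ w - y) / len(y)
    return w

rng = np.random.default_rng(1)
true_w = np.array([1.0, -2.0])
# Three data providers; the raw (X, y) pairs never leave their site.
clients = []
for _ in range(3):
    X = rng.normal(size=(200, 2))
    clients.append((X, X @ true_w + 0.1 * rng.normal(size=200)))

w_global = np.zeros(2)
for _ in range(30):  # federated rounds
    # Server sends the current model; each client trains on its own data.
    local = [local_update(w_global.copy(), X, y) for X, y in clients]
    sizes = [len(y) for _, y in clients]
    # Federated averaging: weight each client's model by its dataset size.
    w_global = np.average(local, axis=0, weights=sizes)
```

Each participant ends up with a model trained on three times the data it holds locally, while only model parameters, never patient-level records, crossed institutional boundaries.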

Swarm learning

Swarm learning is a decentralized machine learning approach that combines edge computing with blockchain-based peer-to-peer networking and coordination, maintaining confidentiality without the need for a central coordinator. As recent research published in Nature points out, the main difference between swarm learning and vanilla federated learning lies in the dynamic choice of a node to perform the aggregation at each round. Although the original vanilla federated learning formulation operates with both a central coordinator and server, in general, federated learning does not necessarily require a central coordinator, nor a central aggregator. Simply put, swarm learning can be considered as a form of federated learning.
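The per-round election of an aggregator can be illustrated with a toy sketch (hypothetical site names and scalar "models"; the published swarm learning approach uses a blockchain smart contract for the election step, which is abstracted away here as a random choice):

```python
import random

# Each node holds a local model (a scalar here, for brevity). Unlike vanilla
# federated learning, there is no fixed central server: a leader is elected
# each round to perform the aggregation.
nodes = {"site_a": 0.2, "site_b": 0.8, "site_c": 0.5}  # hypothetical sites

for round_id in range(3):
    # Dynamic election of the aggregating node for this round.
    leader = random.choice(list(nodes))
    # The elected leader averages the peers' parameters...
    merged = sum(nodes.values()) / len(nodes)
    # ...and redistributes the merged model to all peers.
    nodes = {name: merged for name in nodes}
```

Swapping the fixed server for a rotating leader changes who aggregates, not what is aggregated, which is why swarm learning can be read as a decentralized variant of federated learning.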

Why does Owkin use federated learning?

At its origin, the tech stack of Owkin Connect, our federated learning software, was very close to swarm learning as defined above. We developed Connect to orchestrate distributed machine learning tasks in a secure way, based on Substra, an open-source software framework that leverages the distributed ledger technology Hyperledger Fabric. Although we believed in blockchain-based decentralized orchestration at the outset, in practice we learned this methodology is better suited to simulated scenarios rather than real-world settings. 

In particular, consortium agreement is essential to creating collaboration in healthcare. In September 2018, we launched HealthChain, a federated learning project with four French hospitals. Due to the high sensitivity and value of healthcare related data, a strong contractual setup to accept new joiners in the project was paramount. The strength of these consortium agreements made a trustless blockchain redundant: blockchains are useful where trust is lacking, but sound contracts resolve trust issues directly.

In March 2019, we started to deploy Owkin Connect in hospitals. Traditional hospital security relies on whitelist (or allowlist) systems, cybersecurity strategies that grant network access only to trusted individuals with specific IP addresses. Because every new joiner needed the agreement of all hospitals and had to be added to the network manually, the precise list of stakeholders had to be defined well in advance of launching the initiative. With a decentralized approach requiring peer-to-peer connections, scaling the network would have been impossible.

In July this year, the MELLODDY project, the largest ever use of federated learning in the pharmaceutical industry, demonstrated that drug discovery research using a federated model outperforms standalone models. The three-year, EU-funded project involving 17 partners was a landmark moment in the use of AI in medical research and demonstrated how federated learning can improve model performance while protecting commercially-sensitive data.

Building real-world collaborations

In machine learning literature today, federated learning and swarm learning are at times compared as different techniques. However, as federated learning can be done without a centralized orchestrator, we can technically consider swarm learning as a form of federated learning. Our experience managing healthcare consortiums training machine learning models on data at scale has demonstrated that federated learning with centralized orchestration is better suited for real-world healthcare projects, mainly due to the privacy, IT and IP requirements of partners.