Federated Learning

A decentralised technique to train algorithms with multiple data providers.

AI algorithms are very ‘data hungry.’ This means that to work well, they must be fed very large amounts of data about lots of different types of people. This usually requires taking different datasets from many different places (for example, different hospitals in different parts of a country, or even from different countries), and adding them all together in one central location. 

Unfortunately, this can cause problems, including: 

  • Security: making the central dataset more vulnerable to cyber-attacks
  • Privacy: posing a risk to patient privacy and confidentiality 
  • Compatibility: making it difficult to compare data that is formatted in different ways, for example, combining one dataset using metric measurements (e.g., grams) and one using imperial measurements (e.g., ounces)
  • Collaboration: making it hard for scientists from different organisations to work together

A solution to this is federated learning. In federated learning, none of the datasets ‘belonging’ to different hospitals leave their local hospital. Instead, the algorithm gets sent on a visit to each dataset, learns something new from each of the datasets, and updates itself accordingly. This process is repeated until all the different organisations have reached an agreement that the algorithm is now ‘complete.’

An analogy may make this easier. Imagine there is a baking competition, where six bakers are each trying to make the World’s Best Victoria Sponge Cake. All six bakers have the same basic ingredients: sugar, butter, eggs, flour, baking powder, and milk, but each baker is allowed to vary how much of each ingredient they use (e.g., 2 eggs or 4 eggs), and they can add extra ingredients if they wish (e.g., vanilla essence or fresh fruit), to come up with the ultimate recipe (i.e., the ultimate algorithm).

In the centralised version of this competition, all the bakers would come to one location and bake their cakes in front of each other. The bakers in this scenario can each see what the others are doing (privacy challenge); the power may go out at the central location (security challenge); and they cannot consult friends for help or advice (collaboration challenge). One central judge then picks the recipe they like the best.

In the federated version of the competition, each of the bakers bakes the cake in their own home, where there is no risk of being copied, they can get help from their friends, and if the power goes out in their house they can use their neighbour’s oven. Once each baker is satisfied with their recipe, they each swap recipes and begin improving the new recipe. This process repeats until all six competitors agree that they have developed the best possible recipe (i.e., the best possible algorithm) and they win the competition as a team.

The development of safe and effective AI algorithms for healthcare requires access to very large volumes of population-representative data. Algorithms trained on datasets that are too small, or non-representative, may have issues with both accuracy and bias. The challenge is that very rarely (if ever) do datasets of the necessary size and representativeness exist in one location, under the control of one organisation. It is far more common for AI training datasets to be created by bringing together multiple datasets, created in different geographic locations and settings and controlled by different organisations, into one giant dataset hosted on a centralised server. There are, however, a number of problems with this approach:

  • Security
    Creating a very large, centralised dataset creates a security and ‘single point of failure’ challenge. A single source of a very large amount of highly sensitive patient data may be an attractive target for a cyber-attack, and protecting against any such attack is likely to be resource-intensive and extremely difficult.
  • Privacy
    Patient data, for example electronic health record (EHR) data, is extremely sensitive and notoriously difficult to anonymise. Traditional methods of de-identification or ‘pseudonymisation’ involve removing names and addresses from patient records, but this does not sufficiently protect against privacy threats. It is often possible to re-identify patients from other information contained within their record, such as when they had a baby and where they might live. The chances of re-identification increase as the population coverage of a dataset increases: in a dataset that represents only 10% of the population of interest, there is only a 10% chance that a ‘unique match’ really is the person an attacker is trying to identify, whereas in a dataset that covers 100% of that population, a ‘unique match’ is certain to be that person.
  • Heterogeneity
    Most health data are held in silos, vary widely, and are often stored in inconsistent formats. For example, one hospital may have patients with ages ranging from 12 to 20, while another has patients over 70 years old. One hospital could also have a patient cohort coming from a different diagnostic pathway, with more advanced cancers and larger tumours than another hospital. Furthermore, some hospitals may have different ways of capturing samples or categorising their data.

    While a human brain could ignore missing data or recognise some inconsistencies during the learning process, machine learning algorithms require a precise, systematic data structure to learn from during training. Minor inconsistencies in the datasets used to train an AI model could lead to incorrect decisions or a multitude of errors, or mean the algorithm cannot function at all. As such, health data, when aggregated, typically require a significant amount of ‘pre-processing’ so that all data in the project (e.g., age, size of tumour) use a common scale, without distorting the underlying differences in the ranges of values. This can be a very labour-intensive and inefficient process (a minimal sketch of this kind of rescaling follows this list).
  • Collaboration
    The best (i.e., highest-performing) algorithms are designed, developed, and deployed by large, diverse teams combining multiple different skills and life experiences. This minimises the chances of the resultant algorithms having ‘epistemic’ problems such as bias or false conclusions. Giving a very large number of people access to a single dataset exacerbates the privacy and security issues described above, and may also be legally challenging if, for example, the data scientists in question are distributed across state or national borders.
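
To make the ‘common scale’ point above concrete, here is a minimal sketch of the kind of harmonisation and rescaling described in the Heterogeneity bullet. It is illustrative only: the unit conversion, field names, and age ranges are invented, and real pre-processing pipelines are far more involved.

```python
# A minimal, illustrative sketch of pre-processing: harmonising inconsistent
# units and mapping a feature onto a common scale. All values are invented.

OZ_TO_GRAMS = 28.3495

def to_grams(value: float, unit: str) -> float:
    """Harmonise a weight measurement to grams, whatever unit it arrived in."""
    return value * OZ_TO_GRAMS if unit == "oz" else value

def min_max_scale(values: list[float]) -> list[float]:
    """Map values onto [0, 1] while preserving their relative spacing."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # degenerate case: all values identical
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

print(to_grams(8.0, "oz"))            # a weight recorded in ounces, expressed in grams

# Two hypothetical hospitals with very different age ranges (12-20 vs over 70).
hospital_a_ages = [12.0, 15.0, 18.0, 20.0]
hospital_b_ages = [71.0, 75.0, 82.0, 90.0]

# Scaling the pooled ages keeps the two cohorts comparable on one axis;
# scaling each site in isolation would erase the genuine difference between them.
pooled = hospital_a_ages + hospital_b_ages
print(min_max_scale(pooled))
```

Multiply this by every field, every site, and every local coding convention, and it becomes clear why centralised pre-processing is so labour intensive.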

These challenges are so significant that, increasingly, rather than trying to overcome them, data scientists and software engineers are trying to ensure that they never arise by moving away from reliance on large, centralised datasets, and towards federated learning. 

Federated learning is a technique for training AI algorithms in a decentralised fashion. Instead of bringing all the different datasets from different organisations into one centralised dataset and training the algorithm there, a ‘baseline’ algorithm is sent to each of the different (locally pre-processed) datasets, where it is trained locally. Once this local training process has been completed, the ‘updates’ that have been learned at each location (e.g., changes to the weights of different parameters) are aggregated, either on a centralised server or at each decentralised node.

The baseline algorithm is then updated and sent out again, in an iterative fashion, until a consensus regarding its design is reached. This means the different datasets never have to move; only the algorithm itself ‘moves’ between locations, so the issues described above are much less of a concern. In addition, algorithms trained in this federated fashion have the potential to be more accurate and to reach more reliable conclusions than algorithms trained on potentially inaccurately merged, centralised datasets.
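
As a rough illustration of this loop, the sketch below follows the spirit of the widely used ‘federated averaging’ approach, with a toy linear model and synthetic data standing in for hospitals and patient records. Everything here (the number of sites, dataset sizes, learning rate, and number of rounds) is invented for illustration; it is not Owkin’s implementation, and real systems add secure aggregation, orchestration, and further privacy safeguards.

```python
# A simplified federated-averaging loop with a toy linear model and synthetic data.
import numpy as np

rng = np.random.default_rng(0)

def local_update(weights, X, y, lr=0.1, epochs=5):
    """Train the shared model on one site's local data and return the new weights."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of the mean squared error
        w -= lr * grad
    return w

# Three hypothetical hospitals with local datasets of different sizes (never pooled).
true_w = np.array([2.0, -1.0])
hospitals = []
for n in (40, 120, 300):
    X = rng.normal(size=(n, 2))
    y = X @ true_w + rng.normal(scale=0.1, size=n)
    hospitals.append((X, y))

# The 'baseline' model that travels between sites instead of the data.
global_w = np.zeros(2)

for _ in range(10):                             # training rounds
    local_weights, sizes = [], []
    for X, y in hospitals:                      # local training at each site
        local_weights.append(local_update(global_w, X, y))
        sizes.append(len(y))
    # Aggregate the local results, weighting each site by its dataset size.
    global_w = np.average(local_weights, axis=0, weights=sizes)

print("learned weights:", global_w)             # should approach [2.0, -1.0]
```

Note that only the model weights travel between the server and the sites; the raw records never move, and weighting the average by dataset size lets larger sites contribute proportionally more to each round.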

Although this may seem similar to distributed learning, there are distinct differences. For example, distributed learning assumes that the local datasets are ‘independent and identically distributed’ and all roughly the same size. Federated learning makes neither of these assumptions. It is, therefore, better equipped for dealing with the heterogeneous datasets typical of the healthcare industry and is not negatively affected by the fact that the datasets may vary considerably in size.

An Owkin example

In a paper published in Nature, the Owkin team used federated learning to develop an algorithm that could predict how different women recently diagnosed with triple-negative breast cancer would respond to neoadjuvant chemotherapy (the standard treatment). The resultant proof-of-concept algorithm could accurately predict the response to treatment – matching the performance of the current best available prediction methods.

Further reading
  • Rieke, Nicola et al. 2020. ‘The Future of Digital Health with Federated Learning’. npj Digital Medicine 3(1): 119.