Privacy-preserving machine learning to protect health data

Duration: 21 mins

Tags: FL


Date: February 25th, 2021


This first post opens our dedicated series of three articles on federated learning applied to healthcare. Discover its impact on health data privacy and confidentiality now. Also, catch the next articles on collaboration and traceability in the coming weeks!

Access to large volumes of high-quality clinical data is currently one of the biggest bottlenecks for machine learning technology in healthcare. Health data is extremely sensitive and requires handling with particular care, demanding tight regulations. This article explores how this challenge can be overcome through federated learning, a machine learning technology that drastically reduces privacy concerns by keeping patient data stored securely onsite during model training. Owkin brings this privacy-preserving machine learning technology to healthcare stakeholders, unlocking the potential for safer, better, and more effective medical research.

Key highlights

  1. The potential of machine learning (ML) technology in healthcare is currently limited by fundamental data access challenges, such as ‘siloed’ data storage within many different hospitals. Concerns about insufficient transparency of ML systems and inadequate privacy settings to protect highly sensitive health data make big data access even more taxing.

  2. Federated learning (FL) is on track to be ‘the next big thing’ in advancing medical research without compromising health data privacy. It allows an ML algorithm to learn on different datasets without removing the data from where they are stored. Hospitals and research centers keep control over data governance and GDPR compliance. Additional privacy-preserving measures such as differential privacy and secure aggregation allow for new ways to protect data in FL.

  3. Owkin has been driving FL in healthcare to promote privacy-preserving ML and revolutionise the way medical research is done. Facilitated by its FL software Owkin Connect, FL has already been used successfully to predict treatment response in breast cancer, and more collaborative research projects across disease areas are underway at Owkin.

  4. FL-based medical research has a significant impact on healthcare stakeholders. Doctors and patients benefit from better predictions for treatment and diagnosis when more data is available to learn from, whilst trusting that their sensitive data are protected. Researchers can build better models and safely expand their collaborative network, whereas healthcare providers move closer to efficient, value-based healthcare.

Imagine for a moment you are a food scientist 

…who has invented a method to make synthetic honey. You hear about the harmful impact of climate change on global bee populations* – bees struggle to produce enough honey to feed themselves and their offspring because nectar-rich flowers are declining in their natural habitats. 

You realise that your invention could help the bees – by offering access to your recipe for synthetic honey and providing them with a complementary food source. However, in order to succeed, you need the bees to collaborate. Only by combining their collective honey expertise will you be able to create a recipe that contains the right nutrients for the bees. You send out ambassadors to the best beehives in the world, asking them to share information about their proprietary honey recipes.  But unfortunately, your emissaries fail to fulfil their mission.

Why? First of all, imagine the daunting logistics challenge of gathering this information from hundreds of beehives spread across the globe. What’s more, bees are quite reluctant to share their honey expertise with other colonies. It is highly confidential information passed down from bee generation to generation and under no circumstances is it to fall into the wrong wings. Lastly, the regulatory bee consortium has done an excellent job at restricting the export of honey recipes from beehives to protect the bee industry from honey counterfeits and misuse of the liquid gold. Honey is close to the bees’ heart and soul, after all.

How can you solve this problem and get the bees to collaborate?

* Whilst this is a fictional analogy to explain Federated Learning, the impact of climate change on bee populations is real.

The challenge of gaining data access in healthcare

This ‘honey conundrum’ is a fitting analogy for the challenge we are facing in healthcare. On one hand, we have developed powerful AI-based technology, especially machine learning (ML) approaches. This technology can help us answer burning questions in medical research: which patients should be included in clinical trials to find the right treatments faster? Which molecules are the most promising targets for drug development? How can we best extract information from brain images to detect cancer, or possibly early onset of COVID-19?

ML approaches are, however, “data-hungry” (just like your honey recipe, which will only be suitable for the bees’ needs if they contribute their expertise). Models need access to large and diverse datasets to learn, improve accuracy and remove bias (1). For instance, a model to predict heart attack symptoms in the UK population is unlikely to be widely applicable if it has only been trained on the data collected at local GP practices in a suburb of London with a predominantly young, Caucasian population.

Without access to sufficient data, ML will be prevented from reaching its full potential and, ultimately, from making the transition from research to clinical practice. (2)

Data gathering in healthcare is difficult for several technical reasons (3,4).

(i) The majority of patient data are stored in “silos”, meaning at different hospitals and research centres, distributed across servers and databases, and therefore challenging to access. The other type of valuable data, chemical compounds used for drug discovery and development, are usually kept in-house by pharma companies. Those ‘chemical libraries’ are unlikely to be shared easily due to the competitive nature of the healthcare industry. 

(ii) Data scientists are dealing with heterogeneous datasets from multiple sources (e.g. electronic health records, national or international databases, clinical trial results) and in different formats (e.g. medical history, laboratory test results, radiology images). This can make data collection and preparation for analysis tedious, lengthy and costly.  

(iii) Another challenge arises from using anonymised data for medical research. Anonymisation means either completely removing any identifiable information or, more commonly, ‘pseudonymising’ data (such as replacing a patient’s name with a code) to protect a person’s identity. Whilst this protects privacy, it can actually decrease the performance of an algorithm (2). Consider again the case of training a model to predict heart attack symptoms, which would hardly be clinically useful without including patients’ date of birth or gender in the training dataset. Ultimately, whether anonymised or not, the data belong to patients who may or may not consent to their data being sold or transferred to third parties.

The constraints are therefore more ‘human’ than ‘tech’.

If we consider information about honey-making the ‘essence of life’ for the bees, our healthcare data are just as sensitive, even intimate, for us and any misuse “may feel like a violation to our body” (5). In addition, ML approaches often seem like intimidating ‘black boxes’ that prevent us from fully understanding how a complex ML algorithm was built and makes predictions (6,7). 

Recent privacy breaches in the healthcare sector, made even more visible during the COVID pandemic (8), are also gravely undermining trust in ML applications and those who handle our health data. Imagine someone having to board a plane for the first time and not being able to trust the engine, the pilot, the attendants, and the weather forecast! The aim, however, is not ‘blind’ trust in ML approaches but rather establishing an ‘optimal trust level’, rooted in sensible safeguarding of our data (9).

We have, of course, the GDPR (General Data Protection Regulation) in place in Europe to protect health data privacy (10). The dilemma that arises with a highly protective framework is that many ML approaches fail to fulfil the standards of what is considered trustworthy AI, as proposed by an independent expert panel at the European Commission: lawful, ethical and robust technology (11).

Img 1: Data Access problem in healthcare

AI systems in healthcare have not reached their potential yet because data sit in silos and access is limited through strict privacy regulations such as GDPR (General Data Protection Regulation).

Introducing federated learning as a health data privacy-preserving machine-learning technology

What if we could turn the current ML approach on its head and leave data where they are, i.e. stored locally within hospital firewalls, but let the algorithm travel to the data, allowing the models to be trained onsite?

In bee-speak, this would mean that each beehive gets a copy of your master recipe, optimises it according to their customs, and sends back only the updated recipe, anonymised and encrypted. The updates are averaged over several beehives, and a little bit of ‘random noise’ is added to each recipe, so that none of the information can be traced back to an individual beehive or bee, before combining them into your master recipe. Each beehive benefits by having access to your life-saving synthetic honey, based on a recipe which is ever-improving through more and more contributing hives. 

This, in essence, is federated learning (FL) – a privacy-preserving, decentralised and collaborative machine learning technology. Whilst FL first emerged from other fields – think mobile phones or self-driving cars learning collaboratively in IoT-style (12,13) – it is particularly well-suited for healthcare (2,14,15). It means medical researchers can send their algorithms to where the data is stored and train them onsite. This removes the need to move ‘siloed’ sensitive data around.
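To make the ‘travelling recipe’ idea concrete, here is a minimal sketch of one common FL scheme, federated averaging, written in plain Python. It is an illustration only and not Owkin Connect or Substra code: the three ‘sites’, their synthetic data (generated around y = 3x + 1), the tiny linear model, and the function names `local_update`, `federated_average` and `make_site_data` are all assumptions made for this example.

```python
import random

def local_update(weights, data, lr=0.1, epochs=20):
    """Run a few gradient-descent steps on one site's private data.

    Only the updated weights are returned; `data` never leaves the site.
    """
    w, b = weights
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)

def federated_average(updates, sizes):
    """Combine the site updates, weighting each by its number of samples."""
    total = sum(sizes)
    w = sum(u[0] * n for u, n in zip(updates, sizes)) / total
    b = sum(u[1] * n for u, n in zip(updates, sizes)) / total
    return (w, b)

def make_site_data(rng, n, lo, hi):
    """Hypothetical private dataset following y = 3x + 1 plus small noise."""
    xs = [rng.uniform(lo, hi) for _ in range(n)]
    return [(x, 3 * x + 1 + rng.gauss(0, 0.05)) for x in xs]

rng = random.Random(0)
# Three sites with different sample sizes and different input ranges.
sites = [
    make_site_data(rng, 50, 0.0, 1.0),
    make_site_data(rng, 80, 0.5, 1.5),
    make_site_data(rng, 30, -1.0, 0.0),
]

global_weights = (0.0, 0.0)  # the shared "master recipe"
for _ in range(40):
    # Each site trains locally, starting from the current global model.
    updates = [local_update(global_weights, data) for data in sites]
    # The coordinator only ever sees weights, never the raw data.
    global_weights = federated_average(updates, [len(d) for d in sites])

print(global_weights)
```

In this toy setting the global model should end up close to the slope and intercept used to generate the data, even though no site shares a single data point. Real deployments differ in many respects (model size, communication, unreliable sites), so treat this as a sketch of the principle only.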

Data owners (hospitals, research centers) remain in control of their data according to their own privacy guidelines and associated regulatory standards (16). Consequently, patients do not have to worry about what happens to their data beyond the hospital firewall (or outside the beehive, for that matter), as data won’t ever leave this safe space.

However, additional safety measures are key to ensuring that a model trained with FL does not reveal confidential information about the training data. Examples include secure aggregation, differential privacy (remember summarizing insights over several beehives and adding noise to remove links to their origins?) and encryption (17). These layers of privacy promote trust in ML approaches and open the doors for large-scale, collaborative research projects.
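The two ideas named above can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the protocol used by any particular product: the update vectors, `max_norm` and `sigma` are made-up values, no formal privacy guarantee is computed, and a single random generator stands in for the pairwise shared secrets (and the integer arithmetic) that a real secure aggregation protocol would use.

```python
import math
import random

def clip(update, max_norm):
    """Scale an update down so its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in update]

def add_noise(update, sigma, rng):
    """Add independent Gaussian noise to every coordinate."""
    return [v + rng.gauss(0.0, sigma) for v in update]

def pairwise_masks(n_sites, dim, rng):
    """Build one mask per site such that all masks sum to zero.

    Each pair (i, j) shares a random vector: site i adds it and site j
    subtracts it, so the vectors cancel out in the aggregate.
    """
    masks = [[0.0] * dim for _ in range(n_sites)]
    for i in range(n_sites):
        for j in range(i + 1, n_sites):
            shared = [rng.uniform(-100, 100) for _ in range(dim)]
            for k in range(dim):
                masks[i][k] += shared[k]
                masks[j][k] -= shared[k]
    return masks

rng = random.Random(42)

# Hypothetical model updates computed locally at three hospitals.
raw_updates = [[0.5, -1.2, 3.0], [0.4, -0.9, 0.1], [2.0, 2.0, -1.0]]

# Step 1 (differential-privacy style): bound each site's influence,
# then blur its update with noise before anything is shared.
private_updates = [
    add_noise(clip(u, max_norm=1.0), sigma=0.01, rng=rng) for u in raw_updates
]

# Step 2 (secure aggregation style): each site hides its update behind a mask.
masks = pairwise_masks(len(private_updates), dim=3, rng=rng)
masked_updates = [
    [v + m for v, m in zip(update, mask)]
    for update, mask in zip(private_updates, masks)
]

# Step 3: the server sums the masked updates. The masks cancel, so it
# learns the total (and hence the average) but no individual update.
aggregate = [sum(column) for column in zip(*masked_updates)]
average = [v / len(masked_updates) for v in aggregate]
print(average)
```

Each masked update looks like random numbers on its own; only the sum is meaningful. Clipping and noise address a different risk, namely that the aggregated model itself could leak information about individual records.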

Federated Learning in action at Owkin: decision-making support for physicians

Let’s look at how this works in practice at Owkin, a French-American startup using AI and FL to accelerate medical research. The team has dedicated three years of research and development to building its FL software, Owkin Connect, which connects multiple data sources without accessing private data. The platform is based on Substra, an open-source framework for secure and traceable ML orchestration in multi-partner settings.

Additional privacy-preserving technologies (such as secure aggregation and differentially private model training) (17,18) have been implemented to prevent data leakage and make each step of data handling safe and GDPR-compliant. Owkin also addresses the ‘black-box problem’ in ML by ensuring model transparency with distributed ledger technology (see upcoming article).

At the beginning of 2020, the HealthChain project provided proof of concept for Owkin Connect (see this blog article). Together with clinical, research and technology partners (Institut Curie, Nantes University Hospital Center and The Centre Léon Bérard, among others), it was shown that an algorithm can be trained successfully on histology images, siloed at different clinical centers, to predict treatment responses in breast cancer. By demonstrating robustness and safety of the technology, the stage was set for further collaborative research projects and eventually clinical applications in cancer, heart failure and other disease areas (more on this in the next article).

“People don’t realize the sand is shifting under their feet and that we can now in fact achieve privacy and utility at the same time.” (14)

Physicians, patients and data scientists benefitting from FL

But how, you might ask, will FL impact those involved in healthcare (2)? Physicians, being closest to patients, will benefit from diagnostic support through FL-trained models. These models can help them to recognize potential biases in their decision making, as well as assist them with insights generated from larger datasets than they may have access to within their own institution.

For patients, similarly, it means access to greater medical expertise, particularly when being treated for a rare condition, or at a hospital which does not have the respective expertise. In addition, greater trust in AI systems can lead to a greater willingness to donate data, which in turn boosts model accuracy. Hospitals, whilst remaining in charge of data ownership and governance, can tap into external AI capabilities and enhance innovation.

FL technology also opens endless possibilities for data scientists and researchers to work on emerging research questions and improve their models, trained on bigger and more representative datasets. Better predictive models also reduce healthcare cost for providers and insurers, which are under increasing pressure to provide value-based care for better outcomes.

Conclusion and outlook

FL offers exciting new opportunities for research breakthroughs by overcoming fundamental privacy concerns. With more valuable use cases in healthcare expected to emerge in the near future (2), outstanding technical questions can be addressed, and more healthcare partners will be willing to collaborate in a safe and secure manner. This will be a true paradigm shift in precision medicine and will ultimately improve patient care – better honey indeed.

The next article of this three-part series will explore how FL enables new forms of collaboration between academia, clinical centres and pharma companies. Part three will dive into the potential of FL for tackling the black-box problem of ML with distributed ledger technology.


  1. Cahan, E.M., Hernandez-Boussard, T., Thadaney-Israni, S. et al. Putting the data before the algorithm in big data addressing personalized healthcare. npj Digit. Med. 2, 78 (2019).

  2. Rieke, N., Hancox, J., Li, W. et al. The future of digital health with federated learning. npj Digit. Med. 3, 119 (2020)

  3. Ghassemi, M., Naumann, T., Schulam, P., Beam, A.L., Chen, I.Y., Ranganath, R. A Review of Challenges and Opportunities in Machine Learning for Health. AMIA Jt Summits Transl Sci Proc. 2020;2020:191-200.

  4. Agrawal, R., Prabakaran, S. Big data in digital healthcare: lessons learnt and recommendations for general practice. Heredity 124, 525–534 (2020).

  5. Miller R., “Healthcare Data: Federated Learning, Sovereign Identity, and Standardization”, Podcast Voices of the Data Economy (2020). [Accessed 24/02/2021]

  6. Watson, DS., Krutzinna, J., Bruce, IN., Griffiths, CE., McInnes, IB., Barnes, MR., Floridi L., Clinical applications of machine learning algorithms: beyond the black box, BMJ 2019;364:l886 (2019)

  7. Haibe-Kains, B., Adam, G.A., Hosny, A. et al. Transparency and reproducibility in artificial intelligence. Nature 586, E14–E16 (2020). 

  8. Ienca, M., Vayena, E. On the responsible use of digital data to tackle the COVID-19 pandemic. Nat Med 26, 463–464 (2020).

  9. Asan O, Bayrak AE, Choudhury A. Artificial Intelligence and Human Trust in Healthcare: Focus on Clinicians. J Med Internet Res. 2020;22(6):e15154. (2020) doi:10.2196/15154

  10. Website GDPR explained. [Accessed 24/02/2021] 

  11. Report Ethics guidelines for trustworthy AI, European Commission (2019) [Accessed 24/02/2021] 

  12. McMahan, B., Ramage D., “Federated Learning: Collaborative Machine Learning without Centralized Training Data”, Google AI Blog (2017) [Accessed 24/02/2021]

  13. Wolf, S. F., Federated Learning, Computer Science Blog, (2019) [Accessed 24/02/2021] 

  14. Hao, K., “A little-known AI method can train on your health data without threatening your privacy”, Technology Review (2019) [Accessed 24/02/2021]

  15. Kaissis, G.A., Makowski, M.R., Rückert, D. et al. Secure, privacy-preserving and federated machine learning in medical imaging. Nat Mach Intell 2, 305–311 (2020).

  16. Truong, N., Sun, K., Wang, S., Guitton, F., Guo, Y. Privacy Preservation in Federated Learning: Insights from the GDPR Perspective, arXiv:2011.05411v4 [cs.CR]

  17. Lake, J., “Federated learning: Is it really better for your privacy and security?”, comparitech (2019) [Accessed 24/02/2021]

  18. Beguier, C., Tramel, E.T. SAFER: Sparse Secure Aggregation for Federated Learning, arXiv:2007.14861v2 [stat.ML] (2020)

About the author: Marion Oberhuber, PhD. Passionate about merging insights from neuroscience, psychology, behavior change and the life sciences industry to make complex things understandable.