Tackling the challenge of normalizing data for machine learning while preserving privacy

Machine learning (ML) algorithms are only as effective as the data they are trained on. To perform accurately, they require large volumes of high-quality data, and meeting that requirement is significantly harder in healthcare research. Drawing on a wide range of data sources can better represent diverse patient populations and drive more accurate results, but many practical problems arise when training ML models on datasets sourced from different institutions. Healthcare data is highly heterogeneous, even more so in collaborative projects across multiple hospitals with different methodologies and IT infrastructures.

It’s critical that the data used to train machine learning models shares a common scale, but different institutions may have patients with different ranges of values. For example, one hospital may have patients aged 12 to 20, while another has patients over 70 years old. Or one hospital could have a patient cohort coming from a different diagnostic pathway, with more advanced cancer and bigger tumours than another hospital. Furthermore, hospitals may capture samples or categorize their data in different ways. While a human brain could ignore missing data or recognise some inconsistencies during the learning process, machine learning algorithms require a precise, systematic data structure to learn from. Minor inconsistencies in the datasets used to train the ML model could lead to incorrect decisions, cause a multitude of errors, or prevent the algorithm from functioning at all.

What is data normalization?

Although it may not be the most exciting aspect of the research process, data scientists are commonly estimated to spend about 80% of their time on data collection and preparation. Normalizing data from different sources requires significant pre-processing work before it is ready to be used in AI applications: the values are mathematically adjusted so that all data in the project (e.g. age, tumor size) uses a common scale, without distorting the underlying differences in the ranges of values. This is a fairly straightforward process when the research data is pooled in a central location, where it can be checked for completeness and put on the same scale consistently.
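
To make "a common scale" concrete, here is a minimal sketch of one standard normalization, z-score standardization, applied to pooled data. The feature values and the `standardize` helper are made up for illustration and are not taken from any Owkin project.

```python
from statistics import mean, pstdev

def standardize(values):
    """Rescale values to zero mean and unit standard deviation (z-scores)."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant feature carries no scale information.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Illustrative, made-up features on very different scales.
ages_years = [12, 15, 20, 71, 84]
tumor_sizes_mm = [4.0, 9.5, 12.0, 30.0, 41.5]

z_ages = standardize(ages_years)
z_sizes = standardize(tumor_sizes_mm)
```

After this step both features are centered on zero with unit spread, while the ordering and relative distances between patients within each feature are preserved.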

With federated learning (FL), machine learning models are trained securely across multiple data holders without centralising the data. Given the inherent sensitivity of medical data, healthcare is a natural application of federated learning: a biomedical ML model can be trained across different hospitals in a secure distributed IT architecture, as if all the datasets were pooled on a central server. However, the inability to actually pool the data in a central location raises new questions about how to prepare and pre-process the data so it is ready to be used for training in this way.
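
As a simplified illustration of what "pooled-equivalent" pre-processing can mean, the sketch below computes the mean and variance of a feature across two hospitals without moving any patient-level values: each site shares only a count, a sum and a sum of squares. The function names and data are hypothetical. Sharing these summaries in the clear is itself an assumption of the sketch, and a real deployment would need to protect them as well.

```python
from statistics import mean, pvariance

def local_summary(values):
    """Computed inside each hospital: count, sum and sum of squares."""
    return (len(values), sum(values), sum(v * v for v in values))

def pooled_mean_variance(summaries):
    """Computed by the aggregator from the per-hospital summaries only."""
    n = sum(s[0] for s in summaries)
    total = sum(s[1] for s in summaries)
    total_sq = sum(s[2] for s in summaries)
    mu = total / n
    variance = total_sq / n - mu * mu  # population variance
    return mu, variance

# Made-up ages held by two different hospitals.
hospital_a = [12.0, 15.0, 18.0, 20.0]
hospital_b = [71.0, 76.0, 84.0]

mu, variance = pooled_mean_variance(
    [local_summary(hospital_a), local_summary(hospital_b)]
)
```

The result matches what a central server would compute on the concatenated data, which is the property a federated normalization method aims to keep.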

Nearly all federated learning research in the healthcare field to date has been performed on simulated data. While different ways of adapting federated training algorithms have been proposed to automatically tackle heterogeneity (e.g. FedProx, SCAFFOLD), these solutions do not address data normalization prior to FL training. Furthermore, standard FL algorithms such as FedAvg might not provide sufficient privacy guarantees for highly sensitive data, as they can be vulnerable to attacks such as data reconstruction (in which the private training data can be recovered in only a few gradient steps).

The inability to accurately and securely pre-process data has likely been a blocker to advancing federated learning research in real-world healthcare settings. Owkin scientists have taken on this challenge, investigating how common normalization approaches can be adapted to real-world healthcare data in a federated setting and combined with Secure Multiparty Computation protocols such as Secure Aggregation to mitigate privacy concerns.

Common approaches to normalizing healthcare data for machine learning

Many types of data routinely captured in clinical settings, such as blood pressure readings, electronic health records, blood cell counts and clinical trial data, fall under the umbrella of ‘clinical tabular data’. The best-known approach to normalizing tabular data for machine learning is the Box-Cox transformation, a statistical technique that transforms a variable so that its distribution more closely resembles a normal distribution, which improves the ML model training process.

*Example transformation of raw data to more closely resemble a normal distribution.*
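
For reference, the Box-Cox transformation mentioned above has a simple closed form for strictly positive values. The following is a minimal sketch of the textbook definition; the `box_cox` helper and the sample values are illustrative only.

```python
import math

def box_cox(x, lam):
    """Textbook Box-Cox transform of a strictly positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox is only defined for strictly positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

# A strongly right-skewed, made-up series: each value doubles the previous one.
raw = [1.0, 2.0, 4.0, 8.0, 16.0]

# With lam = 0 the transform is the natural logarithm, so the doubling
# series becomes evenly spaced.
logged = [box_cox(v, 0.0) for v in raw]
gaps = [b - a for a, b in zip(logged, logged[1:])]
```

In practice the parameter is not fixed by hand but fitted to the data, which is the step the rest of this article is concerned with.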

A commonly used extension of the Box-Cox transformation is the Yeo-Johnson (YJ) transformation, which can deal with both positive and negative data. Owkin scientists investigated applying the YJ transformation in a federated learning setting to see if it was possible to:

Obtain a result identical to the case where all the data is pooled in the same server
Avoid leaking any information on the data from each center
Use an algorithm that could be realistically applied in real-world FL projects

YJ is a transformation governed by a single parameter, lambda. Applying YJ requires fitting lambda by minimizing a negative log-likelihood that depends on lambda and on the data. For the first time, we have proven that the YJ negative log-likelihood is in fact convex. This allows us to optimize it with exponential search, and we have shown that the resulting algorithm (ExpYJ) is more stable than the methods used in state-of-the-art implementations, such as Brent minimization. We have combined this new algorithm with our expertise in Secure Multiparty Computation routines to create SecureFedYJ, a federated algorithm that performs a pooled-equivalent Yeo-Johnson transformation in a distributed setup.

*Secure Multiparty Computation ensures that data is safely normalized while avoiding potential exposure to privacy attacks*
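
The fitting step described above can be sketched on pooled data as follows. This is a simplified, centralized illustration and not the published ExpYJ or SecureFedYJ algorithm: it uses the standard Yeo-Johnson definition and its negative log-likelihood (up to an additive constant), and estimates the sign of the slope with a finite difference, which is an assumption of this sketch. All function names and the sample data are hypothetical.

```python
import math

def yeo_johnson(x, lam):
    """Standard Yeo-Johnson transform of a single value x."""
    if x >= 0:
        if lam == 0:
            return math.log1p(x)
        # ((x + 1)**lam - 1) / lam, written with expm1/log1p for stability.
        return math.expm1(lam * math.log1p(x)) / lam
    if lam == 2:
        return -math.log1p(-x)
    return -math.expm1((2.0 - lam) * math.log1p(-x)) / (2.0 - lam)

def negative_log_likelihood(data, lam):
    """YJ negative log-likelihood, up to an additive constant."""
    n = len(data)
    transformed = [yeo_johnson(x, lam) for x in data]
    mu = sum(transformed) / n
    variance = sum((t - mu) ** 2 for t in transformed) / n
    signed_logs = sum(math.copysign(math.log1p(abs(x)), x) for x in data)
    return 0.5 * n * math.log(variance) - (lam - 1.0) * signed_logs

def finite_difference(data, lam, h=1e-5):
    """Has the same sign as the slope of the objective at lam."""
    return (negative_log_likelihood(data, lam + h)
            - negative_log_likelihood(data, lam - h))

def fit_lambda(data, tol=1e-6, max_step=32.0):
    """Bracket the minimizer by doubling steps, then bisect on the slope sign."""
    direction = -1.0 if finite_difference(data, 0.0) > 0 else 1.0
    inner, step = 0.0, 1.0
    # Exponential phase: move outwards until the slope changes sign.
    while direction * finite_difference(data, direction * step) < 0:
        inner, step = step, 2.0 * step
        if step > max_step:
            raise ValueError("no minimum found within the search range")
    lo, hi = sorted((direction * inner, direction * step))
    # Bisection phase: convexity means a single sign change inside [lo, hi].
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if finite_difference(data, mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Made-up, right-skewed measurements.
skewed = [0.1, 0.3, 0.5, 0.8, 1.2, 2.0, 3.5, 6.0, 11.0, 25.0]
lam_hat = fit_lambda(skewed)
gaussianized = [yeo_johnson(x, lam_hat) for x in skewed]
```

Because the search only ever needs the sign of the slope at each candidate lambda, very little information has to be revealed per step, which is what makes this style of search a good fit for a secure federated setting.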

This approach reliably normalizes data features across different centers, while preserving confidentiality by hiding the intermediate quantities exchanged in the process. Experiments on real healthcare data demonstrate that SecureFedYJ delivers the same quality improvements as when the data is pooled on a central server.
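
The idea of hiding intermediate quantities can be illustrated with a toy version of additive masking, a common building block of secure aggregation. This is a single-process simulation under strong assumptions, not the protocol used in SecureFedYJ: in a real deployment each pair of centers would derive its mask from a shared secret and use a cryptographically secure random source, whereas here one seeded generator stands in for all of them. `MODULUS`, `pairwise_masks` and `secure_sum` are hypothetical names.

```python
import random

MODULUS = 2**61 - 1  # all arithmetic is done modulo this value

def pairwise_masks(num_centers, rng):
    """One mask per center; the masks cancel out when summed."""
    masks = [0] * num_centers
    for i in range(num_centers):
        for j in range(i + 1, num_centers):
            r = rng.randrange(MODULUS)
            masks[i] = (masks[i] + r) % MODULUS  # center i adds r
            masks[j] = (masks[j] - r) % MODULUS  # center j subtracts r
    return masks

def secure_sum(local_values, rng):
    """Return the total and the masked values the aggregator would see."""
    masks = pairwise_masks(len(local_values), rng)
    masked = [(v + m) % MODULUS for v, m in zip(local_values, masks)]
    return sum(masked) % MODULUS, masked

# Made-up integer quantities held by three centers (e.g. local counts).
counts = [120, 45, 310]
total, masked = secure_sum(counts, random.Random(0))
```

The aggregator only ever sees the masked values, which look random individually, yet their sum modulo `MODULUS` equals the true total of 475.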

A new approach to normalizing data for federated learning

This novel discovery has been consolidated in a research paper accepted at NeurIPS 2022: “SecureFedYJ: a safe feature Gaussianization protocol for Federated Learning.”

Clinical tabular data is just the beginning. Owkin scientists are actively investigating how to securely normalize other data modalities, such as genomics and transcriptomics, in federated learning settings to power the future of privacy-preserving healthcare research.

We would like to thank all participants in the SecureFedYJ project for their contributions: