At Owkin, our mission is to fuse AI and clinical research to unlock medical discoveries. In doing this, we connect medical researchers, biopharma companies, and data scientists in a collaborative, privacy-preserving ecosystem.
Access to large-scale, meaningful medical data is a significant challenge in healthcare. As a result, Owkin has dedicated three years of R&D to develop Owkin Connect (‘Connect’). This innovative software is our proprietary federated learning framework that ‘connects’ multiple data sources. It enables distributed machine learning training without aggregating or collecting private data.
In January 2020, Connect enabled the first-ever federated deep learning model (‘Model’) trained on distributed histology images stored behind hospital firewalls. This model was a proof-of-concept for Connect. Two datasets and more than 40 talented and dedicated people from Owkin, Centre Léon Bérard (‘CLB’), and Institut Curie contributed to this success.
The path to this achievement was not straightforward, but it was full of discoveries and valuable lessons. This proof-of-concept for federated learning via Owkin Connect is a remarkable technical triumph. It is also an example of outstanding collaboration and an empowering conclusion to our journey.
In this article, we outline a roadmap for how we got here and what we learned along the way.
Step 1: Building the consortium
To implement federated learning, we needed to identify two or more medical centers with curated datasets of similar data types. Owkin therefore turned to its internationally renowned partners CLB and Institut Curie, both part of UNICANCER, a federation of French medical centers that are pioneering cancer research and improving patient care.
Together, Owkin, CLB, and Institut Curie decided to investigate breast cancer because of its high incidence, focusing specifically on triple-negative breast cancer. The prognosis for this disease is poor and typical treatments are of limited efficacy. This project was driven by visionary leaders: Thierry Durand, Director of IT at CLB, and Alain Livartowski, MD, Deputy Head of the Data Department at Institut Curie. In addition, the two co-principal investigators, Dr. Pierre-Etienne Heudel, an oncologist at CLB, and Dr. Guillaume Bataillon, a pathologist at Institut Curie, orchestrated the gathering of data from hundreds of breast cancer patients. All four leaders brought expertise and commitment, which were essential to the project.
This initial partnership formed the basis of Healthchain, a public-private consortium with a €10M budget funded by Banque Publique d’Investissement. Its goal was to develop the federated learning framework (Owkin Connect) and to train predictive models in oncology (breast cancer and melanoma) and fertility. The consortium gathers seven public partners (CLB, Institut Curie, CHU Nantes, AP-HP, Université Paris Descartes, Ecole Polytechnique, and the French National Center for Scientific Research) and three private partners (Owkin, Substra Foundation, and Apricity).
Step 2: Project management
Challenges arose in managing the federated learning breast cancer project because of the multiple stakeholders involved and the novelty of the machine learning framework. Close collaboration and strong coordination were essential. At Owkin alone, the project required the expertise of more than 25 people from almost ten different teams, mostly Tech, IT Operations, Data Science, Research, Product, Partnerships, Legal, and Operations. At CLB and Institut Curie, we interacted with Medical, Valorisation & Innovation, Legal, and IT teams. Additionally, we relied on the Healthchain data engineers, Clément Joly at CLB and Armand Léopold at Institut Curie, for internal coordination. Project management requirements included: an interdisciplinary and precise understanding of collaborator workflows and needs; clearly defined objectives; stringent road-mapping; anticipation of risks and bottlenecks; agile decision-making; and efficient communication.
Step 3: Federated Learning Model Security
Startups are familiar with the concept of a Minimum Viable Product (MVP). This means building a quick and dirty, yet functional, early version of the product to collect user feedback with little effort before scaling up production. However, in this case, we could not afford to give our partners an MVP. Any mistake in the code could have severe consequences. As with any software deployed within a hospital’s infrastructure, a security breach could endanger the hospital’s whole IT system and affect all activities, even patient care. Additionally, Connect trains algorithms on patient data and shares those algorithms between centers. A data leak could affect privacy, even though the patient data is pseudonymized. Hence, we had to strengthen every security aspect of Connect before deploying it at CLB and Institut Curie.
In October 2019, we ran a thorough risk analysis and a security audit, which did not identify any severe security weaknesses. In November 2019, we were proud to present our security protocols to the Healthchain partners. We then received approval from the Director of IT, Thierry Durand, and the Data Protection Officers, Franck Mestre at CLB and Astrid Lang at Institut Curie, to deploy Connect behind their firewalls and run federated learning model training on their data.
Step 4: Contractual support
When delivering new cutting-edge technology, an organization must be innovative in every step of the process and reinvent everything, even the contracts. Because federated learning is an emerging technology, few reference contracts exist. As a result, Owkin had to design specific tripartite agreements between Owkin, CLB, and Institut Curie. These contracts define how models are trained in a federated learning context. They state the expertise each party brings and how the intellectual property and potential revenues will be shared.
Step 5: Connect Software development
Designing, prototyping, and developing Connect required considerable effort and interdisciplinary skills. This effort mobilized a full team of 10 people, including designers and frontend and backend software engineers, for almost two years. The Connect framework ensures that algorithms train on distributed sensitive datasets that remain within the infrastructure of the medical centers that generated the data. As a result, only the models and non-sensitive metadata are shared between Owkin and its partners, over a secure network. Permission settings and a distributed ledger (which orchestrates the computations and ensures that they are traced and authentic) guarantee compliance with data governance requirements and privacy-preserving operations. A significant recognition of these cutting-edge privacy-preserving and security standards came with the open-sourcing of Connect’s core code in October 2019, under the name Substra.
This open-sourcing was in line with Owkin’s commitment to full transparency with its partners and to the quality and security of the Connect backbone code. The core code repository, Substra, is hosted by Substra Foundation, which actively promotes responsible and trustworthy data science across different sectors.
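To make the idea of traced and authentic computations more concrete, here is a minimal sketch of a hash-chained log in Python. It is an illustration only and not how Connect or Substra implements its ledger: the record fields, the function names (`append_entry`, `verify`), and the center names are all hypothetical, and a real distributed ledger also replicates entries across parties. The point it shows is that only hashes and non-sensitive metadata are recorded, and that altering any past entry is detectable.

```python
import hashlib
import json


def record_hash(record):
    """Deterministic SHA-256 hash of a ledger record (a plain dict)."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_entry(ledger, task, center, model_bytes):
    """Append a task record that links to the previous entry.

    Only a hash of the model and non-sensitive metadata are stored;
    no patient data is written to the ledger.
    """
    entry = {
        "index": len(ledger),
        "task": task,
        "center": center,
        "model_sha256": hashlib.sha256(model_bytes).hexdigest(),
        "prev_hash": ledger[-1]["hash"] if ledger else "0" * 64,
    }
    entry["hash"] = record_hash(entry)  # hash computed before the key is added
    ledger.append(entry)
    return entry


def verify(ledger):
    """Return True if every entry is intact and correctly chained."""
    prev = "0" * 64
    for entry in ledger:
        body = {k: v for k, v in entry.items() if k != "hash"}
        if entry["prev_hash"] != prev or record_hash(body) != entry["hash"]:
            return False
        prev = entry["hash"]
    return True


# Hypothetical usage with placeholder model bytes.
ledger = []
append_entry(ledger, "train_round_1", "center_a", b"weights-from-a")
append_entry(ledger, "train_round_1", "center_b", b"weights-from-b")
print(verify(ledger))
```

Because each entry embeds the hash of the previous one, changing an earlier record invalidates the chain from that point on, which is what makes the history auditable.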
Step 6: Connect deployment
The first deployment of Connect on CLB and Institut Curie’s infrastructure turned out to be quite complicated. We initially thought we could have a ‘one-size-fits-all’ deployment. It soon became apparent that we had to adapt our setup to match the infrastructure of each partner.
Step 7: Data science
As expected, data science presented several challenges: those related to machine learning on medical data and those inherent to this innovative federated learning technology. Medical researchers and data engineers at CLB and Institut Curie played a pivotal role in overcoming these difficulties by working hand in hand with Owkin data scientists.
The input data used to train the Model consisted of whole slide images (WSIs) of triple-negative breast cancer biopsies from approximately 100 patients at CLB and 240 patients at Institut Curie. The Model’s objective was to predict the response to neoadjuvant chemotherapy. In preparation for machine learning, we enumerated the physically archived slides one by one and made the annotation criteria uniform between both sites. Even within the same institution, two slides can be very heterogeneous, since the source biopsies were collected over several years and methodology can change between practitioners. The team therefore paid particular attention to curating the test database used to evaluate the model performance at each institution. The resulting test dataset was: (i) entirely separate from the dataset used to train the model, (ii) representative of the target patient population, and (iii) accounted for the different data collection tools and methodologies.
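One common way to obtain a test set that satisfies criteria (i) and (ii) is to split at the patient level and stratify by outcome. The sketch below illustrates that idea under stated assumptions; it is not the procedure the team used. The record keys (`slide_id`, `patient_id`, `responder`), the function name `patient_level_split`, and the synthetic data are hypothetical, and criterion (iii) would need additional metadata that is not modeled here.

```python
import random
from collections import defaultdict


def patient_level_split(slides, test_fraction=0.2, seed=0):
    """Split slides into train/test so that no patient appears in both.

    Patients are stratified by label so that both sets keep a similar
    responder rate. Assumes every slide of a patient has the same label.
    """
    by_patient = defaultdict(list)
    for s in slides:
        by_patient[s["patient_id"]].append(s)

    # Group patient ids by their outcome label.
    by_label = defaultdict(list)
    for pid, group in by_patient.items():
        by_label[group[0]["responder"]].append(pid)

    rng = random.Random(seed)
    test_patients = set()
    for label in sorted(by_label):
        pids = sorted(by_label[label])
        rng.shuffle(pids)
        n_test = max(1, round(len(pids) * test_fraction))
        test_patients.update(pids[:n_test])

    train = [s for s in slides if s["patient_id"] not in test_patients]
    test = [s for s in slides if s["patient_id"] in test_patients]
    return train, test


# Synthetic example: 30 patients, 2 slides each, one third responders.
slides = [
    {"slide_id": f"s{p}_{k}", "patient_id": f"p{p}", "responder": p % 3 == 0}
    for p in range(30)
    for k in range(2)
]
train, test = patient_level_split(slides)
print(len(train), len(test))
```

Splitting by patient rather than by slide matters because several slides can come from the same patient, and letting them fall on both sides of the split would leak information into the evaluation.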
Step 8: Federated Learning model
The scientists in the Federated Learning Research team at Owkin developed this federated learning approach. Together with Owkin’s engineering teams, this team has built a robust internal library of state-of-the-art federated and privacy-preserving techniques. The challenge in creating such a library was to ensure the interoperability of the latest research findings in the field with the complex machine learning approaches developed by Owkin data scientists for real-world medical problems across many therapeutic areas. With these tools in hand, Owkin first trained the model separately on each center’s dataset as a reference. The next step was to deploy Connect at each center and train the model collaboratively on both centers’ data in a federated manner.
It became clear that the already tricky medical question of predicting treatment response was even more difficult in a federated learning setting. Further work was needed to improve the performance of the federated learning and data science approaches. The team worked with CLB and Institut Curie to enrich the data by increasing the size of their local imaging datasets and pairing them with clinical data. More elaborate federated learning strategies, enabled by the flexibility of Owkin’s software and tools, are planned for this summer. We look forward to sharing the results in an official announcement later in the year. You will hear from us soon.
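The difference between local training and collaborative training can be illustrated with federated averaging, a widely used federated learning strategy. The sketch below is a toy example under stated assumptions, not the method used in this project: the article does not say which strategy was applied, the model here is a one-variable linear regression rather than a deep network, and the data, center names, and function names (`local_update`, `federated_average`) are hypothetical.

```python
import random


def local_update(weights, data, lr=0.1, epochs=5):
    """Run a few epochs of gradient descent on one center's private data.

    Only the updated weights are returned; `data` never leaves the center.
    """
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum((w * x + b - y) * x for x, y in data) / n
        grad_b = sum((w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def federated_average(updates, sizes):
    """Average the centers' weights, weighted by local dataset size."""
    total = sum(sizes)
    w = sum(u[0] * n for u, n in zip(updates, sizes)) / total
    b = sum(u[1] * n for u, n in zip(updates, sizes)) / total
    return (w, b)


def make_center_data(n, seed):
    """Synthetic stand-in for a center's dataset: y = 2x + 1 plus noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.05)))
    return data


# Two hypothetical centers with different dataset sizes.
centers = {
    "center_a": make_center_data(100, seed=0),
    "center_b": make_center_data(240, seed=1),
}

global_weights = (0.0, 0.0)
for _ in range(50):  # communication rounds
    updates = [local_update(global_weights, d) for d in centers.values()]
    sizes = [len(d) for d in centers.values()]
    global_weights = federated_average(updates, sizes)

print(global_weights)
```

In each round, every center starts from the shared weights, trains on its own data, and sends back only the resulting weights, which are then combined into the next shared model.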
The Healthchain project, which resulted from the “Digital Investments Program for the major challenges of the future” RFP, is supported by Bpifrance. As part of the “Healthchain” project, a consortium coordinated by Owkin (a private company) was established, including the Substra association, Apricity (a private company), the Assistance Publique des Hôpitaux de Paris, the University Hospital Center of Nantes, the Léon Bérard Center, the French National Center for Scientific Research, the École Polytechnique, the Institut Curie and the University of Paris Descartes.
Our team at Owkin is grateful to Centre Léon Bérard and Institut Curie for their enthusiasm and dedication, in particular: Thierry Durand, Pierre-Etienne Heudel, Franck Mestre, Clément Joly, Charles Bongiorno, Julie Struyf, and Jean-Yves Blay at CLB, and Alain Livartowski, Guillaume Bataillon, Astrid Lang, Xosé Fernandez, Julien Guérin, Armand Léopold, Guillaume Arras, Timothé Cynober, Johan Archinard, and You-Heng Ea at Institut Curie.
Owkin thanks all Healthchain partners and remains enthusiastic about the upcoming federated learning projects between CHU Nantes and AP-HP in onco-dermatology. Owkin is excited to bring federated learning to our growing network of top-tier medical research institutions, Owkin Loop.
Special thanks and congratulations to Owkin employees who made this project a great success: Mathieu Galtier, Camille Marini, Anne-Laure Moisson, Samuel Lesuffleur, Inal Djafar, Clément Gautier, Claire Philippe, Guillaume Cisco, Kelvin Moutet, Jérémy Morel, Thibault Robert, Aurélien Gasser, Maël Debon, Gilles Wainrib, Eric Tramel, Meriem Sefta, Etienne Bendjebbar, Amandine Lagorce, David Vallas, Adrian Gonzalez, Jocelyn Dachary, Julien Masson, Charles Maussion, Jean du Terrail.