Owkin Connect Product Guide

What is Owkin Connect?

Owkin Connect is our proprietary Federated Learning (FL) software that powers collaborations between hospitals, research centers, technology partners and life science companies in a privacy-preserved way. Our distributed architecture and federated learning capabilities allow data scientists to securely connect to decentralised, multi-party datasets and train AI models without having to pool data.

Owkin Connect is based on Substra, an open-source software framework developed by Owkin for orchestrating distributed machine learning tasks in a secure way. Substra is based on the distributed ledger technology Hyperledger Fabric. This forms the heart of Owkin’s fully transparent and non-forgeable traceability platform.


The 3 Components of Owkin Connect

Users can interact with Owkin Connect in three different ways:

  • A Python Library
  • A Command Line Interface
  • A Graphical User Interface

1. The Owkin Connect Python Library
Ideal for Data Scientists, Data Engineers

Owkin Connect’s Python library enables users to launch federated computations, programmatically register and manage assets (datasets, algorithms, metrics, etc.) on the platform, and analyze results. Any Python code and its dependencies can run on the platform in the form of a self-contained Docker container, so data scientists can use any machine learning library (TensorFlow, PyTorch, scikit-learn, etc.).

The Owkin Connect library is shipped with extensive documentation and tutorials.

2. The Command Line Interface (CLI)
Ideal for IT Administrators

Owkin Connect can also be used through a Command Line Interface (CLI). This is particularly useful for monitoring training tasks or listing the assets registered on the platform. The CLI and the Python library provide the same functionality, so users are free to choose the tool with which they feel most comfortable.


3. The Graphical User Interface (GUI)
Ideal for Project Managers, Data Scientists, Data Engineers

Owkin Connect’s Graphical User Interface shows an overview of the network: how many nodes there are and which compute plans are running on them. A compute plan is a set of training, aggregation and testing tasks gathered together to build one or more final models. Users can follow the progress of the compute plans. Once model training is complete, they can review the history of the compute plan step by step and visualise the performance of their models.

Through the interface, the traceability features of the software let users explore all the assets that have already been registered.


How to use it?

Step 1: Register Data

To produce robust, high-performing predictive models that generalize well to different patient populations and treatment plans, Owkin Connect makes it possible to train on heterogeneous data, including different data modalities such as histology slides and genomic data. Consequently, any type of data can be registered on the platform: tabular data, imaging, histology, -omics, text, videos, etc.

Data needs to be copied into an Owkin Connect node. Connections between any database and Owkin Connect can be set up using Python code.

Data can be registered either using the CLI or the Python library.

Step 2: Harmonize Data

Data harmonization between nodes is a major challenge in federated learning, especially in a blind setting where partners of a network cannot access or visualize the datasets of the other partners. 

If data scientists have remote access to every node individually (i.e. they cannot move the data but can connect to a server where the data is stored), they can look at the data and train local models before training a federated model.

If they don’t have remote access to every node, data harmonization becomes more challenging but remains completely feasible (it has been done in several projects led by Owkin, including MELLODDY). Two things can help:

  • Data template: data scientists know that every dataset follows the same template.
  • Synthetic data: data engineers can produce synthetic data that follow the same template but that can be shared with data scientists.
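The two aids above can be illustrated with a minimal sketch in plain Python. It is not part of Owkin Connect: the `TEMPLATE` columns and the `validate_row` and `synthetic_rows` helpers are hypothetical, and the synthetic values are drawn uniformly at random rather than from any realistic distribution.

```python
import random

# Hypothetical template: column name -> (kind, allowed range or categories).
TEMPLATE = {
    "age": ("int", (18, 90)),
    "tumor_stage": ("category", ["I", "II", "III", "IV"]),
    "biomarker": ("float", (0.0, 10.0)),
}


def validate_row(row, template=TEMPLATE):
    """Return the list of problems found in one record (empty if it conforms)."""
    problems = []
    for column, (kind, spec) in template.items():
        if column not in row:
            problems.append(f"missing column: {column}")
            continue
        value = row[column]
        if kind == "category":
            if value not in spec:
                problems.append(f"{column}: unexpected category {value!r}")
            continue
        expected = int if kind == "int" else (int, float)
        low, high = spec
        # bool is a subclass of int, so reject it explicitly.
        if not isinstance(value, expected) or isinstance(value, bool):
            problems.append(f"{column}: expected {kind}")
        elif not low <= value <= high:
            problems.append(f"{column}: {value} outside [{low}, {high}]")
    extra = set(row) - set(template)
    problems.extend(f"unexpected column: {c}" for c in sorted(extra))
    return problems


def synthetic_rows(n, template=TEMPLATE, seed=0):
    """Generate n shareable records that follow the template."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        row = {}
        for column, (kind, spec) in template.items():
            if kind == "category":
                row[column] = rng.choice(spec)
            elif kind == "int":
                row[column] = rng.randint(*spec)
            else:
                row[column] = rng.uniform(*spec)
        rows.append(row)
    return rows
```

In such a setup, each data engineer would run the validation locally on the real data, while data scientists would develop and debug their code against the synthetic records only.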

Step 3: Deploy Federated Learning Strategies

Users can design their own federated strategies, but Owkin Connect’s Python library is also equipped with ready-to-run federated strategies such as:

  • Federated Averaging: A round consists of performing a predefined number of forward/backward passes on each client, aggregating the updates by computing their mean, and distributing the consensus update to all clients.
  • Cyclic: A round consists of performing a predefined number of forward/backward passes on each client in sequence: the first client trains and sends the resulting model to the second, and so on until the last client.
  • Random Walk: A model is iteratively trained on a random sequence of clients. A round consists of performing a model update on a single random client and then transferring the updated model to all sites.
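To make the difference between the three strategies concrete, the sketch below implements one round of each in plain Python on a toy one-dimensional linear regression. It does not use the Owkin Connect library: `local_update`, the three round functions, `train` and `CLIENTS` are hypothetical names, and a "forward/backward pass" is reduced to a single full-batch gradient step.

```python
import random


def local_update(model, data, steps=5, lr=0.05):
    """Run a few gradient steps of least squares (y ~ w*x + b) on one client."""
    w, b = model
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
        w, b = w - lr * grad_w, b - lr * grad_b
    return (w, b)


def federated_averaging_round(model, clients):
    """Every client trains from the same model; the mean update is applied."""
    updates = []
    for data in clients:
        w, b = local_update(model, data)
        updates.append((w - model[0], b - model[1]))
    mean_w = sum(u[0] for u in updates) / len(updates)
    mean_b = sum(u[1] for u in updates) / len(updates)
    return (model[0] + mean_w, model[1] + mean_b)


def cyclic_round(model, clients):
    """Clients train one after another, each starting from the previous result."""
    for data in clients:
        model = local_update(model, data)
    return model


def random_walk_round(model, clients, rng):
    """A single randomly chosen client trains; its result becomes the shared model."""
    return local_update(model, rng.choice(clients))


def train(round_fn, clients, rounds=600, **kwargs):
    """Repeat one strategy for a fixed number of rounds, starting from zero."""
    model = (0.0, 0.0)
    for _ in range(rounds):
        model = round_fn(model, clients, **kwargs)
    return model


# Toy data: three clients observe y = 2x + 1 over different ranges of x.
CLIENTS = [
    [(x, 2 * x + 1) for x in (-1.0, -0.5, 0.0)],
    [(x, 2 * x + 1) for x in (0.0, 0.5, 1.0)],
    [(x, 2 * x + 1) for x in (1.0, 1.5, 2.0)],
]

if __name__ == "__main__":
    print("federated averaging:", train(federated_averaging_round, CLIENTS))
    print("cyclic:", train(cyclic_round, CLIENTS))
    print("random walk:", train(random_walk_round, CLIENTS, rng=random.Random(0)))
```

On this noise-free toy problem all three strategies approach the same model (w close to 2, b close to 1). On real, heterogeneous data they can behave differently, which is one reason to choose the strategy per project.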

Step 4: Launch & Execute the Compute Plan

The graphical user interface of Owkin Connect lets users track model training steps, with a performance indication at each aggregation. Once model training is complete, users can review the history of the compute plan step by step and compare the performance of the federated model with that of the local individual models.

Users can later download models trained on the platform if all the participants (data and algorithm providers) give permission.
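A compute plan can be pictured as a dependency graph of training, aggregation and testing tasks. The sketch below is only an illustration of that idea using Python's standard `graphlib` module (Python 3.9+); the task naming scheme and the `build_compute_plan` and `execution_order` helpers are assumptions, not the Owkin Connect API.

```python
from graphlib import TopologicalSorter


def build_compute_plan(nodes, rounds):
    """Return {task_id: set of task_ids it depends on} for an averaging-style plan."""
    plan = {}
    previous_aggregate = None
    for r in range(rounds):
        train_ids = []
        for node in nodes:
            task = f"train/{node}/round{r}"
            # After the first round, training starts from the last aggregate.
            plan[task] = {previous_aggregate} if previous_aggregate else set()
            train_ids.append(task)
        aggregate = f"aggregate/round{r}"
        plan[aggregate] = set(train_ids)
        for node in nodes:
            # Each node evaluates the aggregated model on its local test data.
            plan[f"test/{node}/round{r}"] = {aggregate}
        previous_aggregate = aggregate
    return plan


def execution_order(plan):
    """Return one valid order in which the tasks could be executed."""
    return list(TopologicalSorter(plan).static_order())


if __name__ == "__main__":
    for task in execution_order(build_compute_plan(["hospital_a", "hospital_b"], 2)):
        print(task)
```

Reviewing the history of a compute plan "step by step" then amounts to walking this graph in execution order and inspecting the performance recorded at each test task.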

Architecture Overview

From an architecture point of view, Owkin Connect is deployed on standard computing hardware, on a Kubernetes cluster. As such, multiple deployment scenarios are possible, depending on the available infrastructure and the possible approaches to data management.

The application layer is deployed on a Kubernetes cluster for each partner. It can be deployed on-premise or on the cloud (AWS, GCP, Azure, etc). Note that it is possible to consider a hybrid deployment architecture with some nodes on-premise and others on the cloud. 

More precisely, Owkin Connect is made of:

  • A backend (in pink in the figure above) that manages data storage and computation
  • A “Federated Learning orchestrator” and distributed ledger (in blue in the figure above) that are responsible for the orchestration of the distributed tasks within a deployed network. 

When it is possible to get access to data, a data scientist environment is also deployed. It makes it possible to spawn ML workspaces for data scientists.

Key Strengths of Owkin Connect

Privacy

With Owkin Connect, data is never shared: only the algorithm and the model weights travel. Measures to protect the privacy of the algorithms and models include:

  • Traceability of access to algorithms and temporary models
  • Advanced memory management: the algorithm is only sent for a specific task and is deleted from the node immediately after the task has finished (experimental)
  • Use of trusted execution environments & secure enclaves: we are currently reviewing this relatively new technology and planning an integration when the technology matures (in particular regarding compatibility with GPU computations)

Algorithms, including feature engineering and pre-processing, always run on the node where the data is stored. Their code does not have to be shared in human-readable form or made accessible to conventional users. All access to and execution of the code is recorded. However, the privacy of the code relies on the protections applied in each local execution setting (binary obfuscation, appropriate memory management).

Additional privacy for high-risk settings is achieved by encrypting model updates locally (as in MELLODDY, for example).

Traceability

Traceability is at the core of Owkin Connect, to ensure trust between all the partners and enable reproducibility.

To ensure this traceability, all operations on the platform are written to an immutable ledger based on Hyperledger Fabric. This makes some operations non-reversible, such as data registration or model training metadata recording – an encoded trace of these operations will persist unless the platform is reset. This traceability information is non-identifiable when dealing with patient data.

This distributed ledger contains only non-sensitive metadata:

  • anonymous identifiers of assets on the network, including datasets
  • associated permission for using assets, including datasets
  • specification of training, aggregation and evaluation tasks, constituting compute plans. This guarantees traceability and reproducibility.
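To give an intuition for why entries in such a ledger cannot be silently altered, the sketch below chains metadata records together with SHA-256 hashes. It is a deliberately simplified, single-process illustration: it is not Hyperledger Fabric, it has no consensus between nodes, and `MetadataLedger` and the record fields are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(payload):
    """Deterministic SHA-256 digest of a JSON-serialisable payload."""
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()


class MetadataLedger:
    """Append-only list of records, each bound to the hash of its predecessor."""

    def __init__(self):
        self._entries = []

    def append(self, record):
        previous = self._entries[-1]["hash"] if self._entries else GENESIS
        body = {"index": len(self._entries), "previous": previous, "record": record}
        self._entries.append({**body, "hash": _digest(body)})
        return self._entries[-1]["hash"]

    def verify(self):
        """Recompute every hash; any edit to a past entry breaks the chain."""
        previous = GENESIS
        for index, entry in enumerate(self._entries):
            body = {
                "index": entry["index"],
                "previous": entry["previous"],
                "record": entry["record"],
            }
            if (
                entry["index"] != index
                or entry["previous"] != previous
                or entry["hash"] != _digest(body)
            ):
                return False
            previous = entry["hash"]
        return True


if __name__ == "__main__":
    ledger = MetadataLedger()
    # Only non-sensitive metadata is recorded: an opaque key and permissions.
    ledger.append({"asset": "dataset", "key": "a1b2", "permissions": ["org_a"]})
    ledger.append({"asset": "train_task", "key": "c3d4", "inputs": ["a1b2"]})
    print(ledger.verify())
```

In a distributed ledger each participant holds a replica of the chain, so a modified copy would also disagree with the copies held by the other participants.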

Security

Security is in the DNA of Owkin. We are currently preparing for ISO 27001 certification, which will demonstrate that we follow security best practices. In addition, the platform has gone through extensive security audits (penetration testing, source code review) by external companies specialized in cybersecurity. User permissions and roles are of the utmost importance to guarantee the security of the platform and safeguard the highest levels of privacy. Permissions on every asset in Owkin Connect can be granted independently to any user.

Advanced encryption is applied throughout the platform to provide additional security:

  • Model updates: the model updates themselves can be encrypted to ensure that a central aggregator node, needed in some FL strategies, cannot access sensitive data. This technique is called secure aggregation.
  • Data: data is encrypted at rest.
  • Network communications: all network communications are encrypted using standard encryption procedures and libraries. Communications between nodes are of two types:
    • Secure gRPC protocol: for non-sensitive metadata exchange between orchestrators. It is encrypted with TLS.
    • HTTPS REST API: for communication between node backends (models, model updates).
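One common way to realise the secure aggregation mentioned above is pairwise additive masking: each pair of parties shares a random mask that one adds and the other subtracts, so the masks cancel in the sum while each individual update stays hidden from the aggregator. The sketch below illustrates that idea only. It is not Owkin Connect's implementation; all names are hypothetical, Python's `random` module is not cryptographically secure, and a real protocol would also need key agreement and handling of parties that drop out.

```python
import itertools
import random

MODULUS = 2**32  # all arithmetic is done modulo this value
SCALE = 10**4    # fixed-point scale used to turn floats into integers


def encode(update):
    """Convert a list of floats into fixed-point integers modulo MODULUS."""
    return [round(v * SCALE) % MODULUS for v in update]


def make_seeds(parties, rng):
    """One shared seed per pair of parties (assumed to be agreed securely)."""
    return {
        frozenset(pair): rng.getrandbits(64)
        for pair in itertools.combinations(parties, 2)
    }


def pairwise_mask(i, j, length, seeds):
    """Mask that parties i and j can both derive from their shared seed."""
    rng = random.Random(seeds[frozenset((i, j))])
    return [rng.randrange(MODULUS) for _ in range(length)]


def mask_update(i, update, parties, seeds):
    """Party i hides its update by adding or subtracting each pairwise mask."""
    masked = encode(update)
    for j in parties:
        if j == i:
            continue
        mask = pairwise_mask(i, j, len(update), seeds)
        sign = 1 if i < j else -1  # the two members of a pair use opposite signs
        masked = [(m + sign * k) % MODULUS for m, k in zip(masked, mask)]
    return masked


def aggregate(masked_updates):
    """Sum the masked updates; the masks cancel and only the total is revealed."""
    length = len(masked_updates[0])
    totals = [sum(u[k] for u in masked_updates) % MODULUS for k in range(length)]
    # Map back from modular integers to signed fixed-point values.
    return [((t + MODULUS // 2) % MODULUS - MODULUS // 2) / SCALE for t in totals]


if __name__ == "__main__":
    updates = {0: [0.5, -1.0], 1: [1.5, 2.0], 2: [-0.25, 0.75]}
    parties = sorted(updates)
    seeds = make_seeds(parties, random.Random(42))
    masked = [mask_update(i, updates[i], parties, seeds) for i in parties]
    print(aggregate(masked))
```

The aggregator only ever sees the masked vectors, which look random on their own. Dividing the recovered sum by the number of parties gives the averaged update.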

Through our unique technology, including advanced permission management and traceability features, Owkin Connect is a fully GDPR- and HIPAA-compliant solution for running data science projects on anonymised or pseudonymised patient data.

How to deploy Connect in a real setting?

Owkin Connect can be deployed either in the cloud or on-premise. Owkin can take on the complete management of the platform so that partners can simply use it as Software as a Service (SaaS): Owkin manages the setup of the Kubernetes cluster and the deployment of the application layer, and provides support for both layers. Depending on the use case, one or both layers can also be managed by clients. We provide tools to easily deploy the application layer.

Given the variety of use cases, Owkin proposes different deployment and maintenance levels:

  • Managed deployments: Owkin Connect is deployed and managed by Owkin, on the client’s preferred infrastructure.
    • Available on both cloud and on-premise deployments
  • Self-hosted deployments: Owkin Connect is deployed and managed by the client on its preferred infrastructure. Owkin provides deployment tools and support services, and can optionally host and operate a central aggregator node if necessary.
    • Available on both cloud and on-premise deployments

Additional services are available to the client, including premium service level agreements (for support and maintenance), custom integration with other software tools, account management, back-ups and other functions.

Glossary

  • Aggregation: One way to do federated learning is to train models locally on each partner dataset and then aggregate these models by averaging the weights (or the weight updates) of each model.
  • Compute Plan: A compute plan corresponds to a set of training, aggregation and testing tasks gathered together towards building one or multiple final models.
  • Data Template: A data template is a template that can be used to make sure different datasets are compatible with one another and can therefore be used in a federated learning project.
  • Distributed Ledger Technology: A distributed ledger (also called a shared ledger or distributed ledger technology or DLT) is a consensus of replicated, shared, and synchronized digital data spread across multiple sites, countries, or institutions. Unlike with a distributed database, there is no central administrator.
  • Encryption: In cryptography, encryption is the process of encoding information. This process converts the original representation of the information into an alternative form. Only authorized parties can go back from the encrypted form to the original form.
  • Federated Learning: Federated learning (also known as collaborative learning) is a machine learning technique that trains an algorithm across multiple decentralized edge devices or servers holding local data samples, without exchanging them. This approach stands in contrast to traditional centralized machine learning techniques where all the local datasets are uploaded to one server, as well as to more classical decentralized approaches which often assume that local data samples are identically distributed.
  • Machine Learning: Machine learning (ML) is the study of computer algorithms that improve automatically through experience and by the use of data. It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as “training data”, to make predictions or decisions without being explicitly programmed to do so. Machine learning algorithms are used in a wide variety of applications, such as email filtering and computer vision, where it is difficult or unfeasible to develop conventional algorithms to perform the needed tasks.
  • Machine learning algorithm: A procedure that outputs/creates a model based on datasets and optionally other models. More concretely, in most cases, it is made of:
    • one or several model architectures
    • an optimisation algorithm
    • a stopping criterion
    • a loss function
  • Metadata: Metadata summarizes basic information about data, making finding & working with particular instances of data easier. In Owkin Connect, metadata is used to provide information (ownership, permission, etc) about assets (dataset for instance) without revealing the actual data.
  • Nodes: Nodes are standalone computing and storage resources running the Connect code. Nodes can be connected to form a network. Each node belongs to a unique entity/organization. 
  • Secure Aggregation: A technique for computing an aggregate in which no party reveals its update in the clear, even to the aggregator.

About the authors

Camille Marini, PhD – Chief Technology Officer, Owkin

As Chief Technology Officer of Owkin, Camille currently leads the development of a collaborative machine learning platform designed for data scientists and medical experts. Before joining Owkin, Camille was an academic researcher and worked in startups focused on reproducible and traceable AI. She graduated from Mines ParisTech and completed her PhD in applied data science at the Université Pierre et Marie Curie.

Romain Goussault, Product Manager

Romain graduated from Mines ParisTech. Before joining Owkin, he worked as a software developer in Sydney. He then returned to France, first as a research engineer at IFPen and later as a data scientist at Nantes hospital. He joined Owkin as a Product Manager.