Owkin Connect product guide
How does Owkin Connect work?
Owkin Connect is our proprietary federated learning (FL) software that powers collaborations between hospitals, research centers, technology partners and biopharma companies in a privacy-preserving way. Its distributed architecture and federated learning capabilities allow data scientists to securely connect to decentralized, multi-party datasets and train AI models without pooling the data. Owkin Connect is based on Substra, an open-source software framework that we developed for orchestrating distributed machine learning tasks securely. Substra is based on the distributed ledger technology Hyperledger Fabric. This forms the heart of Owkin’s fully transparent and non-forgeable traceability platform.
Discover how data scientists can train their machine learning models on decentralized data using Owkin Connect.
The three components of Owkin Connect
01. Ideal for Data Scientists and Data Engineers:
The Owkin Connect Python library
Owkin Connect’s Python library enables users to launch federated computations, programmatically register and manage assets (datasets, algorithms, metrics, etc.) on the platform, and analyze results. Any Python code and its dependencies can run on the platform in the form of a self-contained Docker container, so data scientists can use any machine learning library (TensorFlow, PyTorch, scikit-learn, etc.).


The Owkin Connect library ships with extensive documentation and tutorials.
02. Ideal for IT admins:
The Command Line Interface
Owkin Connect can also be used through a Command Line Interface (CLI). This is particularly useful for monitoring training tasks or listing the assets registered on the platform. The CLI and the Python library provide the same functionality, so users are free to use whichever tool they are most comfortable with.

The Owkin Connect Command Line Interface (CLI).
03. Ideal for Project Manager, Data Scientist, Data Engineer:
The Graphical User Interface
Owkin Connect’s Graphical User Interface shows an overview of the network: how many nodes there are and which compute plans are running on them. A compute plan is a set of training, aggregation and testing tasks gathered together to build one or more final models. Users can follow the progress of compute plans. Once model training is complete, they can review the history of the compute plan step by step and visualize the performance of their models. Through the interface, the traceability features of the software let users explore all the assets that have already been registered.

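To make the notion of a compute plan more concrete, here is a minimal sketch in plain Python. It does not use the Owkin Connect library: the `Task` and `ComputePlan` classes, their fields and the node names are hypothetical, and only illustrate how training, aggregation and testing tasks could be gathered together and ordered by their dependencies.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass(frozen=True)
class Task:
    """One unit of work in a (hypothetical) compute plan."""
    task_id: str
    kind: str               # "train", "aggregate" or "test"
    node: str               # node on which the task runs
    depends_on: tuple = ()  # ids of tasks that must finish first


@dataclass
class ComputePlan:
    """A set of tasks gathered towards building a final model."""
    tasks: dict = field(default_factory=dict)

    def add(self, task: Task) -> None:
        self.tasks[task.task_id] = task

    def execution_order(self) -> list:
        """Return task ids in an order that respects dependencies."""
        graph = {t.task_id: set(t.depends_on) for t in self.tasks.values()}
        return list(TopologicalSorter(graph).static_order())


# One federated round: two local trainings, one aggregation, one test.
plan = ComputePlan()
plan.add(Task("train_a", "train", "hospital_a"))
plan.add(Task("train_b", "train", "hospital_b"))
plan.add(Task("agg_1", "aggregate", "aggregator", ("train_a", "train_b")))
plan.add(Task("test_a", "test", "hospital_a", ("agg_1",)))

order = plan.execution_order()
```

In this sketch the two training tasks may run in either order, but the aggregation always comes after both of them and the test always comes last.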
A how-to-guide to Owkin Connect
Step 01:
Register data
To be robust and performant, predictive models must generalize well to different patient populations, treatment plans, and data modalities such as histology slides and genomic data. Owkin Connect therefore makes it possible to train models on heterogeneous data, and any type of data can be registered on the platform: tabular data, imaging, histology, -omics, text, videos, etc. Data needs to be copied into an Owkin Connect node. Connections between any database and Owkin Connect can be set up using Python code.


Data can be registered using either the CLI or the Python library.
Step 02:
Harmonize data
Data harmonization between nodes is a major challenge in federated learning, especially in a blind setting where the partners of a network cannot access or visualize each other’s datasets. If data scientists have remote access to every node individually (i.e. they cannot move the data but can connect to a server where the data is stored), they can inspect the data and train local models before training a federated model. If they do not have remote access to every node, data harmonization is more challenging but still entirely feasible (it has been done in several projects led by Owkin, including MELLODDY). One thing that helps is a data template, which assures data scientists that every dataset follows the same structure. Data engineers can also produce synthetic data that follows the same template and share it with data scientists.
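As an illustration of the data template idea, the sketch below checks records against a shared template and generates synthetic records that follow it. Everything here is an assumption made for the example: the column names, the `TEMPLATE` dictionary and the `validate` and `synthetic_record` helpers are hypothetical and are not part of Owkin Connect.

```python
import random

# Hypothetical template shared by all partners: column name -> expected type.
TEMPLATE = {"patient_id": str, "age": int, "tumor_size_mm": float}


def validate(record: dict, template: dict = TEMPLATE) -> list:
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for column, expected in template.items():
        if column not in record:
            problems.append(f"missing column: {column}")
        elif not isinstance(record[column], expected):
            problems.append(f"{column}: expected {expected.__name__}")
    for column in record:
        if column not in template:
            problems.append(f"unexpected column: {column}")
    return problems


def synthetic_record(rng: random.Random) -> dict:
    """Generate a fake record that follows the template (no real patient data)."""
    return {
        "patient_id": f"synthetic-{rng.randrange(10**6):06d}",
        "age": rng.randint(18, 90),
        "tumor_size_mm": round(rng.uniform(1.0, 80.0), 1),
    }


rng = random.Random(0)
sample = synthetic_record(rng)
issues_ok = validate(sample)
issues_bad = validate({"patient_id": "x", "age": "42"})
```

Each partner could run such a check locally before registering data, while data scientists develop their code against the synthetic records only.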
Step 03:
Deploy federated learning strategies
Users can easily design their own federated strategies, and Owkin Connect’s Python library is equipped with ready-to-run federated strategies such as:
Federated Averaging
A round consists of performing a predefined number of forward/backward passes on each client, aggregating the updates by computing their mean, and distributing the consensus update to all clients.
Cyclic
A round consists of performing a predefined number of forward/backward passes on each client sequentially: the first client trains and sends the resulting model to the second, and so on until the last client.
Random Walk
A model is iteratively trained on a random sequence of clients. A round consists of performing a model update on a single random client and then transferring the updated model to all sites.
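The three strategies above can be sketched on a toy problem. This is not the Owkin Connect implementation or its API: it is a minimal plain-Python illustration in which the model is a single weight `w` for `y = w * x`, the function names are hypothetical, and Federated Averaging uses an unweighted mean of the client updates.

```python
import random


def local_update(w: float, data: list, steps: int = 5, lr: float = 0.01) -> float:
    """Run a few gradient steps of least squares for the model y = w * x."""
    for _ in range(steps):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_averaging_round(w: float, clients: list) -> float:
    """Every client trains from the same weights; the updates are averaged."""
    updates = [local_update(w, data) - w for data in clients]
    return w + sum(updates) / len(updates)


def cyclic_round(w: float, clients: list) -> float:
    """Clients train one after the other, each passing its model to the next."""
    for data in clients:
        w = local_update(w, data)
    return w


def random_walk_round(w: float, clients: list, rng: random.Random) -> float:
    """A single randomly chosen client updates the model."""
    return local_update(w, rng.choice(clients))


# Three toy clients whose local data all follow y = 3 * x.
clients = [
    [(1.0, 3.0), (2.0, 6.0)],
    [(0.5, 1.5), (1.5, 4.5)],
    [(3.0, 9.0)],
]

rng = random.Random(0)
w_avg = w_cyc = w_rw = 0.0
for _ in range(200):
    w_avg = federated_averaging_round(w_avg, clients)
    w_cyc = cyclic_round(w_cyc, clients)
    w_rw = random_walk_round(w_rw, clients, rng)
```

On this toy data all three strategies recover a weight close to 3. With real, heterogeneous datasets, the strategies can behave quite differently, which is one reason to compare them.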
Step 04:
Launch and execute the compute plan
The graphical user interface of Owkin Connect allows you to track model training steps, with performance indicators at each aggregation. Once a model training is complete, users can review the history of the compute plan, step by step, as well as visualize the performance of the federated model vs local individual models. Users can later download models trained on the platform if all the participants (data and algorithm providers) give permission.
Step 05:
Architecture overview

From an architecture point of view, Owkin Connect is deployed on standard computing hardware, on a Kubernetes cluster. As such, multiple deployment scenarios are possible, depending on the available infrastructure and the possible approaches to data management.
The application layer is deployed on a Kubernetes cluster for each partner, either on-premise or in the cloud (AWS, GCP, Azure, etc.). A hybrid deployment, with some nodes on-premise and others in the cloud, is also possible.
More precisely, Owkin Connect is made of:
A backend (in pink in the figure above) that manages data storage and computation.
A “federated learning orchestrator” and a distributed ledger (in blue in the figure above) that are responsible for orchestrating the distributed tasks within a deployed network.
When it is possible to get access to data, a data scientist environment is also deployed. It makes it possible to spawn ML workspaces for data scientists.
Core strengths
Core strength 01:
Privacy
With Owkin Connect, data is never shared; only the algorithm and the model weights travel. Measures to protect the privacy of the algorithms and models include:
Traceability of access to algorithms and temporary models
Advanced memory management: the algorithm is only sent for a specific task and is deleted from the node immediately after the task has finished (experimental)
Use of trusted execution environments & secure enclaves: we are currently reviewing this relatively new technology and planning an integration when the technology matures (in particular regarding compatibility with GPU computations)
Algorithms, including feature engineering and pre-processing, always run on the node where the data is stored. Their code does not have to be shared in human-readable form or made accessible to conventional users. All access to and execution of the code is recorded. However, the privacy of the code is preserved only by the safeguards of the local execution setting (binary obfuscation, appropriate memory management).
In high-risk settings, additional privacy is achieved by also encrypting model updates locally (as in MELLODDY, for example).
Core strength 02:
Traceability
Traceability is at the core of Owkin Connect, to ensure trust between all the partners and enable reproducibility. To ensure this traceability, all operations on the platform are written to an immutable ledger based on Hyperledger Fabric. This makes some operations non-reversible, such as data registration or model training metadata recording – an encoded trace of these operations will persist unless the platform is reset. This traceability information is non-identifiable when dealing with patient data. This distributed ledger contains only non-sensitive metadata:
anonymous identifiers of assets on the network, including datasets
associated permissions for using assets, including datasets
specification of the training, aggregation and evaluation tasks that constitute compute plans. This guarantees traceability and reproducibility.
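The idea of an immutable record of non-sensitive metadata can be illustrated with a hash-chained, append-only log. This sketch is only an analogy written with the Python standard library: it does not use Hyperledger Fabric or any Owkin Connect API, and the `MetadataLedger` class and the asset identifiers are hypothetical. A real distributed ledger also replicates entries across organizations and relies on a consensus protocol, which is not shown here.

```python
import hashlib
import json


def _digest(payload: dict, prev_hash: str) -> str:
    """Hash an entry's metadata together with the previous entry's hash."""
    body = json.dumps(payload, sort_keys=True) + prev_hash
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class MetadataLedger:
    """Append-only log; each entry commits to the whole history before it."""

    def __init__(self) -> None:
        self.entries = []

    def append(self, payload: dict) -> str:
        prev_hash = self.entries[-1]["hash"] if self.entries else ""
        entry_hash = _digest(payload, prev_hash)
        self.entries.append(
            {"payload": payload, "prev": prev_hash, "hash": entry_hash}
        )
        return entry_hash

    def verify(self) -> bool:
        """Recompute every hash; any edited entry breaks the chain."""
        prev_hash = ""
        for entry in self.entries:
            expected = _digest(entry["payload"], prev_hash)
            if entry["prev"] != prev_hash or entry["hash"] != expected:
                return False
            prev_hash = entry["hash"]
        return True


ledger = MetadataLedger()
# Only anonymous identifiers, permissions and task specifications are stored.
ledger.append({"op": "register_dataset", "asset": "dataset-7f3a", "permissions": ["org_a"]})
ledger.append({"op": "train", "asset": "task-01", "inputs": ["dataset-7f3a"]})
intact = ledger.verify()

# Tampering with recorded metadata after the fact is detected.
ledger.entries[0]["payload"]["permissions"].append("org_b")
tampered_detected = not ledger.verify()
```

Because each hash depends on the previous one, changing any earlier record invalidates the chain, which is what makes the history auditable.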
Core strength 03:
Security
Security is in the DNA of Owkin. Owkin holds the ISO 27001 certification, which demonstrates that we follow security best practices. In particular, the platform has undergone extensive security audits (penetration testing, source code review) by external companies specialized in cybersecurity. User permissions and roles are of the utmost importance to guarantee the security of the platform and safeguard the highest levels of privacy. Access to every asset in Owkin Connect can be granted independently to any user.
Advanced encryption is applied throughout the platform to provide additional security:
Model updates: model updates themselves can be encrypted to ensure that a central aggregator node, needed in some FL strategies, cannot access sensitive data. This technique is called secure aggregation.
Data: data is encrypted at rest.
Network communications: all network communications are encrypted using standard encryption procedures and libraries. Communications between nodes are of two types:
Secure gRPC protocol: for non-sensitive metadata exchange between orchestrators. It is encrypted with TLS.
HTTPS REST API: for communication between node backends (models, model updates).
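The secure aggregation idea mentioned in the list above can be illustrated with pairwise additive masking, one common approach to the problem. The document does not state which scheme Owkin Connect uses, so this is only a sketch under that assumption, with hypothetical function names. In practice each pair of clients would derive its shared mask from a key exchange rather than from a single seeded random generator.

```python
import random


def pairwise_masks(n_clients: int, dim: int, seed: int = 0) -> list:
    """Build one mask per client such that all masks sum to zero."""
    # Assumption for the demo: one seeded generator stands in for the
    # pairwise secrets that clients would normally agree on privately.
    rng = random.Random(seed)
    masks = [[0.0] * dim for _ in range(n_clients)]
    for i in range(n_clients):
        for j in range(i + 1, n_clients):
            shared = [rng.uniform(-100.0, 100.0) for _ in range(dim)]
            for k in range(dim):
                masks[i][k] += shared[k]  # client i adds the shared mask
                masks[j][k] -= shared[k]  # client j subtracts it
    return masks


def aggregate(masked_updates: list) -> list:
    """The aggregator only sees masked updates and returns their mean."""
    n = len(masked_updates)
    return [sum(column) / n for column in zip(*masked_updates)]


# Toy model updates from three clients (two parameters each).
updates = [[0.1, -0.2], [0.3, 0.4], [-0.1, 0.1]]
masks = pairwise_masks(len(updates), dim=2)
masked = [
    [u + m for u, m in zip(update, mask)]
    for update, mask in zip(updates, masks)
]
mean_update = aggregate(masked)
```

The masks cancel out in the sum, so the aggregator obtains the mean of the true updates (up to floating-point rounding) without seeing any individual update in the clear.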
Through our unique technology, including advanced permission management and traceability features, Owkin Connect is a fully GDPR- and HIPAA-compliant solution for running data science projects on anonymized or pseudonymized patient data.
How to deploy Connect in a real-world setting?
Owkin Connect can be deployed either in the cloud or on-premise. We can take care of the complete management of the platform so that partners can simply use it as software as a service (SaaS). In that case, we manage the setup of the Kubernetes cluster and the deployment of the application layer, as well as the support for both layers. Depending on the use case, one or both layers can instead be managed by the client. We provide tools to easily deploy the application layer.


Given the variety of use cases, Owkin offers different deployment and maintenance levels:
Managed deployments
Owkin Connect is deployed and managed by Owkin, on the client’s preferred infrastructure. Available on both cloud and on-premise deployments.
Self-hosted deployments
Owkin Connect is deployed and managed by the client on its preferred infrastructure. Owkin provides support services and tools to easily deploy Connect, and can optionally host and operate a central aggregator node if necessary. Available on both cloud and on-premise deployments.
Additional services are available to the client, including premium service level agreements (for support and maintenance), custom integration with other software tools, account management, back-ups and other functions.
Glossary
Aggregation
Compute plan
Data Template
Distributed ledger technology
Encryption
Federated learning
Machine learning
Machine learning algorithm
Metadata
Nodes
Secure Aggregation