A Peek into Synergos – AI Singapore’s Federated Learning System

In a previous post of this series, we touched upon the basics of Federated Learning and its benefits. We also mentioned that AI Singapore is working on building a system to support Federated Learning.

In this post, we will take a closer look at the system that AI Singapore is building. The system is named Synergos. This is a Greek word, from which the English word “Synergy” was derived. It means “to work together” or “to cooperate”, which is the very gist of the vision that Federated Learning promises. We will first talk about Synergos’ architecture and its various key components. After that, we will zoom in on one of the main components, i.e. Federation.

Key Components of Synergos

Synergos is essentially a distributed system, in which different parties work together to train a machine learning model without exposing the data of each individual party. The diagram below shows a single-party view of Synergos’ key components.

Figure: Single-party view of Synergos’ key components.

We will start at the bottom of the diagram and work our way up.

The core of Synergos is its Federation component. This is where the different parties coordinate to train a global model without exposing their data. The Federation component defines an application-level protocol over WebSocket to form a Federated Grid. A Federated Grid is a star-architecture network formed by the different parties, who exchange messages to coordinate model training and inference. We will take a closer look at Federation later.

Compute & Storage acts as an interface to different compute and storage backends. For now, Synergos assumes that the data is managed by a file system and that the compute load is handled by a single CPU node. Support for other storage services and compute frameworks is on the roadmap.

In Synergos, as is typical in machine learning, multiple experiments are run to train multiple models, and one of them is eventually chosen as the model to be deployed into production. Different experiments are usually configured with different training datasets, model types, and/or hyperparameters. Model Lifecycle Management is responsible for tracking the running of multiple experiments to record and compare results. It also serves as a model registry to manage the lifecycle of a federated learning model, including model versioning and stage transitions.

As mentioned earlier, a Federated Grid is where the federated training really happens. In Synergos, this is not a persistent setup. It is typically destroyed when an experiment is finished. To run multiple experiments, Orchestration starts multiple Federated Grids and configures them with different sets of data and hyperparameters. The running of these experiments is then tracked by Model Lifecycle Management. When all experiments are completed and a model is selected to transition to the production stage, Model Serving makes sure the model is up and running and able to receive requests from users, including those who did not contribute data or join the Federated Grid to train the model.

Contribution Calculation and Reward are closely related. One of the main value propositions of Federated Learning is that it enables collaborative model training without the individual parties exposing their training data. But this is a double-edged sword. It also opens the door to “free-riders”, i.e. participants who try to benefit unilaterally by deliberately injecting dummy data into the training process. A contribution and reward mechanism can help identify potential free-riders so that the collective benefit of all participants can be optimised. Contribution Calculation is responsible for evaluating the value of each party’s data, and Reward calculates how much gain a party could receive from the data it has contributed.

In Synergos, although different parties do not expose data to one another, they still need to “register” their data with a data catalog system (external to Synergos). The catalog is accessible to all parties, so that they can identify what data other parties have made available. Meta-data Management acts as the interface to the data catalog system, which exposes a number of APIs for actions such as adding, modifying, and deleting data, registration, and search. Experiments and model artefacts are also registered with the data catalog system.

Finally, the Dashboard provides a one-stop view of all the information generated by the different components, including experiments, their corresponding configurations (e.g. data used and hyperparameters), and their performance. It can also be used to complete some administrative tasks, e.g. starting or stopping an experiment, or changing a model’s stage (such as electing a model to be deployed into production).

Now that we have an overview of the key components inside Synergos and how they interact with one another, let us take a closer look at Federation.  

Zoomed-in View of Federation

Federation is developed on top of PySyft, a Python library for secure and private Deep Learning developed by OpenMined, an open-source community actively promoting the adoption of privacy-preserving AI technologies.

The best way to avoid data privacy violations is to not work with the raw data itself. Instead, we need to find a masked representation of the dataset, one that ensures an individual’s anonymity, but not at the cost of reduced algorithmic coverage. This is because in federated learning, we are not interested in the patterns found in individual sub-samples of reality. What we want is to derive aggregate trends that are generalizable to all parties in the system.

PointerTensor

The main vehicle used by PySyft to make data “private” is its PointerTensor. As its name implies, it creates an abstracted reference pointing to remote datasets. And this reference can be used by a third party to execute computations on the data without actually “seeing” the data.

As an example, suppose jake is a Worker in PySyft. When we send a tensor to jake, we get back a pointer to that tensor, which we call x. All subsequent operations are executed through this pointer, which holds information about data that resides on another machine. In other words, x is a PointerTensor, and it can be used to execute commands remotely on jake’s data. A helpful analogy is that a PointerTensor works like a remote control, i.e. we can use it to turn a TV on or off without physically touching the TV.
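The remote-control idea can be mimicked in a few lines of plain Python. The sketch below is a toy and does not use PySyft: ToyWorker and ToyPointer are hypothetical classes of our own, meant only to show how a caller can hold a reference and trigger computation on data that never leaves its owner.

```python
class ToyWorker:
    """Hypothetical stand-in for a party that keeps its data to itself."""

    def __init__(self, name):
        self.name = name
        self._store = {}  # data stays here, on the worker's side

    def receive(self, values):
        """Store the values locally and hand back only a reference."""
        obj_id = len(self._store)
        self._store[obj_id] = list(values)
        return ToyPointer(self, obj_id)

    def run(self, obj_id, fn):
        """Apply fn to stored values; the result is stored locally too."""
        return self.receive(fn(self._store[obj_id]))


class ToyPointer:
    """Hypothetical reference: knows where the data lives, not what it is."""

    def __init__(self, owner, obj_id):
        self.owner = owner
        self.obj_id = obj_id

    def apply(self, fn):
        """Ask the owner to run fn remotely; returns another pointer."""
        return self.owner.run(self.obj_id, fn)


jake = ToyWorker("jake")
x = jake.receive([1, 2, 3, 4, 5])            # x is a pointer, not the data
y = x.apply(lambda values: [v * 2 for v in values])  # computed on jake's side
```

Both x and y are references only. The doubled values exist solely in jake's private store, which is the property the real PointerTensor is designed to provide.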

Federated Grid

The PointerTensor is a powerful tool for making data “private”. Nevertheless, it sits at such a low level of abstraction that developers must write their own coordination code before it becomes operationally usable. This is where Synergos’ Federation component comes in to help.

The Federation component defines an application-level protocol over WebSocket to form Federated Grids. In Federation, parties who agree to work together form a Federated Grid. A Federated Grid is a star-architecture network in which the different parties exchange messages to complete model training and inference. These messages are exchanged via the WebSocket protocol. The Federation component also exposes a number of REST APIs, which can be used to send commands to the different entities within the Federated Grid, e.g. to start the training, or to destroy the various Workers (explained in the next section) when federated training completes.

Workers and TTP

There are two main roles in a Federated Grid. The first role is the Worker. Each party who contributes data instantiates a Worker. Workers do not expose their data to one another. They communicate only with the TTP, or Trusted Third Party, which sits at the centre of the star architecture and is solely responsible for coordinating the federated learning. The TTP contributes no data, but it holds the “remote control” to the Workers’ data.

Project, Experiment, Run

Figure: Hierarchy of project, experiment, and run in Synergos.

Before we proceed further, let’s understand some naming conventions used in the Federation component. The first concept is the Project. A project defines the common goal that multiple parties are working together to achieve. Under a project, there are multiple experiments, each of which corresponds to one particular type of model to be trained, e.g. logistic regression, neural network, etc. Under each experiment, there are multiple runs, each of which uses a different set of hyperparameters.

Let’s use an example to better understand the relationship among these concepts. Suppose that multiple banks decide to work together to build an anti-money laundering model. This defines a project. Under this project, logistic regression is one type of model to be built, so an experiment is defined to train a logistic regression model. Assuming we are using regularized logistic regression, multiple runs are then defined with different values of the hyperparameter λ.
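The hierarchy in the banking example can be written down as three nested records. This is only a sketch of the relationship: the class names, field names, identifiers, and the λ values are our own illustrative assumptions, not Synergos’ actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Run:
    """One set of hyperparameters tried within an experiment."""
    run_id: str
    hyperparameters: Dict[str, float]


@dataclass
class Experiment:
    """One model type to be trained within a project."""
    expt_id: str
    model_type: str
    runs: List[Run] = field(default_factory=list)


@dataclass
class Project:
    """The common goal the parties are working towards."""
    project_id: str
    goal: str
    experiments: List[Experiment] = field(default_factory=list)


# One project, one experiment, and one run per candidate value of lambda.
lambdas = [0.01, 0.1, 1.0]  # illustrative values only
logreg = Experiment(
    expt_id="logreg",
    model_type="regularized logistic regression",
    runs=[
        Run(run_id=f"run_{i}", hyperparameters={"lambda": lam})
        for i, lam in enumerate(lambdas)
    ],
)
aml = Project(
    project_id="aml",
    goal="anti-money laundering model",
    experiments=[logreg],
)
```

Adding a second model type, such as a neural network, would mean appending another Experiment with its own list of runs to the same project.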

A Federated Grid is set up for each run, and each run has three phases: Registration, Training, and Evaluation. Let’s visit them one by one.

Registration

The Registration phase is for all the parties to register the necessary information. The TTP, being the coordinator of a project, will define the project. It will also define the experiment and run, setting the model type and its corresponding hyperparameters. If a party is interested in working with other parties, its worker will register its participation in the project defined by the TTP. The party also needs to supply its connection information. After a worker has been registered into a project, it is able to declare data tags corresponding to the datasets that it would like to contribute within the project’s context. All this information is stored and managed by the Meta-data Management component.
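To make the registered information concrete, here is a rough sketch of what one Worker’s registration might hold and how datasets could later be looked up by tag. The Registration class, its fields, the host names, the port, and the tag values are all hypothetical. Only the “train” and “evaluate” tag names come from this post.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Registration:
    """What one Worker declares when joining a project (hypothetical shape)."""
    participant_id: str
    host: str   # connection information supplied by the party
    port: int
    tags: Dict[str, List[str]] = field(default_factory=dict)


def datasets_for(registrations, purpose):
    """Collect, per participant, the dataset tags declared for a purpose."""
    return {r.participant_id: r.tags.get(purpose, []) for r in registrations}


registrations = [
    Registration(
        "bank_a", "bank-a.example", 8020,
        {"train": ["2019/q1", "2019/q2"], "evaluate": ["2019/q3"]},
    ),
    Registration(
        "bank_b", "bank-b.example", 8020,
        {"train": ["2019/all"]},
    ),
]

train_sets = datasets_for(registrations, "train")
eval_sets = datasets_for(registrations, "evaluate")
```

A party that declares no tag for a given purpose simply contributes an empty list, so bank_b takes part in training here but not in evaluation.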

Training

In the Training phase, the Federated Grid defined in the Registration phase needs to be up and running before the federated training takes place. Several steps are needed to bring the Federated Grid up.

First, the individual Workers are initialized. Each of them instantiates a PySyft WebsocketServerWorker (WSSW). The TTP uses the connection information supplied by the Workers in the Registration phase to poll their data headers for feature alignment. Feature alignment makes sure that different parties have the same number of features after one-hot encoding is applied to the categorical features, without revealing the Workers’ data. (We will cover the need for feature alignment in another post. Stay tuned!)

The TTP then conducts the feature alignment. The resulting alignments are forwarded to the Workers, which use them to generate datasets that are aligned across all Workers. Each Worker’s aligned dataset is then loaded into its WSSW when the WSSW is instantiated. The WSSW also opens the Worker’s specified ports to listen for incoming WebSocket connections from the TTP.
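One simple way to picture feature alignment is as a union over one-hot column names. The sketch below assumes that only headers leave each Worker, that the TTP takes the sorted union of those headers, and that each Worker pads the columns it lacks with zeros. The function names and the sample rows are hypothetical, and the alignment procedure Synergos actually uses may differ.

```python
def one_hot_header(rows, categorical):
    """Worker side: column names after one-hot encoding (no values)."""
    header = set()
    for row in rows:
        for key, value in row.items():
            header.add(f"{key}={value}" if key in categorical else key)
    return header


def align_headers(headers):
    """TTP side: build one common, ordered column list from headers only."""
    return sorted(set().union(*headers))


def apply_alignment(rows, categorical, aligned):
    """Worker side: encode local rows against the common column list."""
    encoded = []
    for row in rows:
        values = dict.fromkeys(aligned, 0.0)  # missing columns stay at zero
        for key, value in row.items():
            if key in categorical:
                values[f"{key}={value}"] = 1.0
            else:
                values[key] = float(value)
        encoded.append([values[col] for col in aligned])
    return encoded


categorical = {"channel"}
rows_a = [{"amount": 120.0, "channel": "atm"},
          {"amount": 80.0, "channel": "online"}]
rows_b = [{"amount": 45.0, "channel": "branch"}]

aligned = align_headers([one_hot_header(rows_a, categorical),
                         one_hot_header(rows_b, categorical)])
data_a = apply_alignment(rows_a, categorical, aligned)
data_b = apply_alignment(rows_b, categorical, aligned)
```

Without this step, the first Worker would end up with three columns and the second with two, and a single global model could not be trained on both.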

Subsequently, for each Worker, the TTP instantiates a PySyft WebsocketClientWorker (WSCW), which completes the TTP’s WebSocket handshake with the Worker. Once the handshake is complete, the TTP’s WSCW can be used to control the behaviour of the Worker’s WSSW without seeing the Worker’s data. With this, a Federated Grid is established.

Now the federated training starts. The global model architecture is fetched from the experiment definition. Likewise, the registered hyperparameters are fetched from the run definition. Pointers to training data are obtained by searching for all datasets tagged for training (i.e. with the “train” tag). During training, the TTP uses its WSCWs, which are connected to the different Workers’ WSSWs, to coordinate the training, i.e. exchanging losses and gradients between the TTP and the Workers to update the global model’s weights with FedAvg or FedProx.
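For readers unfamiliar with the two algorithms, the aggregation step of FedAvg is a sample-weighted average of the Workers’ model weights, and FedProx adds a proximal penalty to each Worker’s local loss to keep local models close to the global one. The sketch below shows both on plain Python lists. The function names and numbers are our own, and this is not Synergos’ implementation, which operates through PySyft pointers.

```python
def fed_avg(local_weights, num_samples):
    """Average per-Worker weight vectors, weighted by local sample counts."""
    total = sum(num_samples)
    size = len(local_weights[0])
    return [
        sum(w[i] * n for w, n in zip(local_weights, num_samples)) / total
        for i in range(size)
    ]


def fedprox_penalty(local, global_, mu):
    """Proximal term (mu / 2) * ||local - global||^2 added to a local loss."""
    return 0.5 * mu * sum((l - g) ** 2 for l, g in zip(local, global_))


# Two Workers; the second holds three times as much data as the first.
local_weights = [[1.0, 2.0], [3.0, 6.0]]
num_samples = [1, 3]

global_weights = fed_avg(local_weights, num_samples)
penalty = fedprox_penalty(local_weights[0], global_weights, mu=0.5)
```

Because the average is weighted by sample count, the global weights land closer to the second Worker’s weights. Setting mu to zero removes the penalty, so the local objective reduces to the plain FedAvg one.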

Once training is done, the final global and local models are exported. The Federated Grid will also be dismantled. This is done by first destroying all WSCWs, closing all active WebSocket connections. The TTP then uses the connection information provided by the Workers once more to send termination commands over to the Workers via the REST API, which destroys their respective WSSWs and reclaims resources. Now the Federated Grid is dismantled, and a run completes.
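The order of set-up and tear-down described above maps naturally onto a context manager. The sketch below only records the sequence of steps in a log: no real WSSW, WSCW, WebSocket, or REST call is made, and the federated_grid function and the Worker names are hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def federated_grid(worker_ids, log):
    """Bring a toy grid up, yield its connections, then tear it down."""
    for wid in worker_ids:
        log.append(f"start WSSW on {wid}")        # Worker-side server
    connections = []
    for wid in worker_ids:
        log.append(f"connect WSCW to {wid}")      # TTP-side client
        connections.append(wid)
    try:
        yield connections
    finally:
        # Tear down in the order described: TTP-side clients first,
        # then the Worker-side servers.
        for wid in reversed(connections):
            log.append(f"destroy WSCW to {wid}")
        for wid in worker_ids:
            log.append(f"terminate WSSW on {wid}")


log = []
with federated_grid(["bank_a", "bank_b"], log) as grid:
    log.append(f"train on {len(grid)} workers")
```

Wrapping the tear-down in try/finally means the grid is still dismantled if training raises an error, which matters when Workers are holding ports and other resources open.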

Evaluation

In the Evaluation phase, the Federated Grid defined in the Registration phase is recreated with the necessary information stored in the Meta-data Management component. Instead of searching for training datasets, datasets with the “evaluate” tag are sourced from the Federated Grid. The global model is switched to evaluation mode (i.e. no weight updates happen), and is used to obtain inference values across all retrieved data pointers. Once the inference values corresponding to all Workers are obtained, they are stored locally at each Worker. Subsequently, performance metrics are computed locally at each Worker and sent back to the TTP for aggregation and logging purposes.
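The split between local computation and central aggregation can be sketched as follows. Here each Worker reduces its predictions to a pair of counts, and the TTP combines the counts into one accuracy figure. The function names, the choice of accuracy as the metric, and the sample values are assumptions made for illustration.

```python
def local_metrics(predictions, labels):
    """Worker side: summarise local predictions without sharing them."""
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return {"correct": correct, "total": len(labels)}


def aggregate_accuracy(reports):
    """TTP side: combine per-Worker counts into one accuracy figure."""
    total = sum(r["total"] for r in reports)
    return sum(r["correct"] for r in reports) / total


reports = [
    local_metrics([1, 0, 1, 1], [1, 0, 0, 1]),  # first Worker: 3 of 4 correct
    local_metrics([0, 0], [0, 1]),              # second Worker: 1 of 2 correct
]
overall = aggregate_accuracy(reports)  # 4 correct out of 6 in total
```

Only the counts travel to the TTP. The predictions and labels themselves stay with the Worker that produced them.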

After all this has been completed, the Federated Grid is dismantled again with the same mechanism as described in the training phase.

We hope that by now you have a good understanding of the various key components of Synergos and how the Federation component works. We are currently running an invited preview of Synergos to get early feedback on the development. If you are interested, please send us an email at synergos-ext@aisingapore.org with a description of the use case you have in mind.

In the subsequent articles in this series, we will continue to see how Synergos handles some key technical challenges in Federated Learning, e.g. non-IID data. We will also present some use cases developed with Synergos. Stay tuned.
