Scaling Speech Lab Offline

Using Kubernetes and Redis to host and scale a 24/7 cloud AI application while making efficient use of resources

What is Speech Lab?

Speech Lab is a speech recognition engine developed in collaboration with the National University of Singapore (NUS) and Nanyang Technological University (NTU). Its specialized code-switching functionality is able to transcribe multiple languages and dialects in a single conversation. This makes it well-suited for the local context and other communities in Asia where multiple languages are spoken. It can also be customized to suit a variety of industries that require speech-to-text for their specialized domains.

From Prototype to Production

Once the Speech Lab development team had successfully prototyped an offline service running in a Docker container, the next step was for us, the Data Engineering team at AISG, to package the service and deploy it in a production environment. This meant scaling up the service as well as providing a demo platform for the public to access and try.

The original architecture of the prototype service is shown below.

Figure 1: The original implementation of the prototype

As you can see, the pipeline is rudimentary and does not support multiple users.

The team set out to develop a new architecture with the following capabilities.

  • A service that runs 24/7 with minimal supervision.
  • Supports multiple speech-to-text models.
  • Scales up workers and processes jobs in parallel as workloads increase.
  • Notifies the front-end system upon completion of processing.
  • Saves the results to cloud storage.
  • Only utilizes CPUs when a new request is initiated from the front-end.

To realize this, we made use of Kubernetes and Redis.

Kubernetes is a popular open-source platform used to automate the management of applications running on a Linux cluster. Applications deployed on Kubernetes can leverage its built-in features, such as application health checks, restart policies, and more.
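For illustration, the sketch below shows what such built-in features look like when declared with the official Kubernetes Python client. The container name, image, and probe endpoint are assumptions for the example, not Speech Lab's actual configuration.

```python
from kubernetes import client

# A container with a health check: Kubernetes periodically calls the
# endpoint and restarts the container if the probe keeps failing.
container = client.V1Container(
    name="speechlab-api",                                # hypothetical name
    image="registry.example.com/speechlab-api:latest",   # illustrative image
    liveness_probe=client.V1Probe(
        http_get=client.V1HTTPGetAction(path="/healthz", port=8080),
        period_seconds=30,
    ),
)

# The restart policy tells Kubernetes to bring the container back up
# automatically whenever it exits.
pod_spec = client.V1PodSpec(containers=[container], restart_policy="Always")
```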

Redis is an in-memory database. The application leverages the speed of Redis, using it as both a cache and a queuing system.
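A minimal sketch of Redis used as a job queue with the redis-py client is shown below. The key name and payload fields are illustrative, not the actual ones used by Speech Lab.

```python
import json

import redis

# Connect to Redis (host/port are illustrative).
r = redis.Redis(host="redis", port=6379, decode_responses=True)

# Producer side: the front-end enqueues a metadata payload describing a job.
payload = {"audio_uri": "s3://speechlab-audio/sample.wav", "model": "cs-en-zh"}
r.lpush("speechlab:jobs", json.dumps(payload))

# Consumer side: a listener blocks until a job arrives, then hands it off.
_, raw = r.brpop("speechlab:jobs")
job = json.loads(raw)
```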

Our Solution

After numerous discussions within the team, we arrived at the new architecture shown below.

Figure 2: The Kubernetes implementation

The new architecture runs on a cloud Kubernetes cluster configured with auto-scaling: when overall CPU usage rises above a threshold, a new node (VM/machine) is automatically added to the cluster pool, and the number of nodes is automatically scaled down when the workload drops.

Redis serves as the entry point (ingress) for the front-end application. The front-end pushes each audio file into cloud storage and then triggers a job in the cluster by submitting a metadata payload through Redis. A listener periodically reads from Redis and uses the Kubernetes API to create a job to run in the cluster.
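Below is a minimal sketch of the listener's job-creation step, using the official Kubernetes Python client. The namespace, image, payload fields, and resource sizes are assumptions for illustration; note that the CPU/memory requests on each job are also what drive the cluster auto-scaling described above.

```python
from kubernetes import client, config

def create_transcription_job(job: dict) -> None:
    """Create a Kubernetes Job for one queued transcription request."""
    config.load_incluster_config()  # the listener itself runs in the cluster
    container = client.V1Container(
        name="speech-worker",
        image="registry.example.com/speechlab-worker:latest",  # illustrative
        env=[
            client.V1EnvVar(name="AUDIO_URI", value=job["audio_uri"]),
            client.V1EnvVar(name="MODEL_NAME", value=job["model"]),
        ],
        # Resource requests drive both scheduling and cluster auto-scaling:
        # if no node can satisfy them, the pod stays pending and the
        # autoscaler adds a node. Sizes here are assumed, not measured.
        resources=client.V1ResourceRequirements(
            requests={"cpu": "4", "memory": "8Gi"},
        ),
    )
    spec = client.V1JobSpec(
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy="Never")
        ),
        backoff_limit=2,
        ttl_seconds_after_finished=3600,  # clean up finished jobs automatically
    )
    job_body = client.V1Job(
        metadata=client.V1ObjectMeta(generate_name="speechlab-job-"),
        spec=spec,
    )
    client.BatchV1Api().create_namespaced_job(namespace="speechlab", body=job_body)
```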

Each job that runs on the cluster is created for the speech-to-text model requested by the user. The model is selected and loaded from cloud storage, which makes it easy to swap one model for another through the cloud storage user interface. Another reason for storing the models in cloud storage is that new versions are introduced frequently; with this implementation, there is no need to bring the application down to upgrade to a new speech-to-text model.
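A sketch of the model-loading step is below, assuming an S3-compatible object store accessed with boto3 (the post does not name the actual cloud provider); the bucket and object names are hypothetical. Swapping in a new model version then amounts to uploading a new archive to the bucket.

```python
import os
import tarfile

import boto3  # assuming an S3-compatible object store

def fetch_model(model_name: str, dest_dir: str = "/models") -> str:
    """Download and unpack the requested speech-to-text model from cloud storage."""
    s3 = boto3.client("s3")
    os.makedirs(dest_dir, exist_ok=True)
    archive = os.path.join(dest_dir, f"{model_name}.tar.gz")
    # One archive per model; uploading a new archive swaps the model
    # without redeploying the application.
    s3.download_file("speechlab-models", f"{model_name}/latest.tar.gz", archive)
    with tarfile.open(archive) as tar:
        tar.extractall(dest_dir)
    return os.path.join(dest_dir, model_name)
```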

A job only starts executing when there are sufficient resources in the cluster; otherwise, it is put into a pending state until resources become available. After processing completes, a transcript file is created and saved to cloud storage, and a callback mechanism notifies the front-end application that the job has completed.
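The completion step might look like the following sketch, again assuming an S3-compatible store and an HTTP callback; the callback_url field and the bucket layout are assumptions about the payload shape, not the team's actual design.

```python
import boto3
import requests

def finish_job(job: dict, transcript: str) -> None:
    """Save the transcript to cloud storage and notify the front-end."""
    # Name the transcript after the source audio file (layout is illustrative).
    key = job["audio_uri"].rsplit("/", 1)[-1] + ".txt"
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket="speechlab-transcripts",
        Key=key,
        Body=transcript.encode("utf-8"),
    )

    # Callback: POST to a URL the front-end supplied in the job payload,
    # so it knows the job has completed and where the transcript lives.
    requests.post(
        job["callback_url"],
        json={"status": "done", "transcript_key": key},
        timeout=10,
    )
```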

Each time a job completes a run, it frees up its resources. Deploying this architecture on the same production cluster that co-hosts other applications enables us to optimize the use of hardware resources.

By following industry standards for deploying containerized applications, Speech Lab Offline minimizes the cost of running workers 24/7.
