Building Platforms and Products at AI Singapore
As one of the original four engineers at AI Singapore, Najib Ninaba has left a deep imprint on the way things are run here. His current role in leading Platforms and Product Engineering continues to provide ample opportunities to define the way projects are delivered in the future. All this and more in my conversation with him.
Below is a transcript of the conversation [*].
Hi Naj, great to have you with us today.
Hi Basil. Good to be here.
Naj, you play a very important role here in AI Singapore, but before we come to that, could you share with us your personal journey in the earlier part of your career?
Sure. After I finished my full-time national service, I joined a Linux company. I was always interested in doing Linux, even back in secondary school. I came across this book about the X Window system. Back then we were using Windows 3.1. I didn’t really like the way it looked and the way Microsoft was doing things, so I started to dabble in doing Linux. That was how I got into doing it all the way into my polytechnic. I felt quite confident doing Linux systems.
There was an opportunity after national service. I saw this Linux company in Singapore called eLinux Systems and I went in for an interview. Laurence Liew, who is our Director for AI Innovation in AI Singapore, was the hiring manager. He found that I was suited for a junior role and I joined as a junior system engineer. It turned out that eLinux Systems wasn’t just doing Linux systems for the office backend and things like that. Laurence was deep into high performance computing (HPC) in Singapore and he threw me into the deep end of doing high performance computing, and we went on to build some of the first generation clusters in Singapore. This was way back in early 2000 – 2001. So, some of the first-generation high performance clusters in NUS, NTU and several other institutes might have been built on those clusters.
Back in 2000 – 2001, the dot.com bubble burst and eLinux Systems was disbanded, but Laurence still valued the engineering team that was part of eLinux, so he brought us into Singapore Computer Systems (SCS), where we were still doing HPC contracts and things like that. Doing HPC was similar to how cloud systems are right now. We were managing a lot of systems, clusters and racks of systems working as one. The setting up back then was almost unmanageable I would say, because you are talking about one rack containing like thirty servers. I actually had to bring a boot disk server by server and it got very tiring and boring.
I went around looking for an open source Linux toolkit to manage high performance clusters and I came across this HPC provisioning toolkit called Rocks, which came out of the San Diego Supercomputer Center (SDSC). I played around with it and it worked really well. That was one of my first forays into becoming an open source project committer. We found the toolkit was lacking certain components, particularly the packaging of an HPC job scheduler, and Laurence encouraged me to contribute this work back to the SDSC team. They liked it well enough to get me to be part of their so-called core committer group. This was just three guys in the San Diego Supercomputer Center plus myself. They had thought that somebody out of California would join them, but never did they expect somebody from Singapore, from Bedok, to join and become a core committer, so that was fun. That became part of our software stack. It really drilled into me the importance of having a cohesive platform stack to be able to deploy such systems.
In Singapore, we began to do more and more of these HPC projects and within SCS there was an internal competition and we won an innovation award and we got some money. With that money, Laurence and myself stepped out of SCS and co-founded a start-up based on high performance computing called Scalable Systems. You see elements of it even now at AI Singapore where the software stack is built on top of open source and we deliver value on top of it. This was like a true start-up adventure – late nights of coding, working with a small engineering team. Myself, I had to do both project management and engineering management. We continued to work very closely with the San Diego Supercomputer Center folks. We went to the US several times, we went for a supercomputing conference, a big HPC conference back then. We even had a US presence there and the SDSC team supported us really really well.
By 2006, we caught the eye of another HPC focus group that came out of Platform Computing based in North America. They made an offer to acquire us. After looking into what the acquisition really meant for us, talking to our SDSC collaborators, we made the move. That was my first understanding of what it means for a start-up to be acquired. Leading to that acquisition, Laurence and I went up to Toronto in Canada and we met with the CEO and CTO and talked about things we would be doing. I became like an engineering manager within Platform Computing. So that was my first foray into engineering management.
Platform Computing back then was pretty big. It had offices in Europe as well as in Asia. Within Asia they had development teams in Beijing and Shanghai and they were looking at us in Singapore to be like the anchor point for businesses around Asia. They gave us a budget and we built up a very strong engineering team. One of the interesting things was how we actually interviewed people into the engineering team. We set up a half-day workshop and got people to come in and work with the team. That became the blueprint for how we are doing the AI Singapore AI Apprenticeship Programme where we invite interested folks to join in. So that blueprint came from way back then.
I left two years before Platform Computing got acquired by IBM. If I had stayed, it would have been the second time I experienced an acquisition. But before I left I made sure the team leads I was mentoring could step up and deliver. So that was also another thing that we’ve brought into AI Singapore. We understand the strong need to have really good mentorship. Even down to the junior engineers, the strong sense of mentoring, being team leads. Of the engineering team acquired by IBM, many moved on to some of the bigger tech companies out there – Red Hat, Google and HP.
I went out and did my own freelance consulting. Laurence stayed on with IBM and after a while he reached out to me again. He said, hey let’s do another start-up and we went on to do a data science and analytics company called Revolution Analytics Singapore. There was a Revolution Analytics in the US and they wanted to set up an office here. So, we tried to replicate the process, what we did in Platform Computing, building up the engineering within Revolution Analytics. That was where we worked with folks like William (Tjhi), a (team) head within AI Singapore.
Again, Revolution Analytics got acquired by Microsoft to be their advanced analytics arm. The team I was managing wasn’t as big as in Platform Computing, but it was fun and very talented. This was back in 2012, 2014. It was then I realised that data science and big data were on the rise. I knew there was something going on, so much momentum going on there. So, it was a short stint. I was there from about 2012 to 2014, two years, and after that, I stepped out to do my own consulting on the side. I also did some freelance work with the San Diego folks I used to work with.
Again, in 2016 Laurence reached out to me and told me he was stepping out of Revolution Analytics. At that point, he also realised that data science and big data and AI were starting to be something quite important here in Singapore. So we co-founded a new start-up called Real Analytics, focusing on data analytics and data engineering. We were really running lean – just a two-man operation, myself and Laurence. We did a lot of consulting work, even all the way in Malaysia, where Microsoft approached us to conduct training on their big data stack for Telekom Malaysia. So we actually spent some time in Malaysia doing consulting work.
Around 2017, we got a training contract with NUS SCALE (School of Continuing and Lifelong Education). They got us on board to deliver training on data analytics and data engineering and I taught a few courses around reproducible data science, data engineering and chatbots, and it was pretty fun. At that time, Laurence also put in motion the seeds of AI Singapore. AI Singapore, as you know, got started in 2017. Laurence reached out to me and I came on board in January 2018.
A very exciting journey spanning both hardware and software, across Asia and North America, working in both start-up culture as well as in big organisations, in engineering and in management, delivering code and delivering training. So, with this vast experience that you are bringing into this current chapter here in AI Singapore, I think it’s going to be equally exciting. I had the good fortune of joining AI Singapore late in 2018. What was it like, even before that?
Right, so I joined in January 2018. Then it was lean. I was one of the original four engineers within AI Singapore. We started with four engineers – myself, William, Jean and Maurice. And Ken, the originator for TagUI, joined soon after. It was pretty challenging. Doing AI was new to our customers and collaborators. For us, the engineering team, the tooling was still very rough, I must say. The machine learning problem statements we saw back then were, to me, still data science in disguise, not so much on machine learning, but we did have some project sponsors with really innovative use cases.
We were four engineers, Ken was focusing on TagUI, and Laurence realised something very early: to work on one hundred projects, we really needed good AI-trained engineers, especially Singaporean engineers because some of the project sponsors require them. If we could not find them, we would go and groom and train these folks. So that was how AIAP came about. And because we had experience building engineering teams, we brought those principles in. Like how to attract, identify and hire talent.
One of the things I want to highlight is that back in the days, AI Singapore wasn’t where it is now. Now we are situated in Innovation 4.0, but before that we were in NUS UTown in the same building as where NRF was. Then when Innovation 4.0 was ready we moved there and we really then had the space to get apprentices in, to house them and train with them. When we did the first batch, it was only thirteen folks, if I’m not wrong. This was a learning experience on both sides, both for the AI mentors, who were the four of us, and the AIAP folks who joined. They sat with us within our office and we had to juggle both mentoring the folks solving the problem statements, as well as continuing innovating on our things, so that was really challenging, but it went well and we had some really good and supportive project sponsors with us on this journey. I can say that we really started to scale once we started doing AIAP batches 2 and 3. Basil, you came out from batch 2, right? I’m sure you can appreciate how back then we were a bit rough. I think only after some of the batch 2 folks started to join us, that was really when there was some kind of momentum and we went very fast to where we’re at now. So, I think that was quite amazing, we had that thing going on for us.
I want to highlight the infrastructure used to support this. Back then when it was just the four of us, it was really out of our laptops. We did have some workstations and servers, but we pretty much utilised everything under Microsoft Azure as our cloud infra. But we realised, doing deep learning, that it was not going to be very cost-effective to continue doing this. That was when we went ahead and built up our own on-premise infrastructure. Again, building on my past experience of building HPC clusters, we put up a tender and, together with my platforms team, we set up this AI HPC cluster that is running even now. And then we actually have a mix of infrastructure, both on the cloud, as well as on-premise. We commissioned that cluster back in 2019.
Yeah, as you mentioned, I was part of the journey from batch 2 onwards, so I certainly witnessed this taking-off phase. I was aware of some of the challenges that came along. Could you maybe go a little deeper and share with our listeners?
Sure. Some things never change. Back then, some of the 100E problem statements tended to be more like data science, some of them going towards deep learning, doing CV applications … but the lack of data back then and even now is still a problem. Project sponsors came in wanting to do machine learning projects with us, but then they realised – and we realised – that, you know, there wasn’t enough data. That was a challenge. I think oftentimes our project teams had to wait or simulate or build up their own data set until the project sponsor provided that data set. More often it was really the lack of data stopping us. So that was one big challenge.
Another thing is, as we moved from batch to batch and more sophisticated problem statements came about, we started going into more deep learning projects. They require more resources around GPUs. This was what led to building up our on-premise AI cluster to support them. So, even now, when we actually have quite a fair bit of on-premise infrastructure to do these types of things, we still do not have enough accelerators to run these types of projects, so that’s something on the roadmap. We actually want to acquire more hardware accelerators to run these types of projects.
We also see that the old way of doing things, at least from a tech stack point of view, is evolving. Back then, you had data sitting in a server, your typical big data setup. But for AI you need really lots of data and processing power to develop your end models. So, the old ways of doing clusters and doing servers, I won’t say they are not relevant, but there are more ways of doing infrastructure that have come up, particularly around Kubernetes, containers and container orchestration. So there’s a lot of this modern infrastructure. In fact, there’s too much of this infrastructure tooling now, you’re spoilt for choice. This is where, over at our platforms team, we come up with our own stack and we regularly get consulted on the infrastructure and deployment stuff for each of these projects. Oftentimes, we say, okay go for this type of default stack and go for these types of well-proven techniques of deploying your models …
They are becoming more informed …
Yes, project sponsors these days are getting more cloud infra-savvy also, which is a good thing. If you remember back then when I set up the cluster, it was by hand … but now, with the advent of cloud infrastructure, it can also be a bit of a challenge. Your data is here, your compute is there, how do you manage all these sorts of things, but now we have all this modern infra tooling, it makes things a little easier.
As an insider, I’m also aware of some of the in-house platforms that we are building – like Kapitan Workspaces, Kapitan Scout and Kapitan Intuition. What are the problems that they are intending to solve?
As we worked with several project teams, we found that it was getting a bit more challenging to support them because some of them were running different sets of tooling for different projects. And with the AIAP, we don’t provide laptops, so apprentices had to come with their own – bring your own device – but they still access our infrastructure, whether on-premise or on the cloud. We recognised the fact that there needs to be consistent tooling and interface to make sure they are able to work on their project statements regardless of the state of their hardware – no hardware discrimination. The thought of having a consistent development environment actually spoke to us really really well. Back then, doing HPC, when researchers worked on these clusters, what they typically did was to SSH into the server, then they had their own directory where they were able to install stuff. So, we brought that idea in and that led to what we call Kapitan Workspaces, where we provide consistent development environments to the users and all they need to do is first access our infrastructure, and beyond that they just need an SSH client as well as a modern browser. Once they have those, they are able to just log in, spin up Kapitan Workspaces and provision Jupyter Notebook or JupyterLab. They are able to use Visual Studio Code in the browser and do their work. All of this infrastructure is being backed by our powerful AI cluster. So, even with a Chromebook, so long as you have a modern browser, so long as you have SSH, so long as you have the VPN to access our stuff, you are good to go. And we took it a step further, at least for VS Code users. They have this plugin called Remote - SSH. They can still continue to use their Visual Studio Code editor, but they are able to just SSH into our infrastructure and leverage our powerful infrastructure backend, while still utilising their local setup. That’s basically Workspaces, providing consistent tooling and access to our cluster.
For Kapitan Scout, it came about when we saw that between different projects, there were some issues in terms of how we go about deploying the models. As you know, our AIAPs come from various backgrounds. Some may not necessarily have a computer science background or computer engineering background, but they are really strong in machine learning, statistics … and then we have on the other side, folks really strong in computer science but not in statistics … Kapitan Scout is basically our stack for … once your model has been trained, bring it into Kapitan Scout, which is powered by technologies such as Seldon Core, Grafana, Prometheus … it provides a consistent interface to bring your model in, like a consistent API endpoint, where you can also check the health of that model, and you can also do some deployment strategies like A/B testing and multi-armed bandit scenarios in a consistent manner. So long as you have the necessary CI/CD snippets within your tooling of choice, once you have copied that snippet in and you put your models in a well-known manner, the backend will call Scout to package the model as a Docker container and deploy it to an API endpoint, and then you are able to monitor the metrics, not just model metrics, but also the operational metrics like how many requests are coming in. That’s what Scout aims to solve, to help our project teams deploy their models more easily and robustly.
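To make the multi-armed bandit deployment strategy mentioned above a little more concrete, here is a minimal sketch in plain Python. It is not how Kapitan Scout or Seldon Core implement routing; the class `EpsilonGreedyRouter`, the helper `simulate`, the model names and the 0/1 reward signal are all hypothetical, chosen only to illustrate the idea of gradually sending more traffic to the model version that performs better.

```python
import random


class EpsilonGreedyRouter:
    """Route requests between model versions with an epsilon-greedy bandit.

    Hypothetical illustration only; not the Kapitan Scout or Seldon Core API.
    """

    def __init__(self, model_names, epsilon=0.1, seed=None):
        if not model_names:
            raise ValueError("at least one model is required")
        self.epsilon = epsilon
        self.counts = {name: 0 for name in model_names}
        self.mean_reward = {name: 0.0 for name in model_names}
        self._rng = random.Random(seed)

    def choose(self):
        # Explore a random model with probability epsilon,
        # otherwise exploit the model with the best observed mean reward.
        if self._rng.random() < self.epsilon:
            return self._rng.choice(list(self.counts))
        return max(self.mean_reward, key=self.mean_reward.get)

    def update(self, name, reward):
        # Incremental update of the running mean reward for this model.
        self.counts[name] += 1
        n = self.counts[name]
        self.mean_reward[name] += (reward - self.mean_reward[name]) / n


def simulate(true_rates, rounds=5000, epsilon=0.1, seed=0):
    """Simulate traffic where each model 'succeeds' at an assumed rate."""
    router = EpsilonGreedyRouter(list(true_rates), epsilon=epsilon, seed=seed)
    feedback_rng = random.Random(seed + 1)
    for _ in range(rounds):
        name = router.choose()
        reward = 1.0 if feedback_rng.random() < true_rates[name] else 0.0
        router.update(name, reward)
    return router


if __name__ == "__main__":
    result = simulate({"model_a": 0.2, "model_b": 0.8})
    print(result.counts)
```

In a plain A/B test the traffic split stays fixed; in this sketch the split shifts towards the better-performing model as feedback comes in, while the `epsilon` share of requests keeps testing the alternatives.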
Kapitan Intuition is internal tooling, more for our AI infrastructure. Because we have a lot of infrastructure – we have Microsoft Azure, our on-premise infra – we needed a way to monitor all of it in a consistent manner and, because we are AI Singapore, we want to be able to use AI to help manage and operate our stuff. Intuition aims to solve the problem where we start to do things around predictive analytics, around which servers are going down. Before they go down, can you give me a window, based on the past historical context, of which services are likely to go down? So, that’s what Intuition aims to do …
Like a predictive maintenance problem …
Right, we want to evolve to a point where we can do preventive maintenance, where before the things even go down, we highlight it to the operators … so Intuition right now is being built for our internal infrastructure, but we’re also building our own data centre together with NUS and NSCC, so we are hoping to actually bring Kapitan Intuition to manage from a data centre context, so that’s gonna be pretty exciting. Definitely, we’re looking for folks to help us along in the journey.
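As a rough illustration of the "give me a window before it goes down" idea, the sketch below fits a straight line to past readings of a single metric and estimates how many hours remain before it crosses a limit. This is an assumption-laden toy, not how Kapitan Intuition works: the function names `hours_until_threshold` and `should_alert`, the disk-usage metric and the thresholds are all hypothetical, and a real system would likely use richer models and many more signals.

```python
def hours_until_threshold(samples, threshold):
    """Estimate hours from the last sample until a metric reaches `threshold`.

    `samples` is a list of (hour, value) pairs, e.g. disk usage in percent.
    A least-squares line is fitted and extrapolated. Returns None when the
    trend is flat or decreasing, i.e. no crossing is predicted.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one point in time")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    if slope <= 0:
        return None
    crossing_hour = (threshold - intercept) / slope
    last_hour = samples[-1][0]
    # Clamp at zero: the fitted line may already be past the threshold.
    return max(0.0, crossing_hour - last_hour)


def should_alert(samples, threshold, window_hours):
    """Alert operators if the predicted crossing falls within the window."""
    eta = hours_until_threshold(samples, threshold)
    return eta is not None and eta <= window_hours


if __name__ == "__main__":
    # Hypothetical hourly disk usage readings (percent) for one server.
    disk_usage = [(0, 50.0), (1, 52.0), (2, 54.0), (3, 56.0)]
    print(hours_until_threshold(disk_usage, threshold=90.0))
    print(should_alert(disk_usage, threshold=90.0, window_hours=24))
```

The point of the sketch is only the shape of the problem: historical readings go in, and what comes out is a time window in which operators can act before the failure happens.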
Wow, so many things mentioned, but I think we still have one more, which is Brickworks Gallery. This one I think is pretty exciting, at least to me, because it talks about the actual projects that we have done. Could you tell us what it is about?
Sure. Brickworks Gallery came out from my product engineering team – I actually manage two things, one is platforms engineering and the other one is product engineering. Product engineering is really meant to build tooling to enable and accelerate the building of machine learning apps. We come out with best practices and various toolkits. One of the first projects that came out of this product engineering group is Brickworks Gallery. This came about because we have up to now seven batches of AIAP already and each batch worked on a set of 100E and internal projects. The AI engineer mentors who are attached move on from batch to batch, from project to project. We realised that there is knowledge that is not stored as they move from project to project – the papers they worked on, the techniques, the algorithms, the demos. Some of it we are recording because we have great project management where the assets of each project are being recorded, so we know what they are, but from an engineering team point of view there isn’t an easy way to figure out, hey have we done something around CV, doing object recognition … or have we done things around predictive maintenance … Previously these folks had to go ask around who did that in that project, and often it’s based on memory and you can forget about certain things, so we realised that’s a gap. What we did was we created an internal portal where we extract certain project information that engineers can see, and then we also provide an interface where the mentors themselves can update, in a systematic manner, things like the citation references, what papers they have read, what techniques they have used, and pointers to the GitLab repos of the project, demos and presentations and marketing they have done.
So, it’s like a knowledge capture kind of tool, right?
It’s a knowledge base essentially. We have basically catalogued every 100E project from batch one all the way to now and we have automated some of the processes. Once a project has ended, there is a backend process that extracts information back into the Brickworks Gallery and we just need the mentors to come in and fill in some missing gaps, but even that we are looking into ways to automate in the future. So, it’s quite a cool project and something that I believe is quite appreciated within the engineering team.
Ya, considering the number of projects that we have already delivered, this certainly justifies building up such a gallery of projects for reference. Well, at the risk of blowing our own trumpet at AI Singapore, I think this is all very great work. What else can we expect in the future?
I think there’s gonna be a lot more focus on MLOps and AI ethics. That’s gonna be a lot more things now, so we will be infusing a lot more MLOps engineering within our 100E projects. What that means is basically we want to engage our project sponsors earlier to figure out, do they actually have the resources once the project has ended and they have taken over, make sure they have the necessary systems in place, and if not, that’s when we can intervene and help them with that. And we are also doing new initiatives. Some of us, including myself, are part of the AI technical committee where we are driving new standards. A lot of this is driven by what we’ve seen so far in the past three years of doing 100E projects. We’ve seen a lot of problem statements, we’ve seen the challenges, so I’m hoping to bring that in in a more consistent manner, so that at least the AI committee as a whole is able to benefit.
We are also doing something that’s actually quite new, that’s coming out of product engineering, where we are also focusing on user experience best practices for machine learning. What that basically means is, we take a look at common things, like how do you do model training, how do you do data preparation in the workflow. We have various tooling, by open source projects, by commercial vendors, but when you start building a machine learning-infused application, you often need to build your own UI. But as I mentioned, our engineers and apprentices come from diverse backgrounds, so they may not have the necessary grounding in best practices around UX, so we are coming out with things like Brickworks Facade. Basically, it is a set of best practices and principles around UI design, around user experience for things like how do you do model training, how do you do model deployment, to make sure that the applications they deliver follow these types of human-first best practices in user experience.
In terms of platforms, as I mentioned, we are building our own data centre together with NSCC and NUS. We are getting more diverse hardware, more AI accelerators, so we will be beefing up our GPU count, to the delight of our project teams. We are also looking into other AI accelerators like TPUs, FPGAs, all the new processors coming out from Nvidia and other folks, Intel and AMD. We’ll be getting more of those types of hardware. We also have expanded our cloud infrastructure. Before that, we were using Microsoft Azure, and we are continuing to use Microsoft Azure, but now we also use GCP – Google Cloud Platform – so we now have two cloud infrastructures for our users. Two cloud infrastructures and expanded on-premise infrastructure are what the engineering team and our AI Singapore collaborators can expect.
So, from the experience that we have gained over the last three years or so working with different industry verticals on real-world problems, you’ve gotten to understand the actual needs when it comes to implementation and deployment – what works well, what doesn’t – and so we have responded with solutions to meet those needs. Thanks a lot for today’s sharing.
No problem. Happy to be here.
[*] This conversation was transcribed using Speech Lab. The transcript has been edited for length and clarity.