Technical Debt in Building Machine Learning Systems
Developing a machine learning (ML) system is not the same as software engineering, as I recently wrote. However, there is much to learn from software engineering's long history. This is hardly surprising: the ML engineer writes code and must own and maintain it. One concept worth borrowing from software engineering is technical debt. As with software systems, costs are incurred when ML systems are developed and deployed (often under tight schedules) without well-thought-out processes and tools for maintaining and updating them. A year after deployment, this often ends in disappointment with the ML project.
The term technical debt is, of course, a metaphor borrowed from finance. Like a financial debt, a technical debt must eventually be repaid. And as in finance, technical debt often results from a deliberate engineering decision to incur a future cost in order to achieve a more valuable goal (e.g. faster time-to-market). This is all well and good. Often, however, technical debt is incurred not deliberately but inadvertently (here the finance analogy breaks down). This is where things get messy. Studies of technical debt aim to draw attention to the development pathways that can lead to such undesirable outcomes.
As the world entered an AI spring in the early 2010s, the concept of technical debt in software engineering began to be applied to machine learning. Google researchers in particular published a series of papers at NIPS (since renamed NeurIPS) in 2014, 2015 and 2016 addressing this topic. The papers are worth reading in full; if pressed for time, consider the 6-min read here and the 18-min read here. An important takeaway is that deferring the repayment of such debts results in compounding costs. And the compounding happens silently.
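To make the compounding analogy concrete, here is a minimal sketch. The fixed per-cycle growth rate and the cost figures are invented purely for illustration; real technical debt does not grow on a neat schedule, which is exactly why its compounding goes unnoticed.

```python
def deferred_cost(principal: float, rate: float, periods: int) -> float:
    """Cost of repaying a debt after `periods` release cycles, assuming
    (for illustration only) it compounds at a fixed `rate` per cycle."""
    return principal * (1 + rate) ** periods

# A fix that costs 10 engineer-days if paid down today...
today = deferred_cost(10, 0.25, 0)   # 10.0
# ...costs roughly three times as much if deferred for five cycles.
later = deferred_cost(10, 0.25, 5)   # ~30.5
```

The point of the sketch is not the numbers but the shape of the curve: the cost of each individual deferral looks small, while the cumulative effect is exponential.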
The big challenge of technical debt is that it comes in many forms, and there is no single silver bullet that addresses them all. Often there is not even an objective way to measure it. The qualitative questions proposed by the Google researchers in their papers, which a development team should continually ask itself, help keep the topic near to mind as the race to develop and deploy complex ML systems tends to crowd out concerns of lesser immediacy.