Leveraging Kedro in 100E
Introduction
At AI Singapore, projects under the 100 Experiments (100E) programme delivering machine learning applications are usually staffed by engineers of different academic backgrounds and varying levels of experience. As a project technical lead and mentor to apprentice engineers within the programme, my first priority in each project is often to establish a foundation of practices that would both form an efficient, unified development workflow and provide opportunities to impart some engineering wisdom to the junior engineers. In the never-ending journey to address these challenges, I have found Kedro to be a useful tool, serving as a step-by-step guide for developing machine learning products, instilling sound software engineering principles, and ensuring production-level code.
The following is a summary of how I utilised Kedro in my 100E projects. For a full rundown of its features or instructions on how to set up and configure Kedro for your projects, refer to the official documentation.
What is Kedro?
Kedro is a Python library created by the boffins at QuantumBlack. It is an opinionated development workflow framework for building machine learning applications, with a strong focus on helping developers write production code. It offers a standardised approach to collaborative development through a project template for structuring work, as well as its own paradigm for data management and parameter configuration. It also comes with a command-line interface to automate the execution of unit tests, generation of documentation, packaging into libraries, and other processes. This suite of features and its intended workflow encapsulate what its creators consider to be the ideal approach to machine learning projects; Kedro is the realisation of that approach.
The Kedro Workflow
Workflows differ from team to team. A team whose members come from diverse professional backgrounds will inevitably have some dissonance in how they approach their work. Those with an affinity for analysis may be inclined to devote more time to exploration and prefer a more fluid manner of problem solving. Those who enjoy building applications may prefer to work on a well-defined set of features and focus on productising. In building machine learning products, both approaches have their merits, and Kedro seeks to harmonise the two with its workflow. I found that members from both camps took to this workflow readily because it is simple and structured.
The Kedro workflow can be summarised into three simple steps:
- Explore the data
- Build pipelines
- Package the project
1. Explore Data
Projects typically begin with exploration of the given data and experimentation with viable models. There are myriad ways to solve a problem in machine learning — they depend on the available data, the features derived from that data, and the compatibility of the model with those features; they also differ in complexity, reliability and execution.
The first step to building an application is to address this ambiguity through an iterative process of statistical analysis, hypothesis testing, knowledge discovery and rapid prototyping. This is generally done in Jupyter notebooks, since the preference is for a tight feedback loop and notebooks provide immediate visualisation of outputs. Kedro streamlines this process by providing a structured way to ingest and version data through its Data Catalog, and by integrating the catalog with Jupyter.
As the name implies, the Data Catalog is a record of data in the project. It provides a consistent structure to define where data comes from, how it is parsed and how it is saved. This allows data to be formalised as it undergoes each stage of transformation and offers a shared reference point for collaborators to access each of those versions, in order to perform analysis or further engineering.
Because all team members work on the same sets of data, they can focus on the objectives of data exploration: discovering insights to guide decision making, establishing useful feature engineering processes and assessing prototype models. This consistency also keeps individual work easy to synchronise, since everyone references the same Data Catalog when reading and writing data.
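To illustrate, here is a minimal sketch of a Data Catalog defined programmatically. The dataset names and file paths are hypothetical; in a real project the same entries would normally be declared in conf/base/catalog.yml, and the import path for CSVDataSet varies between Kedro versions.

```python
# Minimal sketch of a programmatic Data Catalog (names and paths hypothetical).
from kedro.io import DataCatalog
from kedro.extras.datasets.pandas import CSVDataSet  # import path varies by Kedro version

catalog = DataCatalog(
    {
        "raw_reviews": CSVDataSet(filepath="data/01_raw/reviews.csv"),
        "clean_reviews": CSVDataSet(filepath="data/02_intermediate/clean_reviews.csv"),
    }
)

# Everyone on the team reads and writes through the same named entries,
# rather than hard-coding file paths in individual notebooks.
raw_reviews = catalog.load("raw_reviews")
catalog.save("clean_reviews", raw_reviews.dropna())
```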
2. Build Pipelines
Data exploration is meant to be rapid and rough — code written at this stage usually spawns from experiments and rarely survives the rigours of quality control. Hence, what follows exploration is a step to selectively refine the processes that have proven useful and ought to be implemented in the final product, and to formalise them as modules and scripts. Kedro defines its own structure for these processes, consisting of Nodes and Pipelines.
Nodes are the building blocks of Pipelines and represent tasks in a process. Each one consists of a function together with named inputs and outputs. Pipelines are simply series of Nodes that operate on data sequentially, and all data engineering processes should be consolidated into distinct Pipelines. This structure allows dependencies between functions to be resolved automatically, so that more time can be spent on other aspects of refining code: ensuring it is reliable, maintainable and readable, and writing unit tests and documentation.
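As a rough sketch, a small pipeline might look like the following. The function, dataset and column names are hypothetical; the input and output names refer to entries in the Data Catalog, which is how Kedro works out the order in which Nodes run.

```python
# Sketch of a small pipeline (function, dataset and column names hypothetical).
from kedro.pipeline import Pipeline, node


def clean_reviews(raw_reviews):
    """Drop incomplete records."""
    return raw_reviews.dropna()


def extract_features(clean_reviews):
    """Derive a simple feature from the cleaned data."""
    clean_reviews = clean_reviews.copy()
    clean_reviews["review_length"] = clean_reviews["text"].str.len()
    return clean_reviews


pipeline = Pipeline(
    [
        node(clean_reviews, inputs="raw_reviews", outputs="clean_reviews"),
        node(extract_features, inputs="clean_reviews", outputs="review_features"),
    ]
)
```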
The pipeline building process is iterative and forms the bulk of development work. At any point during exploration, when a process is deemed useful enough to be reused or to contribute to the finished product, there should be a concerted effort to implement the experimental code as a pipeline. This ensures quality by preventing dependence on non-production code, accelerates collaboration because only reliable code is shared and used, and forms a practice of steadily contributing features to the final product. This cycle of exploration and refinement is repeated until the product is fully formed.
3. Package the Project
The output of a complete project is a repository of source code. While it may have met the technical objectives, it is not exactly suited for general use. The final stage of the workflow is to bundle the project into a Python package that can either be delivered as a user-friendly library or integrated into an application framework to be deployed and served. Kedro has built-in automation for packaging; as long as development complies with the Kedro workflow and structure, very little tinkering is required at this stage. Kedro can also generate API documentation automatically, and extensions allow projects to be containerised and shipped.
Kedro Standards
Kedro’s features and its workflow were intended to establish software engineering best practices: the practices deemed essential to enhancing collaboration and ensuring production-level code.
The following are the most significant practices from my experience:
- Data Abstraction & Versioning
- Modularisation
- Test-Driven Development
- Configuration Management
1. Data Abstraction & Versioning
Data often undergoes a variety of transformations during experimentation. Each data set derived from these transformations may serve its own purpose and may need to be accessed at different times or by different people. Furthermore, different members of the team may perform their own independent engineering and analysis, and will therefore produce their own versions of the data. The aforementioned Data Catalog offers a means to selectively formalise each of these versions and consolidate them into a central document that collaborators can reference and contribute to. With it, data can be shared and reproduced by different team members, simplifying organisation and ensuring there is a cohesive system for tracking data.
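As a sketch, enabling versioning for a dataset might look like the following. In a project this is usually just a matter of adding versioned: true to the catalog entry; the dataset name and path here are hypothetical, and import paths vary between Kedro versions.

```python
# Sketch of dataset versioning (names and paths hypothetical).
from kedro.io import DataCatalog, Version
from kedro.extras.datasets.pandas import CSVDataSet  # import path varies by Kedro version

catalog = DataCatalog(
    {
        "review_features": CSVDataSet(
            filepath="data/04_feature/review_features.csv",
            # Version(load=None, save=None) lets Kedro load the latest version
            # and timestamp each save automatically.
            version=Version(None, None),
        )
    }
)
```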
2. Modularisation
Modularisation is a staple of software engineering. It encompasses the deconstruction of code into more independent units, reducing complexity and improving maintainability. Kedro encourages modularisation by imposing the use of its Node and Pipeline structure. Nodes are required to be defined as pure functions with specified inputs and outputs. When they are combined into a Pipeline, data is passed easily from one Node to the next. In particular, this prevents the construction of ‘god functions’, which are common in experimental code.
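The contrast might look something like this hypothetical illustration, where a single function that loads, transforms and saves data is broken into pure functions that fit naturally into Nodes.

```python
# Hypothetical illustration: a 'god function' versus pure node functions.
import pandas as pd


def process_everything():
    # Hard to test or reuse: mixes file I/O, cleaning and feature engineering.
    df = pd.read_csv("data/01_raw/reviews.csv")
    df = df.dropna()
    df["review_length"] = df["text"].str.len()
    df.to_csv("data/04_feature/review_features.csv")


def clean_reviews(raw_reviews: pd.DataFrame) -> pd.DataFrame:
    """Pure function: no file paths, no hidden state."""
    return raw_reviews.dropna()


def add_review_length(clean_reviews: pd.DataFrame) -> pd.DataFrame:
    """Pure function: I/O is left to the Data Catalog."""
    clean_reviews = clean_reviews.copy()
    clean_reviews["review_length"] = clean_reviews["text"].str.len()
    return clean_reviews
```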
3. Test-Driven Development
Test-driven development (TDD) establishes a feedback loop that ensures an application’s functional needs are met, promotes the development of high-quality code and reduces operational risk through proactive testing. Pure TDD is difficult to achieve when building machine learning applications, because it requires software requirements to be defined before features are built, whereas requirements in machine learning projects are typically nebulous and morph throughout development. Regardless, Kedro comes with pytest built in, and its project template is structured so that tests can be written and picked up by the test runner.
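A test for a node function might look like the following minimal sketch. The module path and function are hypothetical, and in the Kedro template such tests live under the src/tests directory, where pytest will discover them.

```python
# Minimal pytest sketch for a node function (module path and names hypothetical).
import pandas as pd

from my_project.pipelines.data_engineering.nodes import clean_reviews


def test_clean_reviews_drops_incomplete_records():
    raw = pd.DataFrame({"text": ["great product", None, "not bad"]})
    cleaned = clean_reviews(raw)
    assert cleaned["text"].isna().sum() == 0
    assert len(cleaned) == 2
```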
4. Configuration Management
Through the project template and automated commands, Kedro also provides the means to manage the configuration of experiment parameters, logging and credentials. Not only does this abstract away the tedium of redefining each variable every time it changes, it also enables private information to be kept out of the shared repository. This is a crucial security measure when developing code that will eventually be deployed, and Kedro ensures that such principles are implemented from the get-go.
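As a sketch, a node can consume parameters by listing them among its inputs with the params: prefix. The parameter and dataset names below are hypothetical; the values themselves would live in conf/base/parameters.yml, while credentials stay in conf/local, which the project template keeps out of version control.

```python
# Sketch of a node consuming parameters (parameter and dataset names hypothetical).
from kedro.pipeline import Pipeline, node


def train_model(review_features, model_options):
    # model_options arrives as a plain dict read from parameters.yml,
    # e.g. {"test_size": 0.2, "random_state": 42}
    ...


training_pipeline = Pipeline(
    [
        node(
            train_model,
            inputs=["review_features", "params:model_options"],
            outputs="trained_model",
        )
    ]
)
```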
Summary
Overall, I found Kedro to be a useful tool for developing machine learning applications, especially for facilitating collaboration and establishing good development practices. The various ways in which it automates and abstracts away the usually tedious organisational tasks enable the team to concentrate on more creative work. Using Kedro made building features straightforward, as it ensured that every time an experiment yielded a promising snippet, it was assimilated into the codebase. I feel Kedro is fully capable of achieving its goal of helping developers produce production-level code, since its multitude of features ensures that code is tested, refined and reliably integrated.
There is one drawback, however: Kedro’s Node and Pipeline structure must be strictly adhered to. This presents a problem when writing code that is intended to fit a specification that deviates from this pattern. For instance, deep learning projects that utilise other libraries come with their own frameworks, and those take precedence over Kedro’s data engineering framework. Considerable time must then be spent on ensuring any shared objects can be passed between the two paradigms. Fortunately, it is not too difficult to extract code from Kedro’s structure and implement your own. Kedro is a development workflow framework, after all, so once the majority of development work is done, the features can be refactored to fit other frameworks. That way, Kedro’s convenient offerings can still be exploited. Regardless, I would still rely on Kedro as a helpful guide to building production-level machine learning models.