Managing Data and Infrastructure in AI Projects: Key Challenges and Solutions
In a previous article, I discussed the challenges of managing project scope in AI projects. In this article, I share some of the key challenges faced in the areas of data and development environment, and how the project managers and project teams addressed them.
DATA
Data is of great significance to AI/ML projects. It is not possible to train an ML model to achieve acceptable results without the right quantity and quality of data. In this regard, a main challenge faced by the project team is the delay in the provision of quality data by the project sponsor, which is sometimes made worse by the non-availability of quality data.
Data preparation is time-consuming, and the project sponsor has to invest considerable time and effort to curate data for model training. For some projects, the data needs to be scraped from public domain websites, which requires the project sponsor to build a web scraper to extract it. Most project sponsors do not have the data required for model development ready at the start of the project. There were cases where the data was made fully available only around the midpoint of the project. This delays the Exploratory Data Analysis (EDA) planned for the early stage of the project. EDA is meant to help the project team uncover data quality issues early and give feedback to the project sponsor, who can then take steps to improve the data quality.
Challenges faced
- Insufficient data for model training
- Poor data quality
- Imbalanced dataset
- Incorrect labelling/annotation
- Full dataset is available only around midway of the project
- Synthetic data provided instead of actual data due to regulatory requirements
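Several of the issues above (missing values, duplicate records, imbalanced labels) can be surfaced by a quick check during EDA. The following is a minimal sketch, not the project team's actual tooling. The function name `data_quality_report`, the record layout (a list of dicts with hashable values), and the `imbalance_ratio` threshold are all assumptions made for illustration.

```python
from collections import Counter


def data_quality_report(rows, label_key, imbalance_ratio=5.0):
    """Summarise basic quality issues in a list of dict records.

    Hypothetical helper: assumes every value in a record is hashable
    (e.g. str, int, float, None).
    """
    # Count missing (None or empty-string) values per field.
    missing = Counter()
    for row in rows:
        for key, value in row.items():
            if value is None or value == "":
                missing[key] += 1

    # Count exact duplicate records (same keys and same values).
    seen = Counter(
        tuple(sorted(row.items(), key=lambda item: item[0])) for row in rows
    )
    duplicates = sum(count - 1 for count in seen.values())

    # Label distribution, ignoring records whose label is missing.
    labels = Counter(
        row[label_key] for row in rows if row.get(label_key) not in (None, "")
    )

    # Flag imbalance when the largest class outnumbers the smallest
    # by at least imbalance_ratio (an arbitrary, adjustable threshold).
    imbalanced = False
    if labels:
        imbalanced = max(labels.values()) / min(labels.values()) >= imbalance_ratio

    return {
        "rows": len(rows),
        "missing": dict(missing),
        "duplicates": duplicates,
        "label_counts": dict(labels),
        "imbalanced": imbalanced,
    }
```

A report like this, run as soon as the first batch of data arrives, gives the project team something concrete to send back to the project sponsor. It does not detect incorrect labels, which still need manual review.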
How we dealt with data issues
- Reiterated the need for data of sufficient quantity and quality (for model accuracy): the presales team did so during the initial proposal review, and the project team did so during the project kick-off meeting and subsequent sprint review meetings.
- Employed relevant pretrained models in cases where the quantity of training data was insufficient.
- Provided templates with samples showing how the data has to be labelled and the quantity of labels required.
- Arranged meetings with the labelling team to understand the manual labelling process and communicated the labelling requirements to the labellers.
- Shared open-source annotation tools and organized briefing sessions with the labelling team on how the data has to be annotated for better model performance.
- For projects where the project sponsor could provide only synthetic data due to regulatory requirements, the project team worked with the project sponsor to ensure that the synthetic data closely matched and was representative of the actual data. In addition, the model trained on the synthetic data had to be retrained on good-quality actual data and fine-tuned further to get better results.
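The article does not say how representativeness of the synthetic data was verified. One possible check, sketched below under that caveat, is to compare each numeric column of the synthetic data against the actual data using the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical distributions). Since the actual data cannot leave the sponsor's network, such a check would have to be run by the project sponsor or inside their environment. The function names and the 0.2 threshold are assumptions chosen for illustration.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest absolute gap between the two empirical CDFs:
    0.0 means identical distributions, 1.0 means no overlap at all.
    """
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    # The empirical CDFs only change at observed values, so it is
    # enough to evaluate the gap at every value seen in either sample.
    for x in set(a) | set(b):
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


def flag_drifting_columns(actual, synthetic, threshold=0.2):
    """Return {column: statistic} for shared columns whose gap exceeds threshold.

    `actual` and `synthetic` map column names to lists of numbers.
    The threshold is an arbitrary starting point, not a recommendation.
    """
    flagged = {}
    for column in actual.keys() & synthetic.keys():
        statistic = ks_statistic(actual[column], synthetic[column])
        if statistic > threshold:
            flagged[column] = statistic
    return flagged
```

A column-by-column comparison like this says nothing about relationships between columns, so it complements rather than replaces retraining and fine-tuning on the actual data.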
DEVELOPMENT ENVIRONMENT
Developing an ML model requires a development environment with all the required hardware (e.g. GPU, memory, storage), open-source ML libraries (e.g. TensorFlow, PyTorch), experiment tracking tools (e.g. MLflow, Weights & Biases) and other resources for training and building the models. In AISG, the Platforms team has provisioned a comprehensive development environment which enables smooth development and deployment of the projects. The environment has evolved based on the requirements of past projects.
However, there were projects where the project sponsor expected the project team to use the sponsor organisation’s own infrastructure and development environment for building the ML models. This was mainly due to regulations and restrictions on data handling and management, i.e. the data required for model development had to stay within the corporate network of the organisation. In such cases, the project team had to access the environment remotely and faced additional challenges from its unexpected and unpredictable limitations, which slowed development.
Challenges faced
- Time and effort taken to apply for the remote access accounts
- Restricted access to the Internet to download the required ML libraries
- Lengthy approval process which may take 2-3 weeks for installing any new ML libraries and Python packages which are not available in the software repository of the organisation
- Software installations in the development environment were wiped out when the developer desktop was refreshed once every 3 weeks
How we dealt with development environment issues
- Raised requests for new ML libraries early so that the libraries were available when needed for model development and testing.
- To lessen the effect of delays in getting approval to install the required ML libraries, the project team ran experiments for some of the models in the AISG development environment using synthetic data, and subsequently transferred the model code to the project sponsor’s environment to test the model with the actual data.
- Moved some of the software installations to persistent folders to minimize the time spent reinstalling wiped-out software.
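Two of the measures above lend themselves to a small script: finding out early which required packages are absent from the sponsor's environment, and making packages installed in a persistent folder importable after a desktop refresh. The sketch below uses only the Python standard library. The function names and any folder path are hypothetical, and whether installing into a user-controlled folder is permitted depends on the sponsor's policies.

```python
import sys
from importlib import metadata
from pathlib import Path


def missing_packages(required):
    """Return the names from `required` that are not installed here.

    The result can serve as the list of libraries to request approval for.
    """
    missing = []
    for name in required:
        try:
            metadata.version(name)  # raises if the distribution is absent
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


def use_persistent_packages(folder):
    """Make packages in a persistent folder importable.

    Returns False if the folder does not exist; otherwise puts it at the
    front of sys.path (if not already there) and returns True.
    """
    path = Path(folder).expanduser()
    if not path.is_dir():
        return False
    if str(path) not in sys.path:
        sys.path.insert(0, str(path))
    return True
```

Assuming pip is available and approved, packages can be placed in such a folder with `pip install --target <folder> <package>`, so that only the `use_persistent_packages` call has to be repeated after each refresh.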
CONCLUSION
As the field of AI continues to evolve, these lessons in managing data and development environment challenges will remain useful guidance for future AI project managers.