MLOps Transformation: Moving from Stage 0 to Stage 3 (Part I)

As much a cultural shift as a technical one.

Find out how we improved our Machine Learning Operations (MLOps) maturity from stage 0 to 3 with our in-house Skills Extraction Model.
Image Source: https://ml-ops.org/content/mlops-principles

Skills Extraction Algorithm

The Skills Extraction Algorithm (SEA), developed by the Research and Development Pillar of SkillsFuture Singapore’s (SSG) National Jobs-Skills Data Office (NJSDO), utilizes machine learning to identify skills in a given course or job posting based on SSG’s established skills taxonomy.

This skills extraction is critical for aligning workforce capabilities with evolving industry demands. By automating the identification of standardized skills from job postings and course descriptions, SEA enhances the skills identification process using a common skills language. This supports more targeted identification of skill gaps, and enables downstream skills recognition, curriculum development, and strategic workforce planning, just to name a few.

The need for Machine Learning Operations (MLOps)

Without MLOps, teams struggle to keep models up to date as skill sets and job requirements rapidly evolve. The manual process of re-training, re-evaluating, and re-deploying models hampers agility, while a lack of automated workflows and monitoring makes it difficult to maintain a healthy, high-performing model in production.

In most machine learning projects that make it to production, a few points need to be considered:

  1. Model updates are never a one-time task. Models need to be updated regularly for many reasons: new data inputs, changes in business definitions, or performance degradation over time.
  2. Lack of model monitoring often leads to poor follow-up. Without proper monitoring in place, it’s difficult to detect when models degrade, making timely updates nearly impossible. This also leads to a lack of trust and transparency.
  3. Model logging and feedback loops are often missing. Closely related to monitoring, these are crucial for reproducibility, debugging, and reliability. Logging provides records, while feedback-loop data enables continuous model improvement through retraining.
  4. Lack of an automated workflow to ensure there is always a working model deployed. This often happens because engineers and data scientists work in silos, with no clear accountability for who is responsible for keeping the model healthy.
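To make points 3 and 4 concrete, here is a minimal sketch of structured prediction logging with a feedback hook. This is an illustrative example, not our actual implementation: the field names, the in-memory `log_store`, and the helper functions are all hypothetical stand-ins for a real logging backend.

```python
import uuid
from datetime import datetime, timezone

def log_prediction(log_store, model_version, text, skills):
    """Append a structured prediction record; returns an ID the feedback loop can reference."""
    record_id = str(uuid.uuid4())
    log_store.append({
        "id": record_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "input_text": text,
        "predicted_skills": skills,
        "feedback": None,  # filled in later by the feedback loop
    })
    return record_id

def record_feedback(log_store, record_id, accepted, corrected_skills=None):
    """Attach reviewer feedback to a logged prediction; this data later feeds retraining."""
    for record in log_store:
        if record["id"] == record_id:
            record["feedback"] = {"accepted": accepted, "corrected_skills": corrected_skills}
            return record
    raise KeyError(record_id)

# Usage: log a prediction, then attach feedback once a reviewer checks it.
store = []
rid = log_prediction(store, "v1.2.0", "Build dashboards in Tableau", ["Data Visualisation"])
record_feedback(store, rid, accepted=True)
```

The key design point is that every prediction record carries the model version and a stable ID, so feedback collected later can be joined back to the exact model and input that produced it.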

In our case, the need to continually re-train and re-deploy models is even more critical to keep up with rapidly evolving skill sets and job requirements. Typically, each update requires a data scientist to manually re-train and re-evaluate the model before re-deploying it, and each step may be handled by a different person. As more feedback data comes in, this manual model updating and re-training process slows down agility, making it harder to implement changes and deploy updates.

As a result, the pace at which new models and updates are rolled out is restricted, preventing the system from adapting swiftly to the changing needs of employers and job seekers. In fact, updating a model can take anywhere from a day to a week, due to the need for re-extraction, pre-processing, training, evaluation, deployment, and stakeholder updates.

This inefficiency not only slows down decision making but also increases the risk of inconsistencies across model updates and of missing new skills in the skills extraction process. As industry demands shift rapidly, delays in model updates and deployment mean that the insights extracted could be outdated, reducing the effectiveness of the system.

Hence, to overcome these challenges, there is a critical need for a streamlined, automated approach, and this is where MLOps comes in.

What exactly is MLOps?

“Machine learning operations (MLOps) is a practice that streamlines the development and deployment of ML models and AI workflows.” — Microsoft

MLOps refers to the practices and tools designed to simplify and automate the deployment, monitoring, and management of machine learning models in production. It covers the entire machine learning lifecycle, from model development to deployment and maintenance, prioritizing collaboration, reproducibility, and scalability. This approach helps bridge the gap between machine learning development and operational deployment, fostering the creation of robust, scalable, and maintainable systems.

By providing a set of practices designed to automate and standardize the entire machine learning lifecycle, from model development to deployment, MLOps eliminates many of the manual processes that hinder agility. Through automation, MLOps allows organizations to quickly adapt to new data, ensure continuous model improvement, and maintain high-quality, reliable systems.

Stages of MLOps maturity and our goals

When adopting MLOps practices, especially in Whole of Government (WOG), there are helpful references that outline the stages of maturity in this journey. One example is the AI Practice MLOps playbook, which breaks down the MLOps journey into step-by-step stages, making it easier for teams to gradually scale up their MLOps practices, especially those with a relatively new engineering setup. Alternatively, for teams with existing software engineering practices in place, Microsoft’s MLOps Maturity Model, a recognized benchmark that outlines key principles and practices for advancing machine learning systems, can also be considered.

Before starting our MLOps journey, it was important to assess our stage at the time against a recognized benchmark to understand where we stood. We used both Microsoft’s MLOps Maturity Model and AI Practice’s MLOps playbook, which aligns well with our agency’s setup; both benchmarks placed our maturity at stage 0. At this stage, all processes were manual, lacking proper documentation (e.g., Confluence) and version control. Model re-training was also conducted manually, without performance tracking or experiment result logging.

We aimed high: reaching MLOps stage 3 would mean full automation with high-quality Continuous Integration (CI) and Continuous Deployment (CD), ensuring that our ML systems are efficient, resilient, and continuously improving while minimizing manual effort and operational risks. Specifically, we aimed to progress first to stage 1 (guided by the MLOps playbook) and, as we matured, adopt MLOps principles aligned with stage 3 and beyond.

Road to MLOps Maturity Stage 3

Our initial workflow lacked automation in tracking model performance, experiment results, and version control, leading to inefficiencies and siloed development processes. Collaboration between teams was minimal, and tasks like model retraining and performance monitoring were done manually, without proper tracking of key metrics.

Recognizing the challenges and aiming to improve, we set our sights on achieving MLOps maturity stage 3 and divided our objectives into two stages.

  • Stage 1: Focus here will be on model build and model deploy. Automate core processes such as model training and deployment (with human approvals for UAT and PROD), and implement basic CI/CD. ETL processes are required here to ensure data flows into the machine learning model development environment. This stage should bring us to a maturity level between 1 and 2.
  • Stage 2: Focus here will be on automated testing and performance monitoring. We will implement performance monitoring, feedback loops, and triggers for model updates and re-training. Additionally, there should be fully automated testing (e.g., unit testing of key functions, testing for out-of-bound values, individual components, etc.) and a proper test environment. This should bring us to a maturity of almost 3.

Note: In our case, full automation still includes manual approvals when promoting from DEV to user acceptance testing (UAT) and production (PROD). This ensures better control and safety in the deployment process, mitigating risks associated with automated deployment in a production environment.
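The stage-2 mechanics above can be sketched in a few lines: a retraining trigger that fires when a monitored metric degrades, plus a promotion step that always requires human approval for UAT and PROD. This is a simplified illustration under assumed names (`F1_THRESHOLD`, `needs_retraining`, `promote`), not our production pipeline code.

```python
# Illustrative sketch: monitored-metric retraining trigger and approval-gated promotion.
F1_THRESHOLD = 0.80  # hypothetical minimum acceptable F1 score on feedback data

def needs_retraining(recent_f1, threshold=F1_THRESHOLD):
    """Trigger re-training when the monitored performance drops below the threshold."""
    return recent_f1 < threshold

def promote(model_version, target_env, approved):
    """Promotions to UAT/PROD always require explicit human approval; DEV does not."""
    if target_env in ("UAT", "PROD") and not approved:
        return f"{model_version}: promotion to {target_env} blocked, awaiting approval"
    return f"{model_version}: deployed to {target_env}"

print(needs_retraining(0.72))                    # degraded performance -> True
print(promote("v1.3.0", "PROD", approved=False)) # blocked until a human approves
```

In a real pipeline these checks would live in CI/CD jobs (e.g., a GitLab pipeline with manual approval steps), but the control flow is the same: automation everywhere, with a deliberate human gate in front of UAT and PROD.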

Overview of our MLOps

An overview of our MLOps process at maturity stage 3 can be seen below. In this setup, we primarily leveraged the MAESTRO platform for our MLOps system, along with Amazon Redshift, GitLab, and Tableau for dashboarding. As of this writing, we have successfully achieved MLOps maturity stage 3. (The technical details of each component will be elaborated on in the next article.)

Note: MAESTRO (Machine Learning & Artificial Intelligence Enterprise-level Secure Tool Suite for Reliable Operations) is a centralized Whole-of-Government (WOG) data platform that provides a comprehensive range of tools and services, along with scalable compute resources.

Overview of MLOps Process

Model Build and Model Deploy should bring you somewhere between stage 1 and 2. Automated Testing and Model Monitoring will bring you to around stage 3.

Fast forward to today: reaching MLOps maturity stage 3 has had a significant impact on our workflow and overall efficiency. With the implementation of automated pipelines, performance monitoring, and model retraining triggers, we have streamlined the model development process, reducing the manual effort previously required. This has led to faster iterations (about 8 man-hours per release), more consistent deployments (same endpoint and processes), and improved collaboration across teams. Additionally, real-time monitoring allows us to proactively address any performance issues, ensuring that our models remain effective in production. This has not only enhanced the technical reliability of our models but has also fostered a culture of continuous improvement and alignment with organizational goals.

With all these changes, key differences are summarized below.

Concluding remarks

In this article, we introduced MLOps and its significance. By advancing from maturity stage 0 to stage 3, we’ve streamlined workflows and boosted automation, scalability, and efficiency. Enhanced model management, version control, and performance tracking have minimized manual effort and increased productivity.

In our next article, we will dive deeper into the technical specifics of each MLOps component, providing a detailed breakdown of the architecture, processes, and tools involved in achieving a fully automated, scalable, and efficient MLOps system. Stay tuned!

Acknowledgements

Special shoutout to Daryl Low and Eugene Chua from SSG for their contributions to SEA.

Special thanks to Leo Li, Raymond Harris, and Lois Ji from AI Practice for driving this MLOps adoption in SSG.

Special thanks to Victor Ong from AI Practice for his guidance and feedback.

Special thanks to Teng Fone from MAESTRO team for providing guidance to make all of this possible.